PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering

Sun, Hui; Zheng, Yingfeng; Xie, Haonan; Ma, Huidong; Liu, Xiaoguang; Wang, Gang

doi:10.1186/s12859-023-05566-9

Software
Open access
Published: 30 November 2023

PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering

Hui Sun¹,
Yingfeng Zheng¹,
Haonan Xie²,
Huidong Ma¹,
Xiaoguang Liu¹ &
…
Gang Wang¹

BMC Bioinformatics volume 24, Article number: 454 (2023) Cite this article

739 Accesses
1 Altmetric
Metrics details

Abstract

Background

Genomic sequencing reads compressors are essential for balancing high-throughput sequencing short reads generation speed, large-scale genomic data sharing, and infrastructure storage expenditure. However, most existing short reads compressors rarely utilize big-memory systems and duplicative information between diverse sequencing files to achieve a higher compression ratio for conserving reads data storage space.

Results

We employ compression ratio as the optimization objective and propose a large-scale genomic sequencing short reads data compression optimizer, named PMFFRC, through novelty memory modeling and redundant reads clustering technologies. By cascading PMFFRC, in 982 GB fastq format sequencing data, with 274 GB and 3.3 billion short reads, the state-of-the-art and reference-free compressors HARC, SPRING, Mstcom, and FastqCLS achieve 77.89%, 77.56%, 73.51%, and 29.36% average maximum compression ratio gains, respectively. PMFFRC saves 39.41%, 41.62%, 40.99%, and 20.19% of storage space sizes compared with the four unoptimized compressors.

Conclusions

PMFFRC rational usage big-memory of compression server, effectively saving the sequencing reads data storage space sizes, which relieves the basic storage facilities costs and community sharing transmitting overhead. Our work furnishes a novel solution for improving sequencing reads compression and saving storage space. The proposed PMFFRC algorithm is packaged in a same-name Linux toolkit, available un-limited at https://github.com/fahaihi/PMFFRC.

Peer Review reports

Background

With the prosperity of next-generation sequencing (NGS) technologies, genomic sequencing data holds the characteristics of large data volumes, multiple files, and prosperous species sources [1]. For example, until September 2023, through 829,822 sequencing times of 845,109 bio-samples, the bio-sequencing data compressed by the gzip (http://www.gzip.org) in the CNGB Sequence Archive (https://db.cngb.org/cnsa) was up to 11,920 terabytes (TB), if stored in Amazon cloud at $0.125 each gigabyte (GB), will cost $1,525,760 per year. Therefore, compressing large-scale genomic sequencing data is essential in reducing the storage infrastructure construction expenditure and data transmission consumption, especially for sequencing reads, which account for 42% of the whole genomic sequencing data [2, 3].

Different from ordinary textual data (such as big social data), genomic reads data has the following bioinformatic attributes. (i) The formatting of reads characters is uncomplicated, which consists of the alphabet Σ = {A, C, G, T, N} simply. (ii) The redundancy reads data is high, mainly relying on sequencing coverage of diverse platforms. (iii) Homologous species reads data retains heightened similitude. Recently, reference-based and reference-free dedicated genomic sequencing reads compressors have been proposed by taking full advantage of reads' bioinformatic attributes [4, 5].

The first category, of reference-based compressors, acquires the position details of reads in the reference genome via sequence alignment technology to achieve redundant substrings replacement, such as RENANO [6], HRCM [7], and NRGC [8]. Reference-based methods are more helpful in compressing reference-matched reads, but their high compression ratio relies on reference genome selection. In contrast to reference-free compressors, the weaknesses of reference-based approaches can be summarized as follows. (i) Selecting an appropriate reference genome for a target dataset to be compressed can be challenging, and using a mismatched reference genome may result in inferior compression performance. (ii) Reference-based methods incorporate additional knowledge for encoding and decoding. This flexibility-limited technology hinders compressed data sharing and the widespread adoption of compressors. (iii) Aligning short reads data to a selected reference genome is a CPU-intensive and memory-unfriendly computational process, which demands higher time and memory consumption. However, in contrast, reference-free compressors utilize reads' redundant details and character formatting attributes for compression optimization, which have heightened application prospects in multi-species reads compression, such as Quip(-a) [9], BEETL [10], DSR2C [11], FQSqueezer [12], and NanoSpring [13]. Here, we will introduce some state-of-the-art and reference-free short reads compressors relevant to our work. See references [4] and [5] for more detailed reviews of fastq format genomic sequencing reads data compressors.

HARC [14] reorders short reads via their genome position details and then removes the redundancy sub-strings between consecutive reads, which achieves 1.4–2 × compression ratio improvement over Orcom [15] and Leon [16] in 3 billion Illumina short reads data. From the same research team, SPRING [17] enhances HARC by supporting variable-length reads and handles all fastq streams. In their experiment, SPRING compresses 195 GB of 25 × whole genome human data from Illumina’s NovaSeq platform to 7 GB, around 1.6 × smaller than Fastore [18].

PgRC [19] assembles pseudo-genomes through approximate common superstring reads and then encodes them by reads' mapping sites on the pseudo-genome, which bests in compression ratio over Minicom [20] and SPRING by up to 20% and 15% on average, respectively. CURC [21] improves PgRC by CPU and GPU collaborative computing, which achieves 2.76–3.14× and 4.15–6.54× speedup in compression and 1.26–1.65× and 1.6–2.52× decompression speedup compared with SPRING and PgRC on 18 single-end and 13 paired-end sequencing datasets.

Mstcom [22] introduces the concept of Hamming-shifting graphs to encode similar redundant reads and employ compressors BSC (http://libbsc.com) and LZMA (http://www.7zip.org) for compressing the encoded file streams. Like Coil [23] and ReCoil [24], Mstcom reduces redundant information by cross-referencing similar reads. Because the similarity comparison between different string reads is time-consuming, such methods consume more time and memory space. The compression performance of Mstcom can be 10–30% better than SPRING, Minicom, and PgRC on 14 single-end and 7 paired-end datasets.

FastqCLS [25] is the latest compressor, handy for compressing long and short genomic sequencing reads. It first extracts the features of reads via a novel scoring model and reorders them according to the numeral eigenvalues. Once the similarity reads are aggregated, the reads are compressed by the general-purpose compressor ZPAQ (http://mattmahoney.net/dc/zpaq.html) to improve the overall compression ratio. The experimental results on the MPEG [2] and LFastqC [26] benchmark datasets demonstrate that FastqCLS achieves compression ratios of 2.28–3.37 for long reads datasets, and 4.01–17.17 for short reads datasets. Compared with the pure ZPAQ compressor, FastqCLS achieves an average compression ratio boost of 4% (up to 18%) via the reads scoring model.

According to our investigation, existing reference-free short reads compressors rarely use redundancy information between different fastq files and large equipment memory in actual compression scenes to improve compression ratio. For example, in medium and long-term genomic data backup systems, the memory of the compression server often reaches hundreds or thousands of gigabytes (GB). Suppose the peak memory of the compressor holds only hundreds of megabytes (MB). In that case, it has advantages in memory usage but is ineffective for large memory to improve the compression ratio in multi-fastq files. To this end, we take the genomic sequencing reads compression ratio as the optimization objective and propose a large-scale reads compression optimization method named PMFFRC (Parallel Multi-Fastq-File Reads Clustering). The experimental results of actual 982 GB fastq format sequencing data, with the size of 274 GB and 3.3 billion reads, show that the average maximum compression ratio gains of compressors HARC, SPRING, Mstcom, and FastqCLS optimized by PMFFRC are 77.89%, 77.56%, 73.51%, and 29.36%, respectively. The maximum percent storage savings of the four cascaded compressors are 39.41%, 41.62%, 40.99%, and 20.19%, respectively. In the current version, the PMFFRC optimizer only compresses short reads, which accepts a folder containing multiple fastq files and outputs a compressed file *.pmffrc.

Implementation

Let F = {F₀, F₁, …, F_v-1} denotes a group of fastq files, R_i = {${R}_{0}^{i}, {R}_{1}^{i}, {R}_{2}^{i}, \dots ,$ ${R}_{\left(\left|{F}_{i}\right|/4\right)-1}^{i}$} denotes a collection of string short reads in the i-th fastq sequencing file, |F_i| denotes the number of lines in F_i, n denotes the average length of sequencing reads, Y represents the cascaded optimization compressor, U_ram represents the user-preset safe memory threshold (less than system memory), and K represents the number of clusters, where i = 0, 1, 2, …, v-1. The proposed optimization method PMFFRC includes two parts: sequencing file clustering (Section “Sequencing files clustering”), joint compression and decompression (Section “Joint compression and decompression”). PMFFRC benefits from the redundant short reads information entropy of similar fastq format sequencing files. Additional file 1: Section S1 gives more details for theoretical entropy analysis.

Sequencing files clustering

The PMFFRC optimizer achieves high compression ratios by clustering redundant reads together in different fastq files and increases cascaded compressor robustness by memory modeling. In the PMFFRC workflow, sequencing files clustering includes four stages. (i) Pre-compression: estimates the peak memory consumption on the overall fastq files by using random sampling technology. (ii) Feature extraction: converts short reads strings into numerical features, simplifying calculations and improving the optimization efficiency of PMFFRC. (iii) Similarity calculation: evaluates the similarity of each group file through the collection similarity assessment method. (iv) Fastq files clustering: ensures the algorithm robustness of the optimized compressor via a two-level clustering parameter selection strategy.

Pre-compression

The purpose of pre-compression is to evaluate the maximum memory consumption of the compressed short reads data in dataset F. Let Y_res denotes the resident memory of compressor Y (such as dictionaries and hash tables), and Y_reads denotes the extra memory space opened for genomic sequencing reads. To evaluate compression peak memory Y_cpm = Y_res + Y_reads of Y on dataset F, PMFFRC performs the following steps:

Step 1: Randomly select x₁ and x₂ sets of sequencing data from F_i to construct pre-compression fastq files X₁ = {${x}_{0}^{1}$, ${x}_{1}^{1}$, ${x}_{2}^{1}$, …, ${x}_{{v\times x}_{1}-1}^{1}$} and X₂ = {${x}_{0}^{2}$, ${x}_{1}^{2}$, ${x}_{2}^{2}$, …, ${x}_{{v\times x}_{2}-1}^{2}$}, each group ${x}_{j}^{e}$ contains description information, sequencing reads, and quality scores, where j = 0, 1, 2, …, v × x_e-1, i = 0, 1, 2, …, v-1, x₂ > > x₁ and e = 1, 2.

Step 2: Run compressor Y for pre-compressing datasets X₁ and X₂, getting Y's peak memory ${Y}_{peak}^{1}$ and ${Y}_{peak}^{2}$ on datasets X₁ and X₂.

Step 3: Estimate the compression peak memory Y_cpm of algorithm Y on datasets F = {F₀, F₁, …, F_v-1} according to the formula (1):

$$Y_{cpm} = Y_{res} + Y_{reads} = Y_{peak}^{1} + \frac{{\left| {Y_{peak}^{2} - Y_{peak}^{1} } \right|}}{{v \times \left( {x_{2} - x_{1} } \right)}} \times \mathop \sum \limits_{i = 0}^{v} \frac{{\left| {F_{i} } \right|}}{4}$$

(1)

In formula (1), |F_i|/4 denotes the total number of reads in F_i, where i = 0, 1, 2, …, v-1.

Feature extraction

In pre-compression stage, PMFFRC estimated the maximum memory consumption Y_cpm of compressor Y on F. However, the calculated Y_cpm might exceed the system memory. Thus, PMFFRC compresses similar fastq files in batches via redundant reads clustering method, using reads' feature vectors and user-preset safe memory threshold. However, converting string reads to digitized feature vectors is time-consuming. Therefore, PMFFRC utilizes CPU multi-cores to accelerate this stage in parallel.

Let Pr denotes the number of utilized CPU cores, ${\overline{R} }_{i}$= [${\dot{R}}_{0}^{i},{\dot{R}}_{1}^{i},{\dot{R}}_{2}^{i},\dots ,{\dot{R}}_{\left(\left|{F}_{i}\right|/4\right)-1}^{i}$] denotes a numeral feature vector of R_i, and ${\dot{R}}_{j}^{i}$ is the feature value of read ${R}_{j}^{i}$. Referring to the design idea of the reads scoring model in FastqCLS [25], PMFFRC employs Pr CPU cores to extract sequencing reads feature values in parallel through the data cycle division strategy [27], as shown in the formula (2):

$$\dot{R}_{j}^{i,p} = \mathop \prod \limits_{e = 0}^{3} \left( {\mathop \sum \limits_{r = 0}^{{\left| {R_{j}^{i,p} } \right|}} I\left( {R_{j}^{i,p} \left[ r \right] = E\left[ e \right]} \right) + 1} \right)/n$$

(2)

In formula (2), E = {A, C, G, T}, $I\left({R}_{j}^{i,p}\left[r\right]=E\left[e\right]\right)$ is an indicator function, p denotes the p-th CPU core, and p = i % Pr, where j = 0, 1, 2, …,$\left(\left|{F}_{i}\right|/4\right)$-1, i = 0, 1, 2, …, v-1.

Similarity calculation

After obtaining the feature vector ${\overline{R} }_{i}$= [${\dot{R}}_{0}^{i},{\dot{ R}}_{1}^{i},{ \dot{R}}_{2}^{i}, \dots ,{\dot{ R}}_{\left(\left|{F}_{i}\right|/4\right)-1}^{i}$] of R_i in F_i, PMFFRC evaluates the redundant reads similarity between different fastq files. Let S denotes a similarity collection of feature vectors ${\overline{R} }_{a}$ and ${\overline{R} }_{b}$. In practical scenarios, the data scale between F_a and F_b is unbalanced, usually. To offset the impact of fastq file size differences on similarity calculation between F_a and F_b, PMFFRC introduces correction parameters α and dice coefficient [28], uses Pr CPU cores, and improves the parallel similarity calculation model as shown in formula (3):

$$sim\left({\overline{R} }_{a}^{p},{\overline{R} }_{b}^{p}\right)=\alpha \times \frac{2\times \left|{\overline{R} }_{a}^{p}\cap {\overline{R} }_{b}^{p}\right|}{\left|{\overline{R} }_{a}^{p}\right|+\left|{\overline{R} }_{b}^{p}\right|}$$

(3)

In formula (3), $sim({\overline{R} }_{a}, {\overline{R} }_{b})$ denotes the similarity value between the feature vectors ${\overline{R} }_{a}$ and ${\overline{R} }_{b}$. Where $\alpha =1/(1-\frac{\left|\left|{\overline{R} }_{a}^{p}\right|-\left|{\overline{R} }_{b}^{p}\right|\right|}{\left|{\overline{R} }_{a}^{p}\right|+\left|{\overline{R} }_{b}^{p}\right|})$, p = s % Pr, s = 0, 1, 2, …, |S|-1, |S|= $\frac{v\times (v-1)}{2}$, a = 1, 2, 3, …, v-1, b = 0, 1, 2, …, v-2, and a > b.

Fastq files clustering

For computing friendly, PMFFRC converts the string reads files into numerical vectors to calculate S via formula (3). After that, it sorts S in descending order using the quicksort algorithm [29] and performs fastq files clustering. In order to select an appropriate parameter K, PMFFRC utilizes a two-level clustering parameter selection strategy. Specifically, PMFFRC first determines the files-level parameter K₁ and then slightly adjusts it to get the reads-level parameter K₂.

In actual experimental observations, the peak memory Y_cpm of Y shows a nonlinear growth trend with the fastq data scale growth of F in some cases (such as FastqCLS). Considering algorithm robustness, by modeling U_ram and Y_cpm, PMFFRC introduces an empirical correction factor β to artificial-fixed the files-level clustering parameter K₁, the calculation model as shown in formula (4):

$${K}_{1}=\lceil\beta \times \frac{{Y}_{reads}}{{U}_{ram}-{Y}_{res}}\rceil$$

(4)

In formula (4), the parameter K₁ represents the number of files-level clusters determined by the memory modeling method. However, there is a significant variation in the number of reads across different sequencing files. When the number of short reads within a cluster becomes too large, it may exceed the predefined memory threshold U_ram for compression. Therefore, optimizer PMFFRC dynamically adjusts t K₁ to get the number of clusters K₂ at reads-level as the final clustering parameter K.

Figure 1 illustrates the clustering of five sequencing fastq files using the proposed two-level (where v = 5 and K₁ = 2) clustering parameter selection strategy.

In Fig. 1, K₁ = 2, and M = 1060 denotes the overall reads number. According to K₁ and M, PMFFRC gets the average number of reads in each cluster as ave =$\lfloor {{M}/{K}_{1} } \rfloor$ = 530. However, the value of ave is calculated under the K₁ condition, and due to significant variations of the reads number in different fastq files, it is challenging for the optimizer PMFFRC to achieve the ideal reads number average state for each cluster. Therefore, PMFFRC fine-tunes at the reads level to ensure that the actual number of reads within each cluster is close to the ave value. Figure 1a shows the similarity matrix calculated from formula (3). In Fig. 1b, our optimizer PMFFRC initiates the analysis from the matrix element with the highest similarity score and employs a straightforward "first cluster first priority" principle when clustering fastq files. This straightforward implementation strategy guarantees an overall time advantage for the optimizer. PMFFR first detects that max(S) is sim(${\overline{R} }_{3}, {\overline{R} }_{0}$) = 0.92, which indicates the redundant reads between fastq files F₃ and F₀ have the highest similarity, so it adds F₀ and F₃ to the cluster C₀. Then, PMFFRC calculates the total number of reads |C₀| is 340 in C₀, which is less than ave, so it tries to add the fastq file F₁ in C₀. However, |C₀| is 650 at this time, which is greater than ave, so it discards F₁ and obtains the first cluster C₀ = {F₀, F₃}. After the above steps, the first cluster C₀ has been received. Thus, PMFFRC removes the elements in collection C₀ from the similarity matrix and selects the second cluster files. In Fig. 1c, d, after the first cluster has been built, the remaining components are sim(${\overline{R} }_{2}, {\overline{R} }_{1}$), sim(${\overline{R} }_{4}, {\overline{R} }_{1}$), sim(${\overline{R} }_{4}, {\overline{R} }_{2}$), and sim(${\overline{R} }_{4}, {\overline{R} }_{3}$), where sim(${\overline{R} }_{4}, {\overline{R} }_{1}$) = 0.90 is the highest similarity score. Therefore, PMFFRC adds F₄ and F₁ to a new cluster C₁. Now, |C₁|= 460, which is less than ave. If PMFFRC adds F₂ to cluster C₁, |C₁| will exceed ave. Consequently, PMFFRC ends the second clustering stage to obtain cluster C₁ = {F₁, F₄} and adds fastq file F₂ to another cluster C₂ = {F₂}.

Via user-preset safe memory threshold and reads feature vectors, PMFFRC clusters similar fastq files together. Thus, the high-similarity files in the same cluster generate more highly similar redundant reads, which is more helpful for subsequent cascaded compressors. Additional file 1: Algorithm S1 summarizes the proposed optimization method PMFFRC. Additional file 1: Section S2 details the algorithm description and analysis.

Joint compression and decompression

When obtaining the cluster record files, C_k.info, of the sequencing files collection F = {F₀, F₁, …, F_v-1}, the fastq files in each cluster C_k are merged according to temp files C_k.info, where k = 0, 1, 2,…, K-1. After that, dedicated state-of-the-art short reads compressors can be used to compress these clustered files.

The PMFFRC optimizer further improves the compression ratio for the optimized algorithm on the cluster files at the joint compression and decompression phase. Due to its heightened scalability, PMFFRC can be applied to the latest short reads compressors. To better embody the optimization idea, Fig. 2 shows the overall processing workflow of the PMFFRC optimizes algorithm Y to compress and decompress fastq format files collection F = {F₀, F₁, …, F_v-1}.

In Fig. 2a, C_k.ycom is the compressed files by compressor Y, and the *.pmffrc file is the optimized compressed files, where k = 0, 1, 2, …, K-1. In Fig. 2b, F_i.reads denote the recovered files corresponding to F_i.fastq, where i = 0, 1, 2, …, v-1.

Results

Experimental configuration and datasets

Evaluation experiments were carried out on a Sugon-7000A supercomputer system of the National High-Performance Computing Center Nanning Branch (https://hpc.gxu.edu.cn). Computing nodes are equipped with 2*Intel Xeon Gold 6230 CPU (2.1 Ghz, 40 cores), 512 GB DDR4 SDRAM, and 8*900 GB disk space. The experimental node runs the 64-bit version of centos 7.4. The PMFFRC was implemented by C + + 11 and OpenMP. The overall workflow has been encapsulated into the PMFFRC toolkit. Additional file 1: Section S3 gives the installation and configuration details of the PMFFRC optimizer. The PMFFRC toolkit follows the Apache-2.0 License, available freely at https://github.com/fahaihi/PMFFRC.

Three large-scale datasets from the NCBI (https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/sra) database were used to assess optimization effects. For paired-end sequencing files, PMFFRC arranges them as two separate single-end files. Table 1 gives the details of the experimental datasets.

Table 1 Datasets used for optimizing large-scale genomic sequencing reads compression

Full size table

The proposed algorithm PMFFRC was tested on three large-scale datasets, and its compression size (cs), compression ratio (cr), compression peak memory (cpm), compression time (ct), decompression time (dt), and decompression peak memory (dpm) were compared to most state-of-the-art and reference-free reads compressors, more specifically, HARC [14], SPRING [17], Mstcom [22] and FastqCLS [25]. We also employed the compression ratio gains (crg) and percent storage savings (pss) to evaluate the optimization results. The crg and pss calculations are as follows:

$$crg = \frac{cr\_with\_PMFFRC}{{cr\_without\_PMFFRC}} - 1$$

(5)

$$pss = 1 - \frac{cs\_with\_PMFFRC}{{cs\_without\_PMFFRC}}$$

(6)

All experiments assume the compression server has a maximum memory of 300 GB and utilizes the default 8 CPU cores for joint compression and decompression.

Performance on Homo sapiens dataset

Table 2 shows the results for HARC, SPRING, Mstcom, and FastqCLS using the proposed algorithm PMFFRC on the H. sapiens dataset.

Table 2 Performance of PMFFRC on H. sapiens dataset

Full size table

As seen in Table 2, using PMFFRC for compression optimizing, when K = 1, the crg of HARC, SPRING, Mstcom, and FastqCLS is increased up to 118.54%, 126.95%, 108.51%, and 4.76%, respectively. The four special compressors save 45.74%, 55.87%, 52.04%, and 4.51% storage spaces compared with the without-mode.

In Table 2, we also observed that HARC, SPRING, and Mstcom held compression ratio gains of over 100% (when k = 1), while FastqCLS only achieved 4.76%. This phenomenon is attributed to the particular reads reorder model employed by FastqCLS. In reference [25], FastqCLS introduces the concept of "constructed score" for reordering reads collection, which relies on calculating the probability distribution of each nucleotide base within the reads range. When the length of the reads collection is shorter, it becomes challenging for the sequence scoring model of FastqCLS to generate local clustering effects among similar short reads. As a result, this limitation hampers the optimization efficiency of PMFFRC. Moreover, for reference-free compressors based on reads reorder technologies, the index that records the relative positions of short reads is crucial for restoring the original data. For a collection of shorter reads, the abovementioned record file occupies a higher proportion in the compressed file, which is another potential reason for the low optimization efficiency of PMFFRC. Although PMFFRC did not achieve the expected optimization results for FastqCLS on the H. sapiens dataset, PMFFRC remains highly competitive in saving storage space. As shown in Table 2, the proposed PMFFRC saved 4.51% of storage space for FastqCLS merely through joint compression.

PMFFRC requires extra time for fastq files clustering, merging, and splitting, so ct and dt are slightly higher than the without-mode. The parallel clustering speedup and relative memory consumption by employing multiple CPU cores on the H. sapiens dataset are also evaluated. See Additional file 1: Table S1 for more details.

Performance on Salvelinus fontinalis dataset

To evaluate the optimizing efficiency of PMFFRC on large-scale fastq files, Table 3 shows the compression performance of the four cascaded algorithms by running PMFFRC on the S. fontinalis dataset, which contains 360 sequencing files with a total size of 190 GB. Additional file 1: Table S2 details the clustering speedup and memory consumption on the S. fontinalis dataset using the different numbers of CPU cores.

Table 3 Performance of PMFFRC on S. fontinalis dataset

Full size table

Table 3 shows that the compression ratio cr using the PMFFRC optimizer for the four cascaded compression algorithms reaches the peak at K = 1. The cr increases by 56.22%, 57.45%, 71.12%, and 62.56% compared with without-mode. By sacrificing system memory, the MPFFRC optimizer makes four special compressors save storage space of 36.02%, 36.43%, 41.60%, and 38.44%, respectively. The optimizer PMFFRC delivers remarkable results on the S. fontinalis dataset due to its maximal utilization of the redundancy information within the compressed datasets.

Performance on Cicer arietinum dataset

Table 4 presents the optimization results of HARC, SPRING, Mstcom, and FastqCLS by running PMFFRC on C. arietinum dataset, which contains 60 fastq files, and the average data size of every single fastq file is 11 GB (the total size of the Cicer arietinum dataset is 681 GB). The speedup and time consumption by using different CPU cores on the C. arietinum dataset are shown in Additional file 1: Table S3.

Table 4 Performance of PMFFRC on C. arietinum dataset

Full size table

As shown in Table 4, for HARC and SPRING compressors, when the safe threshold U_ram takes 80–100 GB, the clustering parameter K = 1. Simultaneously, the compression ratio cr of HARC and SPRING increased by 58.90% and 48.29%. By leveraging system memory, the PMFFRC optimizer saved 36.44% and 32.55% storage space sizes for HARC and SPRING, respectively. For compressors Mstcom and FastqCLS, when the safe threshold U_ram takes 300 GB (maximum system memory), the clustering parameter K = 4, and the compression ratio cr is improved by 40.91% and 20.76%. In this case, PMFFRC preserved 28.98% and 17.16% of storage space sizes for Mstcom and FastqCLS, respectively.

In C. arietinum dataset, the compression time ct of four cascaded compressors was reduced by running PMFFRC compared with without-mode. This contribution can be attributed to the following reasons: (i) PMFFRC accelerates computation-intensive stages by multi-core CPU parallelism. Therefore, the time consumption of fastq files clustering and merging steps is far lower than the total time consumption. (ii) Similar sequencing reads are gathered by redundant clustering in different fastq files, which favors reference-free compressors in the joint compressing stage. For example, HARC [14] utilizes a redundant substitute step for replacing reverses complimentary and direct repeat reads, which benefit from our clustering algorithm. (iii) The joint compression stage reduces the preprocessing time consumption, such as building hash tables and dictionaries. For example, in the without-mode, HARC needs to perform 60 hash table construction processes on the Cicer arietinum dataset. However, joint compression only needs to build hash table once when K = 1.

Discussion

We presented PMFFRC, a CPU-parallel algorithm for optimizing large-scale genomic sequencing reads lossless compression. On real datasets, PMFFRC achieves a 20–80% maximum compression ratio gains and an additional 20–40% file size reduction compared with non-optimized reference-free compressors. Our work pointed out a feasible idea for large-scale genome sequencing reads compression, which is to use the redundant information of reads in different fastq files to improve the compression ratio. In some cases, the PMFFRC algorithm can also reduce the compression and decompression time. Compared with non-optimized reference-free compressors, the significant compression advantage for optimized algorithms is achieved through a "three-stage redundancy utilization strategy" (fastq files, reads string, and nucleotide character) provided by the optimizer and the optimized algorithm. A typical sample implementation is PMFFRC + FastqCLS. (i) The PMFFRC optimizer employs a two-level clustering method to group similar fastq format data, resulting in highly similar data within each group. This clustering strategy allows the FastqCLS algorithm to take advantage of redundant information at the files level in the compressed datasets. (ii) The FastqCLS algorithm uses a novel scoring model to reorder short reads by leveraging redundancy information at the short reads string level. This preprocessing phase takes advantage of the redundancy between short reads. (iii) The FastqCLS compressor incorporates the ZPAQ algorithm, which employs context modelling and arithmetic coding. This enables FastqCLS to detect patterns and character dependencies in the reads data, utilizing context models and exploiting redundancy at the nucleotide character level to improve compression ratios.

Our optimizer PMFFRC showed promising results on 444 sequencing files. However, it is undeniable that PMFFRC nonetheless has some limitations. On the one hand, the PMFFRC algorithm may not achieve the expected optimization results for small memory devices and small-scale genome sequencing data. Since PMFFRC utilizes system memory and files-level redundancy information, it is more suitable for compressing medium to large-scale sequencing datasets. On the other hand, the high compression ratio achieved by PMFFRC restricts its flexibility in decompression, making it more suitable for medium and long-term backup applications of sequencing short reads data. Based on current work, potential future work includes: (i) Improving computation of PMFFRC by using CPU and GPU (Graphics Processing Unit) collaborative computing for large-scale and long sequencing reads optimization compression. (ii) Achieving the accurate compression and decompression of clustering files to enhance application flexibility. (iii) Creating a block index for each joint compressed file and reducing the overall compression memory consumption through block compression, so that small memory devices can also benefit from the proposed PMFFRC algorithm. (iv) Another interesting direction is to explore machine learning and deep learning techniques for predicting the compression peak memory Y_cpm and clustering parameters K accurately, to maximize utilization the user-preset safe memory threshold U_ram.

Availability and requirements

Project name: PMFFRC
Project home page: https://github.com/fahaihi/PMFFRC
Operating system(s): Linux
Programming language: C + + , OpenMP
Other requirements: FastqCLS, Mstcom, Harc, Spring, and gcc 5.4.0.
License: Apache-2.0 license
Any restrictions to use by non-academics: For commercial use, please contact the authors.

Availability of data and materials

The datasets generated and analyzed during the current study are available in the PMFFRC repository, https://github.com/fahaihi/PMFFRC/tree/master/data.

Abbreviations

MB:: Megabyte
GB:: Gigabyte
TB:: Terabyte

References

Voges J, Hernaez M, Mattavelli M, Ostermann J. An introduction to MPEG-G: the first open ISO/IEC standard for the compression and exchange of genomic sequencing data. Proc IEEE. 2021;109(9):1607–22. https://0-doi-org.brum.beds.ac.uk/10.1109/JPROC.2021.3082027.
Article CAS Google Scholar
Numanagić I, Bonfield JK, Hach F, Voges J, Ostermann J, Alberti C, et al. Comparison of high-throughput sequencing data compression tools. Nat Methods. 2016;13(12):1005–8. https://0-doi-org.brum.beds.ac.uk/10.1038/nmeth.4037.
Article CAS PubMed Google Scholar
Kokot M, Gudyś A, Li H, Deorowicz S. CoLoRd: compressing long reads. Nat Methods. 2022;19(4):441–4. https://0-doi-org.brum.beds.ac.uk/10.1038/s41592-022-01432-3.
Article CAS PubMed PubMed Central Google Scholar
Zhu Z, Zhang Y, JiZ HS, Yang X. High-throughput DNA sequence data compression. Brief Bioinform. 2015;16(1):1–15. https://0-doi-org.brum.beds.ac.uk/10.1093/bib/bbt087.
Article CAS PubMed Google Scholar
Hernaez M, Pavlichin D, Weissman T, Ochoa I. Genomic data compression. Annu Rev Biomed Data Sci. 2019;2:19–37. https://0-doi-org.brum.beds.ac.uk/10.1146/annurev-biodatasci-072018-021229.
Article Google Scholar
Dufort y Álvarez G, Seroussi G, Smircich P, Sotelo-Silveira J, Ochoa I, Martín Á. RENANO: a REference-based compressor for NANOpore FASTQ files. Bioinformatics. 2021;37(24):4862–4. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btab437.
Article CAS PubMed Google Scholar
Yao H, Ji Y, Li K, Liu S, He J, Wang R. HRCM: an efficient hybrid referential compression method for genomic big data. BioMed Res Int. 2019. https://0-doi-org.brum.beds.ac.uk/10.1155/2019/3108950.
Article PubMed PubMed Central Google Scholar
Saha S, Rajasekaran S. NRGC: a novel referential genome compression algorithm. Bioinformatics. 2016;32(22):3405–12. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btw505.
Article CAS PubMed PubMed Central Google Scholar
Jones DC, Ruzzo WL, Peng X, Katze MG. Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Res. 2012;40(22):e171. https://0-doi-org.brum.beds.ac.uk/10.1093/nar/gks754.
Article CAS PubMed PubMed Central Google Scholar
Cox AJ, Bauer MJ, Jakobi T, Rosone G. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 2012;28:1415–9. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/bts173.
Article CAS PubMed Google Scholar
Roguski Ł, Deorowicz S. DSRC2 industry-oriented compression of FASTQ files. Bioinformatics. 2014;30:2213–5. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btu208.
Article CAS PubMed Google Scholar
Deorowicz S. FQSqueezer: k-mer-based compression of sequencing data. Sci Rep. 2020;10(1):1–9. https://0-doi-org.brum.beds.ac.uk/10.1038/s41598-020-57452-6.
Article CAS Google Scholar
Meng Q, Chandak S, Zhu Y, Weissman T. Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach. Sci Rep. 2023;13(1):2082. https://0-doi-org.brum.beds.ac.uk/10.1038/s41598-023-29267-8.
Article CAS PubMed PubMed Central Google Scholar
Chandak S, Tatwawadi K, Weissman T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics. 2018;34(4):558–67. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btx639.
Article CAS PubMed Google Scholar
Grabowski S, Deorowicz S, Roguski Ł. Disk-based compression of data from genome sequencing. Bioinformatics. 2014;31:1389–95. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btu844.
Article CAS PubMed Google Scholar
Benoit G, Lemaitre C, Lavenier D, Drezen E, Dayris T, Uricaru R, et al. Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. BMC Bioinform. 2015;16(1):1–14. https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-015-0709-7.
Article CAS Google Scholar
Chandak S, Tatwawadi K, Ochoa I, Hernaez M, Weissman T. SPRING: a next-generation compressor for FASTQ data. Bioinformatics. 2019;35(15):2674–6. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/bty1015.
Article CAS PubMed Google Scholar
Roguski Ł, Ochoa I, Hernaez M, Deorowicz S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics. 2018;34(16):2748–56. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/bty205.
Article CAS PubMed Google Scholar
Kowalski TM, Grabowski S. PgRC: pseudogenome-based read compressor. Bioinformatics. 2020;36(7):2082–9. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btz919.
Article CAS PubMed Google Scholar
Liu Y, Yu Z, Dinger ME, Li J. Index suffix–prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics. 2018;35(12):2066–74. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/bty936.
Article CAS Google Scholar
Xie S, He X, He S, Zhu Z. CURC: a CUDA-based reference-free read compressor. Bioinformatics. 2022;38(12):3294–6. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btac333.
Article CAS PubMed Google Scholar
Liu Y, Li J. Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression. PLoS Comput Biol. 2021;17(7):e1009229. https://0-doi-org.brum.beds.ac.uk/10.1371/journal.pcbi.1009229.
Article CAS PubMed PubMed Central Google Scholar
White WTJ, Hendy MD. Compressing DNA sequence databases with coil. BMC Bioinformatics. 2008;9(1):242. https://0-doi-org.brum.beds.ac.uk/10.1186/1471-2105-9-242.
Article CAS PubMed PubMed Central Google Scholar
Yanovsky V. ReCoil-an algorithm for compression of extremely large datasets of DNA data. Algorithms Mol Biol. 2011;6(1):23. https://0-doi-org.brum.beds.ac.uk/10.1186/1748-7188-6-23.
Article PubMed PubMed Central Google Scholar
Lee D, Song G. FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model. Bioinformatics. 2022;38(2):351–6. https://0-doi-org.brum.beds.ac.uk/10.1093/bioinformatics/btab696.
Article CAS PubMed Google Scholar
Al Yami S, Huang CH. LFastqC: A lossless non-reference-based FASTQ compressor. PLoS ONE. 2019;14(11):e0224806. https://0-doi-org.brum.beds.ac.uk/10.1371/journal.pone.0224806.
Article CAS PubMed PubMed Central Google Scholar
Cheng J, Grossman M, McKercher T. Professional CUDA c programming. Beijing: China Machine Press; 2017.
Google Scholar
Cha SH. Comprehensive survey on distance/similarity measures between probability density functions. Int J Math Models Methods Appl Sci. 2007;1(2):300–7.
Google Scholar
Hoare CAR. Quicksort. Comput J. 1962;5(1):10–6. https://0-doi-org.brum.beds.ac.uk/10.1093/comjnl/5.1.10.
Article Google Scholar

Download references

Acknowledgements

The experimental work is supported by the High-performance Computing Center of Guangxi University. The authors thank the editor and anonymous reviewers for their constructive comments and suggestions to improve our manuscript. The authors also thank Dr. Meng Yan for guiding the manuscript revision.

Funding

This work is partly supported by NSF of China (62141412, 62272253, 62272252), Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
Hui Sun, Yingfeng Zheng, Huidong Ma, Xiaoguang Liu & Gang Wang
Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning, China
Haonan Xie

Authors

Hui Sun
View author publications
You can also search for this author in PubMed Google Scholar
Yingfeng Zheng
View author publications
You can also search for this author in PubMed Google Scholar
Haonan Xie
View author publications
You can also search for this author in PubMed Google Scholar
Huidong Ma
View author publications
You can also search for this author in PubMed Google Scholar
Xiaoguang Liu
View author publications
You can also search for this author in PubMed Google Scholar
Gang Wang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

HS and YFZ implemented the code. HS and HNX wrote the manuscript. XGL and GW guided the project. HDM completed a part of the experiments and prepared the dataset. XGL and GW also helped to modify the manuscript. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Xiaoguang Liu or Gang Wang.

Ethics declarations

Ethics approval and consent to participate

The ethic approval is not required since we used publicly available datasets.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

PMFFRC_Supplementary_Material.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Sun, H., Zheng, Y., Xie, H. et al. PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering. BMC Bioinformatics 24, 454 (2023). https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-023-05566-9

Download citation

Received: 12 September 2023
Accepted: 13 November 2023
Published: 30 November 2023
DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-023-05566-9

PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering

Abstract

Background

Results

Conclusions

Background

Implementation

Sequencing files clustering

Pre-compression

Feature extraction

Similarity calculation

Fastq files clustering

Joint compression and decompression

Results

Experimental configuration and datasets

Performance on Homo sapiens dataset

Performance on Salvelinus fontinalis dataset

Performance on Cicer arietinum dataset

Discussion

Availability and requirements

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Supplementary Information

Additional file 1:

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us