Skip to main content

Modeling and correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring

Abstract

Background

Somatic copy number alternations (SCNAs) can be utilized to infer tumor subclonal populations in whole genome seuqncing studies, where usually their read count ratios between tumor-normal paired samples serve as the inferring proxy. Existing SCNA based subclonal population inferring tools consider the GC bias of tumor and normal sample is of the same fature, and could be fully offset by read count ratio. However, we found that, the read count ratio on SCNA segments presents a Log linear biased pattern, which influence existing read count ratios based subclonal inferring tools performance. Currently no correction tools take into account the read ratio bias.

Results

We present Pre-SCNAClonal, a tool that improving tumor subclonal population inferring by correcting GC-bias at SCNAs level. Pre-SCNAClonal first corrects GC bias using Markov chain Monte Carlo probability model, then accurately locates baseline DNA segments (not containing any SCNAs) with a hierarchy clustering model. We show Pre-SCNAClonal’s superiority to exsiting GC-bias correction methods at any level of subclonal population.

Conclusions

Pre-SCNAClonal could be run independently as well as serving as pre-processing/gc-correction step in conjuntion with exsiting SCNA-based subclonal inferring tools.

Background

Tumor heterogeneity introduces challenges in cancer tissue diagnosis and subsequent treatment [1]. Currently, projects such as TCGA [2] screened thousands of tumor samples using whole-genome sequencing(WGS) on tissue (bulk) cells, provide more explicit molecular insights on identifying cancer cell types and sub-types than other bioinformatics methods [36]. To decipher cell composition in bulk cell WGS, somatic copy number alterations (SCNAs), commonly found in tumor cells [7], are utilized as the representative to determine tumor subclonal populations in a tumor-normal tissue paired manner by existing tools, e.g. MixClone [8], THetA [9]. The benefits of using SCNAs is that the WGS data doesn’t have to be deeply sequenced [8]. However, existing tools lack the ability to properly account for sequencing GC bias which is widely observed in DNA-seq data [10].

Evidences have showed that GC-bias could affect SCNA identification in tumor cells [1113]. Existing tools consider the SCNA segments have the same sequence properties between the normal and tumor samples, and consider the bias could be offset to use the read count ratios between tumor and normal paired samples [8, 9]. However, We found that, in a GC biased study, the GC contents and read count ratios on SCNA segments present a Log linear biased pattern. Though existing method [14] suggests removing GC bias by modeling GC cotent with tumor-normal coverage difference for small genomic windows, however, we find that small window is not a proper and robust resolution for SCNA.

We present Pre-SCNAClonal, a tool that improving tumor subclonal population inferring by correcting GC-bias at SCNAs level. Pre-SCNAClonal first corrects GC bias using Markov chain Monte Carlo probability model, then accurately locates baseline DNA segments (not containing any SCNAs) with a hierarchy clustering model. We show Pre-SCNAClonal’s superiority to exsiting GC-bias correction methods for SCNA-based tumor reconstruction tools at any level of subclonal population. We also note that Pre-SCNAClonal could be run independently as well as serving as pre-processing/gc-correction step in conjuntion with exsiting SCNA-based subclonal inferring tools.

Data

The WGS data of human breast cancer HCC2218 and HCC1954 with different levels of normal contamination (coverage 30x) are used to validate the method proposed in this paper. Each of the HCC1954 samples, HCC1954.mix1.n5t95, HCC1954.mix1.n20t80, HCC1954.mix1.n40t60, HCC1954.mix1.n60t40, HCC1954.mix1.n80t20 and HCC1954.mix1.n95t5, contains one tumor subclone. The tumor subclonal frequencies (or tumor purity) of these samples are 0.95, 0.80, 0.60, 0.40, 0.20 and 0.05, respectively. We also use the data of human ovary cancer sample TCGA-13-0723 in Benjamini’s work [11] to show the read count ratio’s GC bias between paired tumor and normal sample.

The WGS sequence alignment data (.bam files) of HCC2218 and its paired normal sample are publicly available on Illumina BaseSpace Sequence Hub website https://basespace.illumina.com. The WGS sequence alignment data (.bam files) of HCC1954 and its paired normal sample and the WGS data with different levels of normal contamination are public available at National Cancer Institute GDC Data Portal https://gdc.cancer.gov/resources-tcga-users/tcga-mutation-calling-benchmark-4-files. The WGS sequence alignment data (.bam files) of TCGA-13-0723 is available at National Cancer Institute GDC Data Portal https://portal.gdc.cancer.gov/ only for authorized user.

Methods

GC bias of the tumor WGS data does not have the same feature as its paired normal

Let coefficient θ j denote the effect of mappability and genomic length of segment j, \(\bar {C}_{j}\) denote the average copy number of segment j, λ j denote the expected read counts, and let \(D_{j}^{N}\) denote the read counts of segment j in matched normal genome, then for segment i and segment j, existing SCNA based tumor subclonal populations inferring tools [8, 9] assume that \({\lambda _{i}}/{\lambda _{j}} ={\bar {C}_{i}\theta _{i}}/{\bar {C}_{j}\theta _{j}}\), and \(\theta _{i} / \theta _{j} = D_{i}^{N} / D_{j}^{N}\), then

$$ \frac{D_{i}^{S}}{D_{j}^{S}} = \frac{\lambda_{i}}{\lambda_{j}} = \frac{\bar{C}_{i}\theta_{i}}{\bar{C}_{j}\theta_{j}} = \frac{\bar{C}_{i}}{\bar{C}_{j}} * \frac{D_{i}^{N}}{D_{j}^{N}}. $$
(1)

Figure 1 shows the two normal libraries from the same normal sample, and there is a crossover point of the two loess lines. Here we suppose the normal Lib 2 is a tumor sample has no variations, and normal Lib 1 is its paired normal sample. According to Eq. 1,

$$ \frac{D_{i}^{Lib2}}{D_{j}^{Lib2}} = \frac{\lambda_{i}}{\lambda_{j}} = \frac{\bar{C}_{i}\theta_{i}}{\bar{C}_{j}\theta_{j}} = \frac{2}{2} * \frac{D_{i}^{Lib1}}{D_{j}^{Lib1}} = \frac{D_{i}^{Lib1}}{D_{j}^{Lib1}}, $$
(2)
Fig. 1
figure 1

GC curves (10 kb bins). Observed fragment counts and loess lines plotted against GC of two libraries from the same normal sample TCGA-13-0723.Bins were randomly sampled from chromosome 1. This Figure is drawn by Benjamini et al. [11]

If j is the crossover point, we have \(D_{i}^{Lib2} = D_{i}^{Lib1}\), which means the two loess lines should overlap each other. This demonstrates that the GC bias is different in the tumor and its paired normal sample.

Modelling the difference of GC bias between paired tumor and normal sample

We find that, the difference between the GC bias of tumor and its paired normal could be modelled as following equation,

$$ \begin{aligned} &D_{i}^{N} = \frac{f(GC_{i})}{\exp(a_{1} * GC_{i}) / (d_{1} * GC_{i})} \\ &D_{i}^{S} = \frac{f(GC_{i})}{\exp(a_{2} * GC_{i}) / (d_{2} * GC_{i})} \\ \end{aligned}, $$
(3)

In this equation, f(GC i ) is a function of GC content, which represents the bias feature that shared by tumor and its paired normal sample. a1, a2, d1 and d2 denote the distinctions of bias feature between tumor and its paired normal sample. a1 and a2 represent the curvature of tumor and its paired normal sample respectively; d1 and d2 represent the distance of tumor and its paired normal sample respectively; As shown in Fig. 2, the distinctions of bias feature between the paired tumor sample HCC1954 and its paired normal HCC1954 BL could be well captured by this model.

Fig. 2
figure 2

The relationship of GC bias between paired tumor and normal samples. a The GC bias distribution of read count observed in 500bp bins with high mappability (top 10%). To account for uniqueness of sequences, a mappability measure is calculated for each position (base pair) in the bin. A location is called ‘mappable’ if the k-mer of the reference genome starting at the location is not perfectly repeated at any other location in the genome, where k is the read length. Both of the tumor and normal samples are processed by Illumina platform to produce the reads, and use GATK’s table recalibration and use Burrows-Wheeler Aligner (bwa) to align the sequence data with the same parameters. b In this figure, the red and blue solid lines are the mean functions of loess smooth in (a). The red and blue dashed lines are the mean functions of loess smooth in (a) multiplies exp(1.3GC)/(2.3GC) and exp(2GC)/(3GC) respectively

According to Eq. 3, Eq. 1 is transformed into

$$ \frac{D_{i}^{S}}{D_{j}^{S}} = \frac{\bar{C}_{i}}{\bar{C}_{j}} * \exp\left[(a_{2}-a_{1})* (GC_{j} - GC_{i})\right] * \frac{D_{i}^{N}}{D_{j}^{N}}, $$
(4)

then,

$$ \log{\frac{D_{i}^{S}}{D_{i}^{N}}} - \log{\frac{D_{j}^{S}}{D_{j}^{N}}} = \log{\frac{\bar{C}_{i}}{\bar{C}_{j}}} + \left(a_{2}-a_{1}\right)* \left(GC_{j} - GC_{i}\right). $$
(5)

Equation 5 reveals that the read count ratio presents a Log linear biased pattern on SCNAs which we will prove it later. Equation 5 also shows that the read count ratio’s GC bias between paired tumor and normal sample exists if the curvature of tumor and its paired normal sample are not the same. We also find this phenomenon in HCC2218 (Additional file 1: Figure S1).

BAF in tumor WGS data presents symmetrical pattern in [0,1] at heterozygous SNP sites

Let μ i denote the BAF of SCNA segment i of tumor genome on germline heterozygous SNP site, and let C i , G i respectively denote the absolute copy number and genotype of SCNA segment i. The B allele (non-reference allele) could be either maternal or paternal allele, thus the BAF of SCNA segments of tumor genome presents symmetrical pattern in [ 0,1] (please see Additional file 1: Supplementary 3.3.2 for detail proof). Let ξ i denote the BAF of the tumor sample, ϕ i denote the subclonal population frequency, then,

$$ \xi_{i} = \frac{\phi_{i} * C_{i} * \mu_{i} + \left(1- \phi_{i}\right) *2*\frac{1}{2}}{\bar{C}_{i}}, $$
(6)
$$ \bar{C}_{i} = \phi_{i} * C_{i} + \left(1- \phi_{i}\right) *2. $$
(7)

In Eqs. 6 and 7, ‘2’ and ‘\(\frac {1}{2}\)’ are the copy number and heterozygous BAF of normal sample. Then, ξ i is symmetrical in [ 0,1], because μ i is symmetrical in [ 0,1].

GC bias of read count ratio affects SCNA based subclonal population analysis

By increasing the window size to 5000bp (Fig. 3c) or even larger at SCNA level (Fig. 3b), the 2D plot between GC content and tumor-normal coverage ratio clearly clustered into multiple stripes. It is noted that the relationship is pretty linear between GC content and log ratio of tumor-normal coverage on SCNAs (Fig. 3a) and we show that slopes of linear relation vary across tumors (Additional file 1: Figure S1). We also show that the gaps between the stripes in Fig. 3a are proportional to the subclonal populations (as shown in the sub-figures in the first column of Fig. 4). The SCNA segments which are clustered into the same stripe, present the symmetrical pattern of B allele frequency (BAF) density on the heterozygous allele loci of paired normal sample (Fig. 3e), which reveals that these SCNA segments in the same stripe contain the same copy number(see Additional file 1: Supplementary 3.3.2 for detail proof). While using the ratio of read counts of SCNA segments to get the precise subclonal population of each SCNA, it needs to correct the GC bias of the gap first.

Fig. 3
figure 3

GC bias of WGS data of tumor-normal paired sample HCC1954.mix1.n20t80 of TCGA mutation calling benchmark 4. Let DS and DN respectively denote the read counts of the segment of tumor and normal samples. a The GC bias of the Log ratio of tumor and normal read counts of the SCNA segments. The purple and blue lines are linear regression and loess regression lines respectively. b The GC bias of the ratio of tumor and normal read counts of the SCNA segments. The red line are drawn by the loess regression model with a quadratic polynomial function, which is used to rectify the distribution of the ratio DS/DN in the state-of-art GC correction method [14]. c The GC bias of the ratio of tumor and normal read counts of the 5000 bp bin. Since the majority (81%) of CNV calls are between 1 kb and 100 kb [17], most of 5000 bp bins spans only one SCNA. This sub-figure shows most SCNAs clustered clearly into multiple strips. d The GC bias of the ratio of tumor and normal read counts of the 500 bp bin. e The distribution of B-allele frequency (BAF) of stripe 1–6 in Fig. 3a. The SCNA segments are obtained by BIC-seq [18]

Fig. 4
figure 4

Read count ratio’s GC bias correction of HCC1954 with different levels of normal contamination. Here ‘n5t95’, ‘n20t80’, ‘n40t60’, ‘n60t40’ and ‘n95t5’ respectively denote the tumor sample ‘HCC1954.mix1.n5t95’, ‘HCC1954.mix1.n20t80’, ‘HCC1954.mix1.n40t60’, ‘HCC1954.mix1.n60t40’, ‘HCC1954.mix1.n80t20’ and ‘HCC1954.mix1.n95t5’. Subfigures in the ‘Origin’ column show the GC bias of read count ratio before correction, and column ‘MCMC’ and ‘Regression’ show the GC bias of read count ratio after the correction by MCMC model of Pre-SCNAClonal and Regression model respectively. The red lines are the linear regression lines. All the subfigures are plotted by Pre-SCNAClonal

Existing read count ratio’s GC bias correction methods are not suitable for SCNA based subclonal population analysis

Existing GC correction methods for WGS data of tumor normal paired sample, such as CNAnorm [14], rectifies the distribution of the ratio of read counts of the small window, aiming at finding the position of SCNA and absolute copy number (Fig. 3d) by merging the adjoining small window with similar ratio properties. This method uses regression model to rectify the GC content distribution of the ratio and hence removing the dependencies on GC content. However, while using this GC correction method to rectify the bias of read count ratio for SCNA based subclonal population analysis, it additionally requires the regression correctly capture the slope of the gaps between the SCNA stripes. As shown in Fig. 3a and b, linear or loess regression could be easily biased by outliers, regression lines in Fig. 3a and b do not parallel the stripes, hence there would still exist GC content bias after removing the dependencies on GC content based on these regression lines (see Fig. 4).

Models of Pre-SCNAClonal for read count ratio’s GC bias correction for SCNA based subclonal population analysis

MCMC model

Pre-SCNAClonal uses a Markov chain Monte Carlo (MCMC) model to pick out the maximum posterior probability of stripe slope m listed in Eq. 8,

$$ \begin{array}{ll} p(m|Y,X) & \sim p(m) * p(Y,X|m)\\ m & \sim \text{Uniform}(a- \delta, a+\delta)\\ p(Y, X|m) & = \Lambda(D, \tau * \max(cn))\\ D & = density(Y')\\ Y' & = Y-(m*X + c) + \text{median}(Y)\\ \end{array}, $$
(8)

here Y, X denotes log(DS/DN) and GC content respectively; a, c are slope and intercept pre-determined by two points, coordinates of which are the median of Y and X at high and low GC content areas; δ is the slope range pre-specified; D denotes the density function, Λ(D,τ max(cn)) denotes the sum of top (largest) τ max(cn) peaks of density curve of D; τ denotes the number of subclonal populations, max(cn) denotes the maximum copy number pre-defined. Y represents the corrected Y.

Hierarchy clustering model

Note that, normally, the read counts of tumor segments without SCNA (defined as baseline) are not equivalent to those from paired normal samples due to coverage difference. According to Eqs. 6 and 7, the \(\bar {C}_{i}\) and ξ i of baseline segment always equals to 2 and \(\frac {1}{2}\) respectively. If and only if \(\mu _{i}=\frac {1}{2}\), \(\xi _{i} = \frac {1}{2}\). Then according to Eqs. 5 and 7, the baseline segments locate in the SCNA stripe with \(\xi _{i} = \frac {1}{2}\) and the smallest \(\log {\frac {D_{i}^{S}}{D_{i}^{S}}}\), because only positive even C i with equal paternal and maternal copy could make \(\mu _{i}=\frac {1}{2}\). Thus, after the GC correction, Pre-SCNAClonal picks out all the segments with \(\xi _{i} = \frac {1}{2}\), and imports a hierarchy clustering model to group the segments into several clusters, then Pre-SCNAClonal selects the cluster with smallest \(\log {\frac {D_{i}^{S}}{D_{i}^{S}}}\) as baseline segments.

Results

We use WGS data of HCC1954 with different levels of normal contamination (coverage 30x) to test the models of Pre-SCNAClonal.

Result of MCMC model

As shown in Fig. 4, MCMC model of Pre-SCNAClonal could better correct the read count ratio’s GC bias than linear regression method [14]. The GC correction results of linear regression are either over-corrected (HCC1954.mix1.n5t95 and HCC1954.mix1.n20t80) or under-corrected(HCC1954.mix1.n60t40, HCC1954.mix1.n80t20 and HCC1954.mix1.n95t5).

To further test MCMC model, we also develop a package that could simulate data with extreme bias (please refer to Additional file 1: Supplementary 2.1 for detail). Results show that the MCMC model of Pre-SCNAClonal is robust and noise tolerant, outperforms the regression method in CNAnorm [14].

Result of hierarchy clustering model

The baseline selection method in MixClone [8] obtains baseline by removing outliers of read count ratios of the segments that do not lose heterzygosity(LOH). In the WGS data, it is difficult to distinguish LOH from sequencing deviation or error. As shown in Fig. 5a, segments that do not lose heterzygosity are randomly distributed everywhere. Baseline selection method of MixClone almost picks out all the segments as baseline while the tumor purity is low. In comparison, as shown in Fig. 5b, baseline obtained by Pre-SCNAClonal is lower and more consistent than the baseline obtained by MixClone.

Fig. 5
figure 5

Distribution of logDS/DN and baseline segments. a The green and blue points together are the segments without LOH, the blue points are the baseline segments selected by MixClone, The blue line in each sub-figure is the average value of \({D_{j}^{S}}/{D_{j}^{N}}\) of baseline segments. b The green and blue points together are the segments with no LOH and B allele frequencies around 0.5, the red points are the baseline segments selected by Pre-SCNAClonal. The red line in each sub-figure is the average value of logDS/DN of baseline segments. All the points in this figure are plotted by ggplot2 R package with opacity parameter α=0.05

We calculate the ploidy number based on baseline segments’ \(\log {\frac {D_{i}^{S}}{D_{i}^{N}}}\) to validate baseline selection model of Pre-SCNAClonal, and the result shows that the tumor sample HCC1954 is tetraploidy which is the same as results of COSMIC [15] and ABSOLUTE [16](for detail procedure, please see Additional file 1: Supplementary 3.3.1). Furthermore, the BAF distribution on germline heterozygous SNP site also shows the baseline segments obtained by hierarchy clustering models are correct (for detail procedure, please see Additional file 1: Supplementary 3.3.2).

Result of pipeline test

We respectively string Pre-SCNAClonal with two typical SCNAs based subclonal inferring tools, MixClone [8] and THetA [9], to test the bias correction and baseline selection models of Pre-SCNAClonal on HCC1954. As shown in Table 1, Pre-SCNAClonal–MixClone pipeline almost precisely estimated the subclonal frequency for all HCC1954 tumor samples, which outperforms MixClone alone a lot. Pre-SCNAClonal–THetA pipeline also provided better estimation than THetA alone (THetA could not run on sample ‘n80t20’ and ‘n95t5’ for their BIC-seq segments number lower than 1000). This result shows that Pre-SCNAClonal could greatly improve the performance of tumor subclonal population inferring algorithms.

Table 1 Pipeline test of Pre-SCNAClonal on HCC1954

Discussion

Generally, SCNAs with larger subclonal frequency could be more precisely located relatively. However, due to the twice sequencing procedures of tumor and its paired normal, the read information of the genomic regions with the same copy number in tumor sample is not exactly the same as its paired normal’s. Moreover, the lower read overage of next generation sequencing (NGS) makes the perturbation more likely to be mistaken for a SCNA. As shown in Fig. 6, the number of SCNA breakpoints obtained by SCNA detection tool is proportional to the subclonal frequency. For the samples with higher subclonal frequency, the “true” SCNA segments could be segmented into multiple segments, which causes the Fig. 3c and b presents the same stripe pattern.

Fig. 6
figure 6

Breakpoints distribution on chromosome 1 of mixed “HCC1954” samples. Here ‘n5t95’, ‘n20t80’, ‘n40t60’, ‘n60t40’ and ‘n95t5’ respectively denote the tumor sample ‘HCC1954.mix1.n5t95’, ‘HCC1954.mix1.n20t80’, ‘HCC1954.mix1.n40t60’, ‘HCC1954.mix1.n60t40’, ‘HCC1954.mix1.n80t20’ and ‘HCC1954.mix1.n95t5’. ‘n0t100’ denotes the tumor sample ‘HCC1954’ contains no normal contamination. Each of these sample contains one tumor subclone. All the breakpoints are obtained by BIC-seq [18]

For NGS based SCNA analysis, the read count ratio stripes could serve as a good proxy for bias correction, even if the break points are not correct, because the read count ratio of the “true” SCNA segment is preserved as the center of read count ratio stripe.

Conclusion

Pre-SCNAClonal proposed in this paper is a robust GC bias correction and baseline selection tool for SCNAs based tumor subclonal inferring. Pre-SCNAClonal could correct the read count ratio’s GC bias and improve the performance of SCNA based subclonal inferring tools at all levels of tumor subclonal frequency even the subclonal frequency is very small. Furthermore, Pre-SCNAClonal also provides an user-friendly interface for visualizing and manually correcting the GC bias of read count ratio.

References

  1. Nowell PC. The clonal evolution of tumor cell populations. Science. 1976; 194(4260):23–8.

    Article  CAS  PubMed  Google Scholar 

  2. McLendon R, Friedman A, Bigner D, Van Meir EG, Brat DJ, Mastrogianakis GM, Olson JJ, Mikkelsen T, Lehman N, Aldape K, et al. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008; 455(7216):1061–8.

    Article  CAS  Google Scholar 

  3. Peng J, Lu J, Shang X, Chen J. Identifying consistent disease subnetworks using dnet. Methods. 2017; 131:104–10.

    Article  CAS  PubMed  Google Scholar 

  4. Peng J, Wang H, Lu J, Hui W, Wang Y, Shang X. Identifying term relations cross different gene ontology categories. BMC Bioinformatics. 2017; 18(16):573.

    Article  PubMed  PubMed Central  Google Scholar 

  5. Peng J, Xue H, Shao Y, Shang X, Wang Y, Chen J. A novel method to measure the semantic similarity of hpo terms. Int J Data Mining Bioinformatics. 2017; 17(2):173–88.

    Article  Google Scholar 

  6. Peng J, Zhang X, Hui W, Lu J, Li Q, Shang X. Improving the measurement of semantic similarity by combining gene ontology and co-functional network: a random walk based approach. BMC Syst Biol. 2018;12(Suppl2). In press.

  7. Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010; 463(7283):899–905.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Li Y, Xie X. Mixclone: a mixture model for inferring tumor subclonal populations. BMC Genomics. 2015; 16(Suppl 2):1.

    Article  Google Scholar 

  9. Oesper L, Mahmoody A, Raphael BJ. Theta: inferring intra-tumor heterogeneity from high-throughput dna sequencing data. Genome Biol. 2013; 14(7):1.

    Article  Google Scholar 

  10. Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ. Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010; 7(2):111–8.

    Article  CAS  PubMed  Google Scholar 

  11. Benjamini Y, Speed TP. Summarizing and correcting the gc content bias in high-throughput sequencing. Nucleic Acids Res. 2012; 40:e72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Farkash-Amar S, Lipson D, Polten A, Goren A, Helmstetter C, Yakhini Z, Simon I. Global organization of replication time zones of the mouse genome. Genome Res. 2008; 18(10):1562–70.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Desprat R, Thierry-Mieg D, Lailler N, Lajugie J, Schildkraut C, Thierry-Mieg J, Bouhassira E. Predictable dynamic program of timing of dna replication in human cells. Genome Res. 2009; 19:2288–99.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Gusnanto A, Wood HM, Pawitan Y, Rabbitts P, Berri S. Correcting for cancer genome size and tumour cell content enables better estimation of copy number alterations from next-generation sequence data. Bioinformatics. 2012; 28(1):40–7.

    Article  CAS  PubMed  Google Scholar 

  15. Forbes SA, Beare D, Gunasekaran P, Leung K, Bindal N, Boutselakis H, Ding M, Bamford S, Cole C, Ward S, et al. Cosmic: exploring the world’s knowledge of somatic mutations in human cancer. Nucleic Acids Res. 2014; 43(D1):805–11.

    Article  Google Scholar 

  16. Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al. Absolute quantification of somatic dna alterations in human cancer. Nat Biotechnol. 2012; 30(5):413–21.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Locke MEO, Milojevic M, Eitutis ST, Patel N, Wishart AE, Daley M, Hill KA. Genomic copy number variation in mus musculus. BMC Genomics. 2015; 16(1):497.

    Article  PubMed  PubMed Central  Google Scholar 

  18. Xi R, Luquette J, Hadjipanayis A, Kim TM, Park PJ. Bic-seq: a fast algorithm for detection of copy number alterations based on high-throughput sequencing data. Genome Biol. 2010; 11(1):1.

    Article  Google Scholar 

Download references

Acknowledgments

We are grateful to members of bioinformatics lab of HIT who have tested Pre-SCNAClonal and the financial support of the funding agencies.

Funding

The publication costs of this article was funded by the Major State Research Development Program of China [no. 2016YFC1202302], the National Natural Science Foundation of China (no. 61571152) and the National High-Tech R&D Program of China (863 Program) [nos. 2015AA020101, and 2015AA020108].

Availability of data and materials

The tool we provided in this paper, Pre-SCNAClonal, is publicly available as a Python package: https://github.com/dustincys/Pre-SCNAClonal.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 19 Supplement 5, 2018: Selected articles from the Biological Ontologies and Knowledge bases workshop 2017. The full contents of the supplement are available online at https://0-bmcbioinformatics-biomedcentral-com.brum.beds.ac.uk/articles/supplements/volume-19-supplement-5.

Author information

Authors and Affiliations

Authors

Contributions

All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yadong Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1

Modeling and Correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring. (PDF 2570 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Chu, Y., Teng, M. & Wang, Y. Modeling and correct the GC bias of tumor and normal WGS data for SCNA based tumor subclonal population inferring. BMC Bioinformatics 19 (Suppl 5), 112 (2018). https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-018-2099-0

Download citation

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-018-2099-0

Keywords