 Research article
 Open Access
 Published:
A site specific model and analysis of the neutral somatic mutation rate in wholegenome cancer data
BMC Bioinformatics volume 19, Article number: 147 (2018)
Abstract
Background
Detailed modelling of the neutral mutational process in cancer cells is crucial for identifying driver mutations and understanding the mutational mechanisms that act during cancer development. The neutral mutational process is very complex: wholegenome analyses have revealed that the mutation rate differs between cancer types, between patients and along the genome depending on the genetic and epigenetic context. Therefore, methods that predict the number of different types of mutations in regions or specific genomic elements must consider local genomic explanatory variables. A major drawback of most methods is the need to average the explanatory variables across the entire region or genomic element. This procedure is particularly problematic if the explanatory variable varies dramatically in the element under consideration.
Results
To take into account the fine scale of the explanatory variables, we model the probabilities of different types of mutations for each position in the genome by multinomial logistic regression. We analyse 505 cancer genomes from 14 different cancer types and compare the performance in predicting mutation rate for both regional based models and sitespecific models. We show that for 1000 randomly selected genomic positions, the sitespecific model predicts the mutation rate much better than regional based models.
We use a forward selection procedure to identify the most important explanatory variables. The procedure identifies sitespecific conservation (phyloP), replication timing, and expression level as the best predictors for the mutation rate. Finally, our model confirms and quantifies certain wellknown mutational signatures.
Conclusion
We find that our sitespecific multinomial regression model outperforms the regional based models. The possibility of including genomic variables on different scales and patient specific variables makes it a versatile framework for studying different mutational mechanisms. Our model can serve as the neutral null model for the mutational process; regions that deviate from the null model are candidates for elements that drive cancer development.
Background
Cancer is driven by somatic mutations that convey a selective advantage to the cell. However, in most cases the somatic mutation rate in cancer cells is considerably higher than in healthy tissues, while only a small fraction of the mutations are thought to be associated with cancer development [1]. The majority of the mutations are neutral and are caused by perturbed cell division, maintenance and repair or overexpression of mutagenic proteins (e.g. the APOBEC gene family [2]). A comprehensive framework of the random mutation process in cancer cells is key to identify the regions, pathways and functional units that are under positive selection during cancer development.
The mutation rate varies along the genome, depending on genomic properties of the position such as the sequence context (e.g. the 5’ and 3’ nucleotides; [3]), chromatin organisation [4] or replication timing [5]. Many studies have investigated what determines the mutation rate and what kind of models should be used [5] modeled the mutation heterogeneity using local regression with expression level and replication timing as explanatory variables [4] applied random forest regression on mutation counts in 1Mb windows using histone modifications and the density of DNase I hypersensitive sites. [6] predicted the number of mutations per element by a betabinomial distribution using replication timing and noncoding annotations such as promoter, UTR and ultraconserved sites. Unlike these approaches that segmented the genome into regions according to the explanatory variables and estimated separate models for them, in a sitespecific regression model, this division is not necessary [7] implemented a Poissonbinomial model on 50kb windows, where they used logistic regression to predict the positionspecific mutation probability, based on basepair, replication timing and the presence and type of transcript. Here, we propose a framework based on multinomial regression that is able to model different types of substitutions. Unlike regionbased models, we can use sitespecific explanatory variables without dividing the data into subsets. In this way, we use the full dataset to estimate the regression coefficients for all the explanatory variables. We include the trinucleotide context, GC content and CpG island annotations to describe the local base composition. Local properties of the genome related to transcription and replication are taken into account using expression level, replication timing, DNase I hypersensitivity and genomic element types. To study the differences between somatic mutations and germline substitutions, we include the conservation score phyloP. We also include an explanatory variable to mask the repeat regions in the genome as mutation calls at repeat regions are biased due to technical reasons.
Generalized linear models also provide an interpretable framework for modelling and hypothesis testing. For example, we can estimate the mutation rates of CpG sites within and outside CpG islands, and test if they are different, and a patientspecific intercept allows us to take the large variation of mutation rates between patients into account.
Here, we analyse 505 cancer genomes from 14 different cancer types [8]. We compare the performance in predicting mutation probabilities of regionbased Poisson models, sitespecific binomial models and sitespecific multinomial models. The sitespecific multinomial model can predict both the overall mutation rate and mutation types accurately. We use a forward model selection procedure to compare and identify the explanatory variables that best explain the heterogeneity of the sitespecific mutational process (Fig. 1).
The forward model selection procedure is implemented using 2% of the data while the final model fit is obtained from the remaining 98%. We find that sitespecific conservation (phyloP), replication timing and expression level are the best predictors for the mutation rate. In general, the framework allows formal testing for inclusion of explanatory variables and also interaction terms. The impact of different explanatory variables can be inferred from the parameter estimates of our final model as the multiplicative changes in mutation rate. Our analysis confirms some known mutational signatures and it identifies associations with genomic variables like replication timing. It can also be used as the null model for cancer driver detection [9] and other applications that rely on a model of the mutation rate (e.g. identification of the tissue of origin for tumors of unknown primary, [4]).
Results
Heterogeneity of the mutation rate
We observe significant heterogeneities of the mutation rate at multiple levels (Fig. 2). The mutation rate varies among cancer types (Fig. 2a): skin cutaneous melanoma, colorectal cancer and lung adenocarcinoma are the cancer types with the highest mean mutation rates (5–10 mut/patient/Mb) while thyroid carcinoma, prostate adenocarcinoma and lowgrade glioma have the lowest (0.5–1 mut/patient/Mb). The mutation rate also differs between samples from the same cancer type, with the largest variation seen for skin cutaneous melanoma (the mutation rate ranges from 1 mut/patient/Mb to 150 mut/patient/Mb).
The mutation rate also varies between different genomic contexts (Fig. 2b). As previously shown, we find that mutational signatures are cancer type specific by looking into the mutation rate for different trinucleotide contexts. For instance, the mutation rate at TC* sites is particularly high in skin cutaneous melanoma, with the largest proportion of mutations at TCC positions of all cancer types. The mutation rate at CpG sites is elevated in all the cancer types. In colorectal cancer, we observe a high proportion of mutations at TCG and TCT sites. These can be attributed to mutations in the POLE gene that cause DNA polymerase ε deficiency [10]: we find an increased overall mutation rate, a very high proportion of T[C >A]T and T[C >T]G mutations and a high contribution of COSMIC signature 10 in six out of 42 colon cancer samples [11]. Five of those (and three of the other colon cancer samples) have a nonsynonymous mutation in POLE and one of them in addition in POLD1, which encodes the DNA polymerase δ (Additional file 1: Figure S1). When the samples with POLE mutation pattern are removed, the mutation pattern of the remaining colon cancer samples is very similar to most cancer types with a high proportion of mutations at CpG positions (Additional file 1: Figure S2).
The mutation rate also differes between genomic environments defined by the explanatory variables (Fig. 2b, c). Coding regions tend to have fewer mutations in all cancer types. Mutation rates are elevated for simple repeat regions, which might be related to mapping artefacts and ensuing technical challenges during mutation calling. The effect of CpG islands varies between different cancer types. The mutation rate in CpG islands is higher than in regions outside for thyroid carcinoma, prostate adenocarcinoma, lowgrade glioma and kidney chromophobe, while for colorectal cancer, lung adenocarcinoma, lung squamous cell carcinoma and skin cutaneous melanoma the situation is reversed. Regions that are late replicated, GC rich, evolutionarily less conserved, inside DNase 1 peaks and lowly expressed have an elevated mutation rate. The explanatory power of the variables varies across cancer types as shown by the regression lines in Fig. 2c.
Granularity of regression models
We model the mutation probability in cancer genomes using a set of regression models. The most coarsegrained description of the number of mutations in a region is a Poisson regression count model and the most finegrained is a binomial or multinomial sitespecific regression model. Here we describe and investigate in detail the (dis)advantages of these three models. A conceptual overview of the models is given in Fig. 3a.
Poisson count regression model
In Poisson regression, the number of mutations in a genomic region of fixed length is modeled. The whole genome is divided into regions of prefixed length or according to the value of explanatory variables (e.g. segmented by genomic element types). Regression modelling is facilitated by summing the mutation counts and summarizing the annotations over the region.
We model the mutation count N_{r,sam} in the rth genomic region with length L_{ r } in sample sam. Furthermore, can is the cancer type of the sample. Mutations arise randomly with probability p_{r,sam}. As p_{r,sam} is small and L_{ r } is large, we have the Poisson approximation of the binomially distributed number of mutations
For J explanatory variables, the expected mutation count λ_{r,sam}=L_{ r }p_{r,sam} can be modeled by Poisson regression with a loglink:
where for the jth explanatory variable, x_{r,j} is the average value of the annotation across region r if the explanatory variable is continuous and for categorical explanatory variables, x_{r,j} is derived from the proportions of different levels of the annotations in region r, j=1,…,J. μ_{ sam } is the samplespecific intercept.
Sitespecific binomial regression model
In sitespecific regression models, the mutation probability is modeled in each position of the genome. We enable regression modelling by binning the continuous annotations, such that we are able to sum mutation counts over positions with the same combination of annotations, and thereby reduce the size of the data set. We consider both sitespecific binomial and multinomial regression models.
We model the mutation probability p_{i,sam} at a site i in sample sam of cancer type can. With a logit link, the mutation probability can be modeled by logistic regression:
where x_{i,j} is the value of the jth explanatory variable at site i.
Sitespecific multinomial regression model for strandsymmetric mutation types
We model the mutation probability for different mutation types. Assuming strandsymmetry, we are not distinguishing between e.g. A >G (A to T) mutations and T >C mutations, p^{A>G}=p^{T>C}. We consider the strand with the C or T nucleotide, and the mutation probability matrix is
with only 6 types of mutations.
We model these mutation probabilities by setting up a multinomial logistic regression model, where dummy variables are used to distinguish between mutations from (G:C) base pairs and mutations from (A:T) base pairs. The (G:C) basepairs are modelled with probabilities
for (G:C) position i in sample sam. With J explanatory variables, the mutation probability at G:C position i in sample sam of cancer type can can be written as:
Note that the probability for no mutation \(p_{i, sam}^{{C > C}}\) is the reference. Similarly, (A:T) basepairs are modeled with probabilities
and the reference is the probability for no mutation \(p_{i, sam}^{{T > T}}\). We compare the performance of the three models on 2% of the whole genome data.
The setting for the three models is shown in Fig. 3a. Since we are mainly interested in their predictive performance, we have not included overdispersed models. Overdispersion is a way to model unexplained variance in the data, but it does not qualitatively change the predictions of a model. Each model is trained with replication timing, phyloP, and context information from the reference genome. For the regionbased Poisson model, continuous annotation values are averaged over the selected region and GC percentages are calculated for each region. For the sitespecific models, we use the sitespecific annotations for each site. Continuous values are discretized to simplify the estimation process.
The results are shown in Fig. 3be. We compare the performance of these models at different resolutions using different datasets. In Fig. 3b, we predict the mutation counts in large windows of length 100kb. In Fig. 3c, d, e, we predict the mutation counts in sets of 1000 randomly sampled individual sites. When predicting mutation counts in large windows (100 kb), the three regression models perform similarly. For prediction in a randomly selected small number of sites (1 kb), the two sitespecific models outperform the regionbased model (Fig. 3c). The sitespecific models can capture the mutational heterogeneities between sites and provide a more accurate mutation probability at any resolution. This is in contrast to the regionbased model where a large number of sites are required for accurate predictions. In addition to predicting the probability for mutation events, the multinomial regression model can also predict the probabilities of different mutation types. It outperforms the binomial model by taking the different mutation rates into account (Fig. 3d, e).
Model selection
We consider the sitespecific multinomial regression model to predict mutation probabilities for different mutation types at a single site. We implement a forward model selection procedure to determine the explanatory variables in the final model (Fig. 1). In each step, we add all possible new variables to the previous model in turn and rank the resulting new models. We identify and include the explanatory variable with the best fit and iterate the procedure several times. With forward model selection, we avoid the preparation of large analytical data tables that contain all potential explanatory variables (“Preparation of the analytical data table” section). We choose the deviance loss that measures the predictive performance of the model to assess the fit. By estimating it by crossvalidation, we avoid overfitting, because it assesses the fit on an independent subset of the data (“Cross validation” section).
By construction, the fit improves during the model selection procedure (Fig. 4a). We also evaluate McFadden’s pseudo R^{2} as a measure of the explained variance that is valid in categorical regression models (“McFadden’s pseudo R^{2}” section).
As a reference model, we start out with a single mutation rate for the whole genome in all samples (Model 1; Fig. 4a). This model cannot explain any of the variation in the mutation rate between samples and positions, so McFadden’s pseudo R^{2}=0. After including the six strandsymmetric mutation types in the model (Model 2), we add cancer and sample specific intercepts (Model 3 and Model 4) to make sure that we account for samplespecific mutation rates. In the next step, we include the left and right neighboring basepair for each cancer type (Model 5).
Starting with Model 5, additional annotations are added using forward model selection. We consider the phyloP score, replication timing, expression, genomic segments, GC content in 1 kb, CpG islands, simple repeats and DNase I hypersensitivity. For each of these variables, cancer specific regression coefficients are estimated to allow for differences in the mutational process between cancer types. For expression, we use data directly obtained from matching tumor types, so we also take expression differences between cancer types into account.
The annotation with the largest decrease in the deviance loss function is the phyloP score (Model 6). Subsequently, replication timing (Model 7) and gene expression (Model 8) are added. At this point, we stop the model selection procedure to avoid the timeconsuming creation of larger count tables (see “Preparation of the analytical data table” section for details) as we also see that the improvement of the fit levels off. Detailed results for each step of the forward selection procedure are provided in Additional file 1: Section S2.1. To assess the robustness of the results, we rerun the forward model selection procedure five times on randomly selected regions that cover 2% of the whole genome. The ranking of the variables is constant for all the experiments (Additional file 1: Section S2.2).
Adding the context in model 5 gives a substantial improvement over model 4. When the phyloP score is added in model 6, it can be seen in Fig. 4b that it considerably changes the predicted mutation rate for the same nucleotide triplet at two positions with different phyloP score. While the trinucleotide context and the phyloP score vary on a basepair scale, both replication timing and expression vary on a kilobase scale. Even though much of the perbasepair variation in the mutation rate is already captured in Models 5 and 6, the longrange variation is considerably better explained in Model 7 and Model 8 (Fig. 4b). We can see that adding replication timing in Model 7 considerably changes the predicted mutation rate obtained from Model 6 that is relatively uniform in the 1.1 Mb region shown in the figure, by taking the replication timing gradient in the region into account. Finally, in Model 8, the mutation rate in highly expressed regions is lowered, which again improves the prediction.
Driver detection methods, such as MutSigCV [5] and ncdDetect [9], are generally based on a model of the neutral mutation rate in tumors. We use two typical cancer genes, the oncogene KRAS and the tumor suppressor gene TP53, to contrast the observed mutation pattern in a driver to the predicted neutral mutation rate.
In Additional file 1: Figure S4, it is obvious that the predicted mutation rate is lower at the gene body of KRAS than right outside of it. A zoom onto exon 2 shows a cluster of mutations at positions 25,398,2815, with 25 mutations at position 25,398,2845 which cause a change of protein. This cluster is highly unlikely under our neutral model.
For TP53, we also find a lower predicted mutation rate at the gene body of TP53 and the overlapping gene WRAP53 (Additional file 1: Figure S5). Here, the observed mutations are more spread than in KRAS, but they mainly occur in highly conserved exonic regions, where the neutral model predicts a low mutation rate.
Estimation results
Upon determining the final model from the model selection procedure, we estimate parameters for the multinomial logistic regression model on the remaining 98% of the genome. To study the difference between cancer types, all positionspecific explanatory variables are stratified by cancer type. The coefficients represent multiplicative changes in mutation rate. Our results confirm the large differences in mutation pattern both between samples and cancer types, but also between different genomic and epigenomic regions.
We observe that regions that have been highly conserved during human evolution have a lower mutation rate in all cancer types and for all mutation types, and this difference is nearly always significant (Fig. 5a). For kidney chromophobe and prostate adenocarcinoma, it is reduced to less than half for some of the mutation types. In breast cancer, head and neck squamous cell carcinoma, kidney chromophobe and thyroid carcinoma, this difference is much more pronounced at A:T positions than at G:C positions, but there is no general pattern with respect to the mutation type.
As previously described [5], we nearly always find a positive association between replication timing and mutation rates (see regression lines in Fig. 2c; later replicating regions have more mutations), but the regression coefficient varies significantly between the different cancer types and the mutation types (Fig. 5b). For the dummy variable that distinguishes gene bodies, where expression is measured, from intergenic regions, we see a mixed pattern of insignificant and negative regression coefficients where the mutation rate in gene bodies is reduced to up to one third of the rate in intergenic regions (lowgrade glioma; Fig. 5c). The only exception is melanoma with a slightly increased mutation rate in gene bodies.
Also when we consider regions within gene bodies, we find a reduced mutation probability in highly expressed regions for most cancer types. The most extreme example is the probability of a C > T mutation in highly expressed regions in melanoma, which is only one fifth of the probability in lowly expressed ones (Fig. 5d). Lowgrade glioma, prostate adenocarcinoma and thyroid cancer show the opposite pattern for some of the mutation types, though. However, this effect is dampened by the dummy variables for the gene body that have comparably large coefficients of the opposite sign for these three cancer types.
We find that the mutation rates differ between the different types of mutations, but also between the specific contexts that we consider: CpG, to capture the pattern of spontaneous deamination, and TpCp[AT], to capture the APOBEC signature. We find that the C > T mutation rate is higher in CpG sites than in other sites in all cancer types. In skin cutaneous melanoma, we also observe elevated mutation rate for CC context, which is related to the elevated CC > TT mutation rate due to UV light [12]. We observe elevated rates of mutations that fit the APOBEC pattern in breast cancer, bladder urothelial carcinoma, head and neck squamous cell carcinoma, lung squamous cell carcinoma and skin cutaneous melanoma.
Discussion and conclusions
We use a multinomial logistic regression model to analyse the somatic mutation rate in each position of a cancer genome. We consider various genomic features, such as the local base composition, the functional impact of a region and replication timing. Because of the sitespecific formulation, the model is the most finegrained description of the mutation rate, while it can still take genomic properties into account that vary or are measured on a longer scale, like replication timing and expression levels.
The mutational spectrum and intensity are known to vary considerably between cancer types as well as between patients with the same cancer type [5]. In our analyses, we capture a considerable part of the variance by including cancer type and sample specific mutation rates.
We find the sitespecific phyloP conservation scores to explain more than any of the other explanatory variables we tested. The conservation scores reflect the germline substitution rate between species through evolution and are designed to capture the effect of selection [13]. However, somatic evolution of the cancer genome is not subject to the same constraints as germline evolution. In particular, purifying selection may be relaxed as many genes and regulatory elements that are needed during the organismal life cycle are dispensable for the growth of cancer cells [14]. Instead, the conservation scores’ explanatory power in our study may stem from their correlation with genes and other functional elements, with special properties that affect their mutation rate. In particular, it is known that the mutation rate is elevated at transcription factor binding sites in some cancer types [15]. Furthermore, the phyloP scores are also correlated with expression regions and open chromatin, which both have decreased mutation rates due to transcriptioncoupled repair and other repair mechanisms [16].
The second explanatory variable added is replication timing. This has not only been found for cancer genomes [5], but also for germline mutations [17] and somatic mutations in healthy tissue [18]. In late replicating regions, single stranded DNA (ssDNA), which is susceptible to mutagenic processes like deamination, accumulates [17]. Varying intensities of such mutational processes could explain the differences in the impact of replication timing between the cancer types and the mutation types. The different strengths of the associations between replication timing and the mutation rate can also be attributed to differences in replication timing between cell types [19] that may be missed in the replication timing dataset at use, which is obtained from HeLa cell lines [20].
Only after the phyloP score and replication timing, the cancer type specific expression level is added to the model. This can be explained by the fact that the phyloP score and expression are correlated, as mentioned above, and so are replication timing and expression [21]: proteincoding genes are concentrated in early replicating regions. However, the expression levels that we use are tissue specific and might therefore improve the fit of the model further.
By separately estimating a regression coefficient for the genebody and one for the expression level within gene bodies, we can potentially distinguish the general properties of genes, like location with respect to replication timing origins and sequence composition, and the intensity of transcriptioncoupled repair (TCR). Since TCR is a subpathway of nucleotide excision repair (NER), it is expected to act on helixdistorting mutations like for example the wellknown UV light induced CC > TT mutations in melanoma. Thus, the cancer specific differences that we see might be explained by varying effectiveness of TCR, but also by different proportions of mutations that create bulky distortions. For example, the rate of C > T mutations is further decreased in highly expressed regions in melanoma than the other mutation types.
The parameters can be interpreted as multiplicative mutation rate changes. By using interaction terms, it is straightforward to analyse and test differences in mutation rate between cancer types, samples or specific genomic regions of interest. Furthermore, generalized linear models come with a natural framework for hypothesis testing, e.g. to compare two subcohorts of interest within the same model, like POLE mutated versus POLE wildtype colorectal cancer samples, or NER deficient tumors versus NER proficient ones. It would be interesting to compare our modelbased description of the mutation rate to unsupervised learning of mutational signatures by matrix factorization [3].
In the logistic regression model, we can estimate cancer type specific regression coefficients to capture the differences between cancer types. We can also include tissue specific explanatory variables if there are measurements for the corresponding cancer tissue or matching healthy tissue available. This is of particular interest for epigentic measurements like histone modifications, where the tissue of origin has been shown to be informative [4].
Patientspecific characteristics can also be added as explanatory variables. The age of the patients could be used to study clocklike mutational processes [11]. Known somatic or germline mutations that are associated with specific mutational processes or repair pathways can also be used as explanatory variables, e. g. a germline deletion of APOBEC3B that fuses APOBEC3A with the 3’ UTR of APOBEC3B has been found to be associated with an increased number of APOBECtype mutations [22]. The impact of this mutation could be studied by including an interaction term with the TpCp[AT] positions.
Our model is very flexible and versatile and the explanatory variables can be customized according to different applications. We provide software that facilitates similar positionspecific analyses on new mutation datasets with a userspecified set of genomic annotations as explanatory variables. First, an analytical data table is created for the whole genome or a specified set of genomic regions (implemented in python). This table can be updated with new samples at any time. Then, a multinomial logistic model can be estimated. This is implemented in R and mimics the interface of standard R functions like glm.
An important application of the model is the prediction of the somatic mutation probability under the assumption of neutrality. In our driver detection method ncdDetect [9], we use the positionspecific predictions that we obtain from the multinomial model to evaluate if the mutation rates or their distributions are significantly different from the expectation under neutrality. This allows a flexible analysis of regions of any size. Even noncontiguous regions with very different properties than the overall genomic patterns can be investigated.
Methods
Data
Somatic mutation dataset
We use SNV calls from 505 tumornormal samples across 14 different cancer types [8]. We build our data set based on the UCSC hg19 assembly. We removed regions with low mappability, ultrahigh mutation rates and lacking annotation. Problematic regions for NGS alignments identified for the ENCODE project [23] were subtracted. Low mappability regions, defined by the GEM tool [24] and CRG Alignability track from UCSC with mappability less than 0.5 in 100mers, were also subtracted. Hypermutated genomic segments containing Immunoglobulin/Tcell receptor (IG/TR) genes defined by GENCODE together with 10 kb flanking regions, combined when less than 100 kb in distance, were excluded from analysis. We also excluded sites on ChrX and ChrY, because for some of the annotation files we lack information for one or both of the sex chromosomes.
A total number of 14,200,393 SNVs are observed in the subtracted regions for 505 samples across 14 cancer types.
Genomic element types
We divided the genome into six types of genomic elements: coding, 5’ UTR, 3’UTR, ncRNA, intron and intergenic. Based on the GENCODE v.19 transcripts, coding regions, 5’ UTR regions and 3’ UTR regions as well as introns were defined for proteincoding transcripts. Noncoding RNA regions were defined as all remaining regions in the full transcript set. All remaining bases were categorized as intergenic.
GC content
We calculate the percentage of G:C base pairs in 1 kb windows based on the reference genome. Regions with GC percentage less than 10% are annotated with value 0. Other regions are discretized into quartiles.
CpG islands
We segmented the genome by presence or absence of CpG islands. The CGI Mountain annotation from CgiHunter (http://cgihunter.bioinf.mpiinf.mpg.de/) was used. The CGI Mountain score quantifies if a region is a CpG island. Scores above zero indicate a CpG island. We use a dummy variable derived from the CGI Mountain score in our analysis indicating whether the CGI Mountain score is larger than zero.
Simple repeats
We annotated the simple repeat regions in the genome according to RepeatMasker (http://www.repeatmasker.org), which defines the interspersed and lowcomplexity repeats in hg19. We use a dummy variable to indicate whether a genomic site is in a region masked as simple repeats.
DNase I peaks
We defined DNase I peaks according to the DNase I annotation from the Roadmap Epigenomics project [25]. We use the score from HoneyBadger2 to indicate the DNase I signal strength for regions with a DNase I peak (http://www.broadinstitute.org/~meuleman/reg2map/). The regions not annotated in the HoneyBadger2 were masked as “no peak” regions in our analysis. The values of peaks are approximatedly between 0.4 and 1, with 1 being highly accessible.
PhyloP score
The conservation score phyloP (phylogenetic pvalues) is part of the PHAST package (http://compgen.bscb.cornell.edu/phast/). We used the score from the multiple alignments of 99 vertebrate genomes to the human genome [13]. We use the version of phyloP100way which covers 99.8% of the subtracted regions.
Replication timing
We adjust the replication timing data from [20] to the hg19 assembly. The replication timing values range from 0 to 1, indicating earlier to later replicating regions. The replication timing annotation covers 91.2% of the subtracted regions.
Gene expression
We define the gene expression level according to TCGA RNAseq expression data. Expression data was \(\log _{2}(x+1)\) transformed. For each cancer type, the median expression was calculated for all genes. If multiple annotations of a gene exist, the longest annotation is used. For overlapping genes, the expression is a cumulative sum.
As in [8], we collapse colon (COAD) and rectal carcinoma (READ) to a joint cancer type CRC by averaging across the expression values.
Preparation of the analytical data table
In order to facilitate the model fitting procedure, we summarized the genomic data into count tables. In Poisson count models, each row in the count table represents a predefined region. For continuous explanatory variables, such as replication timing, the annotations are averaged over the region. For categorical explanatory variables, the annotations are transformed to the percentage for different levels of the variables, e.g, the binary explanatory variable indicating whether the site is a (G:C) base pair or not is transformed to GC content of the window. In sitespecific regression models, we discretized continuous variables into bins according to quartiles or quintiles. Each row in the count table represents the counts of mutations under a certain combination of levels of all the explanatory variables. As we are summarizing the whole genome in the count table, we expect to see all the combinations of the levels for all the explanatory variables. Thus, the sizes of the count tables grow significantly when adding new explanatory variables. The generation of the count tables also takes much longer time with more explanatory variables. Because of the space and time consumption, a large count table including all the explanatory variables is computationally infeasible. We implemented the forward model selection procedure to avoid many huge count tables. For each iteration in the forward model selection procedure, we made new count tables only for the selected sets of explanatory variables from previous step and one new candidate explanatory variable. We then added the best candidate in the explanatory variable set and repeated for the next step. We used 2% of the whole genome in the model selection and made a final count table with the remaining 98% sites only for the explanatory variables that were determined from the model selection procedure. The generation of the count table for the final model takes 6000 CPU hours on our cluster (2.5 GHz CPUs).
Multinomial regression model
Estimation
The observations in the regression model are indexed by the genomic position i(1≤i≤2.56·10^{9}) for all positions on chr1 to chr22 after excluding problematic regions and the samples sam(1≤sam≤505), so the total number of observations is 1.3·10^{12}. Starting from Model 4, sample specific intercepts for the six mutation types sum up to 6·505=3030 parameters. The regression coefficients for explanatory variables are indexed by the 14 cancer types. In Model 5 each cancer type has 4·4·6=96 parameters for the neighboring context and the mutation type (4 for the left neighboring site, 4 for the right neighboring site and 6 for the mutations), so the total number of parameters in this model is 6·505+14·96=4374. From Model 6 to Model 8 we use 1 parameter for each of the continuous explanatory variables phyloP, replication timing and expression level. We also add 1 dummy variable for expression level indicating whether the given site is potentially expressed or not. In Model 6 we have 4374+14·6=4458 parameters. In Model 7 we have 4458+14·6=4542 parameters. In the final model we have 6·505+14·96+14·(3+1)·6=4710 parameters in total. To reduce memory usage and computation time, the parameters are estimated in three separate binary logistic regression models, but the variancecovariance matrix of the parameters is estimated for the multinomial model [26]. We have implemented an Rpackage for the estimation [27] that is based on the function glm4 from the contributed package MatrixModels [28]. It also includes estimation of the variancecovariance matrix using the package Matrix for efficient handling of large and sparse matrices [29].
Dirichlet prior and pseudo counts
If a model with many explanatory variables and interaction terms among them is estimated (e.g. sampleID × neighbors × strong), it can easily occur that for a certain combination of levels of categorical variables, there have been no mutations of a certain type observed. This case is especially likely, if the sampleID is involved. This causes numerical problems in the maximum likelihood estimation [30].
To solve this problem while still obtaining a positive mutation probability, we add pseudo counts to the observed mutation counts. This is equivalent to specifying a Dirichlet prior for the multinomial model, so it leads to the same point estimates as the posterior mean would in a DirichletMultinomial model [31].
To reduce the impact of the pseudo counts, we only add them to combinations of levels with no observed mutations and we do not use a uniform distribution, but let them be proportional to the observed mutation counts from all samples of the corresponding cancer type. The mutation counts from the sample and from the cancer type are equally weighted.
Let \(\textbf {n}_{K, \text {sam}} = \left (n_{K, \text {sam}}^{C>A}, n_{K, \text {sam}}^{C>G}, n_{K, \text {sam}}^{C>C}, n_{K, \text {sam}}^{C>T}\right)\) be the number of each type of mutation for the combination of levels K in sample sam with at least one of the mutation counts being zero, and \(\textbf {n}_{K,\text {can}} = \left (n_{K, \text {can}}^{C>A}, n_{K, \text {can}}^{C>G}, n_{K,\text {can}}^{C>C}, n_{K, \text {can}}^{C>T}\right)\) be the mutation counts for combination K for all samples of cancer type can. Let N_{K,sam} and N_{K,can} be the total count of mutations of category K for the sample and cancer type, respectively. Then, the new count vector \(\tilde {\textbf {n}}_{K, \text {sam}}\) is defined as
Consequently, the number of mutations per sample is preserved. The new mutation counts can be nonintegers, but this can be handled by our implementation.
McFadden’s pseudo R ^{2}
To assess the fit of a model, we report McFadden’s pseudo R^{2}
where L_{ M } is the likelihood of the model under investigation and L_{0} is the likelihood of a model with no predictors [32]. To measure the improvement of a model in comparison with the binomial model where there is no distinction between the mutation types, we use the same mutation probability for each mutation type in the model without predictors.
Forward variable selection
To speed up data preparation, forward variable selection is conducted on a subset of approximately 2% of the genome, which is constructed by randomly selecting 60,000 windows of size 1 kb.
The cancer types kidney chromophobe, lowgrade glioma, prostate adenocarcinoma and thyroid carcinoma have very low mutation counts, so they are disregarded during variable selection. We use crossvalidation for forward variable selection. The improvement of the fit along the model selection procedure is measured with the deviance loss.
Starting with Model 5,
additional terms of the form
with explanatory variable j∈ {phyloP, replication timing, genomic segment, expression, GC content, DNase 1, simple repeats, CpG island}, are selected following the forward selection scheme.
To test the robustness of the selected model, the forward variable selection procedure is repeated five times.
Cross validation
An assessment of the fit of a model can be obtained by cross validation. Fivefold cross validation is used to select the annotation with the highest explanatory power. To this end, the 1 kb windows of the variable selection subset are divided randomly among 5 sets. In turn, 4 of them are joined as the training set, on which the multinomial model is estimated, and the remaining set is used as the validation set to estimate the loss function.
The deviance loss function for the observed site i in sample sam is defined as
where k denotes all possible mutation events at site i, y_{i,sam} the observed event at site i in sample sam and \(\hat {p}(y_{i, sam}=k)\) the probability of event k estimated under the multinomial regression model [33]. Thus, it measures the prediction accuracy of the multinomial regression model. The total deviance loss of a model is
with N being the number of genomic positions.
Abbreviations
 NER:

Nucleotide excision repair
 TCR:

Transcriptioncoupled repair
References
 1
Stratton MR, Campbell PJ, Futreal PA. The cancer genome. Nature. 2009; 458(7239):719–24.
 2
Bacolla A, Cooper DN, Vasquez KM. Mechanisms of base substitution mutagenesis in cancer genomes. Genes. 2014; 5(1):108–46.
 3
Alexandrov LB, NikZainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, BørresenDale AL, et al. Signatures of mutational processes in human cancer. Nature. 2013; 500(7463):415–21.
 4
Polak P, Karlic R, Koren A, Thurman R, Sandstrom R, Lawrence MS, Reynolds A, Rynes E, Vlahovicek K, Stamatoyannopoulos JA, Sunyaev SR. Celloforigin chromatin organization shapes the mutational landscape of cancer. Nature. 2015; 518(7539):360–4.
 5
Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al. Mutational heterogeneity in cancer and the search for new cancerassociated genes. Nature. 2013; 499(7457):214–8.
 6
Lochovsky L, Zhang J, Fu Y, Khurana E, Gerstein M. LARVA: An integrative framework for largescale analysis of recurrent variants in noncoding annotations. Nucleic Acids Res. 2015; 43(17):8123–34.
 7
Melton C, Reuter JA, Spacek DV, Snyder M. Recurrent somatic mutations in regulatory regions of human cancer genomes. Nat Genet. 2015; 47(7):710–6.
 8
Fredriksson NJ, Ny L, Nilsson JA, Larsson E. Systematic analysis of noncoding somatic mutations and gene expression alterations across 14 tumor types. Nat Genet. 2014; 46(12):1258–63.
 9
Juul M, Bertl J, Guo Q, Nielsen MM, Świtnicki M, Hornshøj H, Madsen T, Hobolth A, Pedersen JS. Noncoding cancer driver candidates identified with a sampleand positionspecific model of the somatic mutation rate. eLife. 2017; 6:21778.
 10
Shinbrot E, Henninger EE, Weinhold N, Covington KR, Göksenin AY, Schultz N, Chao H, Doddapaneni H, Muzny DM, Gibbs RA, Sander C, Pursell ZF, Wheeler DA. Exonuclease mutations in DNA polymerase epsilon reveal replication strand specific mutation patterns and human origins of replication. Genome Res. 2014; 24(11):1740–50.
 11
Alexandrov LB, Jones PH, Wedge DC, Sale JE, Campbell PJ, NikZainal S, Stratton MR. Clocklike mutational processes in human somatic cells. Nat Genet. 2015; 47:1402–7.
 12
Rass K, Reichrath J. UV damage and DNA repair in malignant melanoma and nonmelanoma skin cancer In: Reichrath J, editor. Sunlight, Vitamin D and Skin Cancer. New York: Springer: 2008. p. 162–178.
 13
Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010; 20(1):110–121.
 14
De Silva DR, Nichols R, Elgar G. Purifying selection in deeply conserved human enhancers is more consistent than in coding sequences. PloS ONE. 2014; 9(7):103357.
 15
Sabarinathan R, Mularoni L, DeuPons J, GonzalezPerez A, LopezBigas N. Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature. 2016; 532(7598):264–7.
 16
Su D, Wang X, Campbell MR, Song L, Safi A, Crawford GE, Bell DA. Interactions of chromatin context, binding site sequence content, and sequence evolution in stressinduced p53 occupancy and transactivation. PLoS Genet. 2015; 11(1):1004885.
 17
Stamatoyannopoulos JA, Adzhubei I, Thurman R, Kryukov GV, Mirkin SM, Sunyaev SR. Human mutation rate associated with DNA replication timing. Nat Genet. 2009; 41(4):393–5.
 18
Koren A, Polak P, Nemesh J, Michaelson JJ, Sebat J, Sunyaev SR, McCarroll SA. Differential relationship of DNA replication timing to different forms of human mutation and variation. Am J Hum Genet. 2012; 91(6):1033–40.
 19
Hansen RS, Thomas S, Sandstrom R, Canfield TK, Thurman R, Weaver M, Dorschner MO, Gartler SM, Stamatoyannopoulos JA. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proc Natl Acad Sci. 2010; 107(1):139–44.
 20
Chen CL, Rappailles A, Duquenne L, Huvet M, Guilbaud G, Farinelli L, Audit B, d’AubentonCarafa Y, Arneodo A, Hyrien O, Thermes C. Impact of replication timing on nonCpG and CpG substitution rates in mammalian genomes. Genome Res. 2010; 20(4):447–57.
 21
Rhind N, Gilbert DM. DNA replication timing. Cold Spring Harb Perspect Biol. 2013; 5(8):a010132.
 22
NikZainal S, Wedge DC, Alexandrov LB, Petljak M, Butler AP, Bolli N, Davies HR, Knappskog S, Martin S, Papaemmanuil E, et al. Association of a germline copy number polymorphism of APOBEC3A and APOBEC3B with burden of putative APOBECdependent mutations in breast cancer. Nat Genet. 2014; 46(5):487–91.
 23
The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74.
 24
Derrien T, Estellé J, Sola SM, Knowles DG, Raineri E, Guigó R, Ribeca P. Fast computation and applications of genome mappability. PloS ONE. 2012; 7(1):e30377.
 25
Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature. 2015; 518:317–30.
 26
Begg CB, Gray R. Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika. 1984; 71(1):11–8.
 27
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2014. R Foundation for Statistical Computing. http://www.rproject.org/.
 28
Bates D, Maechler M. MatrixModels: Modelling with Sparse And Dense Matrices. 2015b. R package version 0.41. http://cran.rproject.org/package=MatrixModels.
 29
Bates D, Maechler M. Matrix: Sparse and Dense Matrix Classes and Methods. 2015a. R package version 1.22. http://cran.rproject.org/package=Matrix.
 30
Agresti A. Categorical Data Analysis, 2nd ed. Wiley Series in Probability and Statistics. New York: Wiley; 2002.
 31
Durbin R, Eddy SR, Krogh A, Mitchison G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge: Cambridge University Press; 1998.
 32
McFadden D. Conditional logit analysis of qualitative choice behaviour In: Zarembka P, editor. Frontiers in Econometrics. New York: Academic Press: 1974.
 33
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer Series in Statistics. New York: Springer New York Inc.; 2001.
Acknowledgements
Computations were conducted on the GenomeDK HPC cluster at Aarhus University.
Funding
This work was supported by the Danish Councils for Independent Research (the Medical Sciences (FSS) Sapere Aude program to JSP and grant DFF 400200382 to AH). The funding body had no role in the design of the study, the collection, analysis, and interpretation of data or in writing the manuscript.
Availability of data and materials
The count table for the selected explanatory variables and the parameters estimated for the final model are available at https://figshare.com/projects/MultinomialMutations/24685.
The software to prepare a count table from a set of annotations and mutation data and for the multinomial regression analysis is available at https://github.com/MultinomialMutations.
Author information
Affiliations
Contributions
JB, QG, MJ, SB, AH and JSP designed the logistic regression framework. JB, QG, MJ, SB, MMN, HH and JSP collected the datasets for explanatory variables and mutation calls. JB, QG and MJ analyzed the datasets. All authors contributed to the interpretation of the results. JB and QG drafted the manuscript. AH and JSP revised the manuscript critically for important updates. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
Jakob Skou Pedersen and Asger Hobolth are joint first authors.
Additional file
Additional file 1
Section S1POLE mutation patterns in colon cancer samples. Section S2 Detailed forward model selection results. Section S3 Mutation patterns in two cancer genes. (PDF 2222 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Bertl, J., Guo, Q., Juul, M. et al. A site specific model and analysis of the neutral somatic mutation rate in wholegenome cancer data. BMC Bioinformatics 19, 147 (2018). https://0doiorg.brum.beds.ac.uk/10.1186/s1285901821412
Received:
Accepted:
Published:
DOI: https://0doiorg.brum.beds.ac.uk/10.1186/s1285901821412
Keywords
 Multinomial logistic regression
 Sitespecific model
 Somatic cancer mutations