DNA methylation arrays as surrogate measures of cell mixture distribution

Houseman, Eugene Andres; Accomando, William P; Koestler, Devin C; Christensen, Brock C; Marsit, Carmen J; Nelson, Heather H; Wiencke, John K; Kelsey, Karl T

doi:10.1186/1471-2105-13-86

Research article
Open access
Published: 08 May 2012

BMC Bioinformatics volume 13, Article number: 86 (2012)

Abstract

Background

There has been a long-standing need in biomedical research for a method that quantifies the normally mixed composition of leukocytes beyond what is possible by simple histological or flow cytometric assessments. The latter is restricted by the labile nature of protein epitopes, requirements for cell processing, and timely cell analysis. In a diverse array of diseases and following numerous immune-toxic exposures, leukocyte composition will critically inform the underlying immuno-biology to most chronic medical conditions. Emerging research demonstrates that DNA methylation is responsible for cellular differentiation, and when measured in whole peripheral blood, serves to distinguish cancer cases from controls.

Results

Here we present a method, similar to regression calibration, for inferring changes in the distribution of white blood cells between different subpopulations (e.g. cases and controls) using DNA methylation signatures, in combination with a previously obtained external validation set consisting of signatures from purified leukocyte samples. We validate the fundamental idea in a cell mixture reconstruction experiment, then demonstrate our method on DNA methylation data sets from several studies, including data from a Head and Neck Squamous Cell Carcinoma (HNSCC) study and an ovarian cancer study. Our method produces results consistent with prior biological findings, thereby validating the approach.

Conclusions

Our method, in combination with an appropriate external validation set, promises new opportunities for large-scale immunological studies of both disease states and noxious exposures.

Background

The biology of the development of any multisystem life form is fundamentally grounded in systematic cellular differentiation. This is essentially defined by lineage commitment of cells whose origin can be traced to a pluripotent progenitor and is marked by mitotically heritable epigenetic changes that reflect complex transcriptional programming of gene expression within the individual cell[1–3]. One such epigenetic mark is DNA methylation, which is tightly associated with alterations in the nucleosome DNA scaffold (and hence chromatin) that is responsible for coordination of gene expression in individual cells[1–3]. It is now appreciated that differentially methylated DNA regions (DMRs) distinguish cell lineages with high sensitivity and specificity[4] and considerable research is now underway to delineate precise DMRs that define and specify a particular cell lineage. The most developed understanding of epigenetic markers of lineage commitment to date is perhaps that of immune cell subclasses defined by populations of distinct circulating blood cells[5, 6].

Pluripotent hematopoietic stem cells residing in the bone marrow continually give rise to the entire hierarchy of blood cell subclasses through a developmental process known as hematopoiesis. Leukocytes, commonly called white blood cells, are critical in the host response to pathogens and foreign antigens and are divided into two compartments, the myeloid lineage and lymphoid lineage (also called lymphocytes). The composition of leukocyte populations is well known to reflect disease states and toxicant exposures and can be altered by signaling cascades that prompt migration of whole classes of cells into or out of tissues. Several DMRs that serve as reliable biomarkers of individual human white blood cell types have already been identified[5, 6]. Individual assays identifying cell-specific DMRs have proven useful for quantifying individual cell types in human tissues and peripheral blood. However, these assays are limited to detecting the relative proportion of one individual cell type compared with all others. On the other hand, simultaneous quantification of fluctuation in overall lymphocyte population composition can be accomplished only by using methods based on flow cytometry, which require large volumes of fresh blood and involve laborious antibody tagging. Hence, an approach that allows for the simultaneous quantification of the entire distribution of cell types, using an array of biomarkers based on generally available technology, would be considerably more informative, especially in studies of human disease and exposures.

In some instances, it is generally the overall balance of leukocyte subclasses in circulation or tissue that most prominently influences pathogenesis. For example, although incipient cancer cells are recognized and eliminated by cytotoxic T-cells (CTLs) and natural killer (NK) cells, tumorigenesis is also promoted by certain other inflammatory cells, including B-lymphocytes, mast cells, neutrophils, regulatory T-cells (Tregs), and numerous others. All of these cells have been shown to promote angiogenesis, tumor cell proliferation, tissue invasion and metastasis[7, 8]. Likewise, while higher levels of NK cells and CTLs circulating in the blood and residing in adipose tissues are associated with lower incidence of metabolic diseases such as type II diabetes[9], higher levels of M1 macrophages in adipose tissue can induce inflammation and insulin resistance[10]. These examples illustrate incredible potential for methods of quantifying the composition of lymphocyte populations to critically inform the underlying immuno-biology of disease states as well as the immune response to almost all chronic medical conditions. In addition, they offer great potential for predicting therapeutic outcomes[11].

Here we employ the concept of DMRs as markers of immune cell identity using a high-density methylation platform, and propose a set of analytical tools, none of which requires fresh cells, for estimating the proportions of immune cells in unfractionated whole blood. The backbone of the approach is the DNA methylation signature of each of the principal immune components of whole blood (B cells, granulocytes, monocytes, NK cells, and T cell subsets). We essentially seek a form of regression calibration, where we consider a methylation signature to be a high-dimensional multivariate surrogate for the distribution of white blood cells. In turn, this distribution is of interest for predicting or modeling disease states. As a surrogate, the DNA methylation signature is assumed to be a highly correlated, yet imperfect, measure of leukocyte distribution, and thus fits into the framework of measurement error models, where the use of a noisy surrogate marker to investigate an association with a disease outcome of interest results in biased estimates, unless internal or external validation data can be obtained to “calibrate” the model and correct the bias[12]. However, in this case, the problem is complicated by the extremely high dimension of the surrogate, so we propose an alternative to the traditional regression-calibration procedure that circumvents these complications but still allows us to extract the desired biological information.

We note that since we began this work, a small number of authors have published similar deconvolution algorithms using gene expression data[13–15]. The techniques are similar to the quadratic programming method we describe below in Methods for deconvolving a single sample, but none comprehensively addresses statistical properties or employs data from DNA methylation.

Methods

In this section we describe our proposed statistical methods, the data sets used to demonstrate their utility, and finally the design of simulation studies we have conducted to investigate statistical properties of our proposed algorithms.

Statistical methods

Let Y_0h be an m × 1 vector of methylation assay values, e.g. average beta values from an Infinium bead-array product, corresponding to a purified blood sample consisting of a homogeneous cellular population (e.g. monocytes or granulocytes), with the qualitative characterization of cell type (among d₀ such types) indicated by a d₀ × 1 covariate vector w_0h. Here, h ∈ {1,…,n₀}, where n₀ is the number of purified specimens and the m individual values correspond to CpG sites on a DNA methylation microarray, possibly pre-selected to correspond to putative DMRs for distinguishing different cellular types. Correspondingly, let Y_1i be an m × 1 vector of methylation assay values for the same CpG sites (in the same order) as Y_0h, but corresponding to a heterogeneous mixture of cells (e.g. peripheral whole blood) from a human subject. Here, i ∈ {1,…,n₁}, n₁ is the number of target specimens, and z_1i is a d₁ × 1 covariate vector representing phenotypes or exposures corresponding to the subject, e.g. d₁ = 2 for a simple case/control study without confounders. Our goal is to understand the associations between Y_1i and z_1i in terms of associations between Y_0h and w_0h, i.e. to infer changes in mixtures of cell types associated with phenotypes or exposures, using DNA methylation as a surrogate measure of cell mixture. Thus, we have two data sets, S₀ = {(Y₀₁,w₀₁),…,(Y_0n₀,w_0n₀)}, the set of data from “purified” cell samples effectively representing external validation or gold-standard data, and S₁ = {(Y₁₁,z₁₁),…,(Y_1n₁,z_1n₁)}, representing surrogate data collected from a target population. To this end, we posit the following linear models:

\begin{array}{l} Y_{0 h} = B_{0} w_{0 h} + e_{0 h} \\ Y_{1 i} = B_{1} z_{1 i} + e_{1 i}, \end{array}

(1)

where B₀ and B₁ are, respectively, m × d₀ and m × d₁ matrices and e_0h and e_1i are error vectors. For simplicity we assume a one-way ANOVA parameterization for w, though in Additional file 1 we describe slight generalizations to account for design complications met in practice. We also assume a reasonable regression parameterization for z, including an intercept, and for convenience, denote the first column of B₁ as μ₁, the m × 1 intercept. The error vectors e_0h and e_1i may reflect independence among arrays h and i, or else may have more complex random effects structure accounting for technical effects or biological replication; however, their substructures are incidental to this analysis, with the exception of the fine details of the bootstrap procedure proposed below.
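As a minimal sketch of the first line of model (1), assuming ordinary least squares and a cell-means (one-way ANOVA) coding of w, the following Python/NumPy code estimates B₀ from synthetic purified-sample data. The sizes, noise level, random seed, and variable names are illustrative assumptions and are not taken from the data sets described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): CpG sites, cell types, purified samples.
m, d0, n0 = 6, 3, 12
B0_true = rng.uniform(0.05, 0.95, size=(m, d0))   # hypothetical mean profiles

# Cell-means coding: column h of W is the indicator vector w_0h.
cell_type = np.repeat(np.arange(d0), n0 // d0)
W = np.eye(d0)[:, cell_type]                      # d0 x n0

# Y_0h = B0 w_0h + e_0h, stacked column-wise into an m x n0 matrix.
Y0 = B0_true @ W + rng.normal(scale=0.01, size=(m, n0))

# OLS: solve W^T B0^T ~ Y0^T for B0^T, then transpose back to m x d0.
B0_hat = np.linalg.lstsq(W.T, Y0.T, rcond=None)[0].T
```

Under this coding each column of `B0_hat` is simply the mean profile of the purified samples of one cell type; a mixed effects fit with chip effects, as used later in the paper, would replace the `lstsq` call.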

To implement a surrogacy relation, we propose the following linking regression model:

B_{1} = 1_{m} γ_{0}^{T} + B_{0} Γ + U,

(2)

where Γ is a d₀ × d₁ matrix that summarizes associations between the rows of B_0j and B_1i and U is a matrix of errors. Substituting equation (2) into (1), writing B₀ = (b₀₁,…,b_0d₀) explicitly in terms of its columns and writing $Γ^{T} = (γ_{1}, \dots, γ_{d_{0}})$, it follows that

Y_{1 i} = \sum_{l = 1}^{d_{0}} b_{0 l} (γ_{l}^{T} z_{1 i}) + (1_{m} γ_{0}^{T} + U) z_{1 i} + e_{1 i} .

(3)

To impart a biological interpretation, we assume that the DNA assayed in S₁ arises as a mixture of DNA from cell types profiled in S₀, with mixture coefficients whose population averages, conditional on z, are $\{ω_{1}^{(z)}, \dots, ω_{d_{0}}^{(z)}\}$, so that

E (Y_{1 i} | z_{1 i} = z) = ξ^{(z)} + \sum_{l = 1}^{d_{0}} b_{0 l} ω_{l}^{(z)},

(4)

where the m × 1 vector ξ^(z) represents cell types excluded from consideration among the purified samples in S₀, or else non-cell-specific methylation, including alterations at the molecular level in the maintenance of DNA methylation patterns themselves (possibly exposure-, age-, or disease-related). It follows from (3) and (4) that the mixture coefficients are recoverable from Γ, $ω_{l}^{(z)} = γ_{l}^{T} z$, provided ξ^(z) is orthogonal to the column space of B₀. As we discuss in detail in Additional file 1, bias can arise if differences in ξ^(z) between distinct values of z have nonzero projection onto the column space of B₀, although the magnitude of anticipated biases can be assessed through sensitivity analysis.
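The recoverability condition can be illustrated numerically. The sketch below, in Python/NumPy with synthetic values, builds a residual vector ξ that is orthogonal to the column space of B₀ and checks that least-squares projection of the mean profile in (4) returns the mixture coefficients. The profile matrix, the weights in `omega`, and the seed are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

m, d0 = 40, 3
B0 = rng.uniform(0.05, 0.95, size=(m, d0))   # hypothetical purified profiles
omega = np.array([0.6, 0.3, 0.1])            # hypothetical mixture coefficients

# Make xi orthogonal to the column space of B0 by removing the
# projection of a random vector onto that space.
r = rng.normal(scale=0.05, size=m)
xi = r - B0 @ np.linalg.lstsq(B0, r, rcond=None)[0]

# Mean profile as in equation (4), then project back onto B0.
mean_profile = xi + B0 @ omega
omega_hat = np.linalg.lstsq(B0, mean_profile, rcond=None)[0]
```

Because ξ contributes nothing to the projection, `omega_hat` matches `omega` up to floating-point error; a ξ with a component inside the column space of B₀ would shift the recovered coefficients.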

It is possible to assign interpretations to the components of variation in (3). Let SS_o represent overall variability in Y_1i, i.e. $SS_{o} = \sum_{i = 1}^{n_{1}} ∥ Y_{1 i} - {\bar{μ}}_{1} ∥^{2}$, where ${\bar{μ}}_{1} = E (Y_{1 i})$. From multivariate probability theory it is straightforward to show that SS_o = SS_e + SS_v + SS_u, where $SS_{e} = \sum_{i = 1}^{n_{1}} ∥ e_{1 i} ∥^{2}$, $SS_{v} = \sum_{i = 1}^{n_{1}} {(z_{1 i} - {\bar{z}}_{1})}^{T} Γ^{T} B_{0}^{T} B_{0} Γ (z_{1 i} - {\bar{z}}_{1})$, and $SS_{u} = \sum_{i = 1}^{n_{1}} \left\{ {(z_{1 i} - {\bar{z}}_{1})}^{T} U^{T} U (z_{1 i} - {\bar{z}}_{1}) + m {(z_{1 i} - {\bar{z}}_{1})}^{T} γ_{0} γ_{0}^{T} (z_{1 i} - {\bar{z}}_{1}) \right\}$. SS_e measures variation unexplained by the covariates z_1i, presumed to represent a combination of technical noise and unsystematic biological heterogeneity. SS_v measures variability explained by mixtures of profiles in the set S₀, while SS_u measures variability in systematic biological heterogeneity that nevertheless remains unexplained by mixtures of profiles in S₀, presumably due to some process other than differences in mixtures of cell types. Thus we propose two partial coefficient of determination measures: $R_{1, 0}^{2} = SS_{v} / SS_{o}$, which represents the proportion of total variation in S₁ explained by S₀, and $R_{1, 1}^{2} = SS_{v} / (SS_{o} - SS_{e})$, which represents the proportion of systematic variation in S₁ explained by S₀. Note that $R_{1, 1}^{2}$ is poorly defined when SS_o ≈ SS_e.

Estimation proceeds by applying an appropriate linear model, e.g. ordinary least squares, linear mixed effects models[16], limma[17], or surrogate variable analysis[18, 19], to obtain estimates ${\hat{B}}_{0}$ and ${\hat{B}}_{1}$. Estimates of γ₀ and Γ are then obtained by projecting ${\hat{B}}_{1}$ onto the column space of ${\tilde{B}}_{0} = (1_{m}, B_{0})$, as described in detail in Additional file 1. Standard errors can be obtained in one of three ways. The simplest estimator, SE₀, is the “naive” estimator from simple least-squares theory, ignoring the fact that ${\hat{B}}_{0}$ and ${\hat{B}}_{1}$ are estimates, i.e. potentially variable. To account for variation in estimating ${\hat{B}}_{1}$, a simple alternative is to use a nonparametric bootstrap procedure. For each bootstrap iteration t, we sample with replacement from S₁ (or sample errors in a manner consistent with a hierarchical experimental design) to obtain $S_{1}^{(t)}$, producing bootstrap estimates ${\hat{B}}_{1}^{(t)}$ from which “single-bootstrap” standard errors SE₁ are computed. Finally, it is possible to account for variation in estimating ${\hat{B}}_{0}$ by also bootstrapping S₀; because of potentially small sample sizes n₀, we propose using a parametric bootstrap. A “double-bootstrap” standard error estimator, SE₂, is computed from these two sets of bootstraps. The double-bootstrap has the additional benefit over the single-bootstrap that it can be used to assess bias due to measurement error (variability) in ${\hat{B}}_{0}$. Estimation details are provided in Additional file 1, as are the results of simulation studies.
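The projection step and the single-bootstrap standard error can be sketched as follows in Python/NumPy. This is a simplified illustration under stated assumptions: B₀ is treated as known, B₁ is fit by ordinary least squares, subjects are resampled independently (no chip or hierarchical structure), and all data, dimensions, and the helper name `estimate_gamma` are synthetic or hypothetical. It is not the estimation code from Additional file 1.

```python
import numpy as np

rng = np.random.default_rng(4)

m, d0, n1 = 50, 3, 80
B0 = rng.uniform(0.05, 0.95, size=(m, d0))            # hypothetical profiles
B0_tilde = np.column_stack([np.ones(m), B0])          # (1_m, B0)
Gamma = np.array([[0.6, 0.05],
                  [0.3, -0.03],
                  [0.1, -0.02]])                      # d0 x d1, hypothetical

# Covariates: intercept and case indicator (d1 = 2).
Z = np.column_stack([np.ones(n1), np.tile([0.0, 1.0], n1 // 2)])
Y1 = B0 @ Gamma @ Z.T + rng.normal(scale=0.02, size=(m, n1))

def estimate_gamma(Y, Zmat):
    """OLS fit of B1, then projection onto the column space of (1_m, B0)."""
    B1_hat = np.linalg.lstsq(Zmat, Y.T, rcond=None)[0].T     # m x d1
    coef = np.linalg.lstsq(B0_tilde, B1_hat, rcond=None)[0]  # (d0 + 1) x d1
    return coef[1:]                                          # drop gamma_0 row

Gamma_hat = estimate_gamma(Y1, Z)

# Single bootstrap: resample subjects with replacement, re-estimate Gamma.
T = 200
boot = np.empty((T,) + Gamma_hat.shape)
for t in range(T):
    idx = rng.integers(0, n1, size=n1)
    boot[t] = estimate_gamma(Y1[:, idx], Z[idx])
SE1 = boot.std(axis=0, ddof=1)
```

A double-bootstrap along the lines described above would additionally redraw the purified profiles inside the loop, so that `B0_tilde` varies across iterations.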

Beyond bias due to measurement error, which is easily corrected using the double-bootstrap procedure, there are additional sources of potential bias. For example, consider a univariate z_1i representing case/control status, where $δ \equiv ξ^{(1)} - ξ^{(0)} = B_{0} α$ for some d₀ × 1 vector α ≠ 0; i.e. δ is the mean difference in DNA methylation between a case and a control, contributed by cell mixtures that remain uncharacterized or by non-cell-specific methylation. In such a situation, there will be a bias equal to α in estimating the mixture differences. Additional file 1 provides a detailed analysis of such biases, and proposes a sensitivity analysis procedure for assessing the magnitude of possible bias in a given data set.
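The origin of this bias can be seen in a short worked step from equation (4). The symbol Δω, denoting the d₀ × 1 vector of true case/control differences $ω_{l}^{(1)} - ω_{l}^{(0)}$, is introduced here only for illustration and does not appear elsewhere in the paper.

```latex
\begin{aligned}
E (Y_{1 i} | z_{1 i} = 1) - E (Y_{1 i} | z_{1 i} = 0)
  &= \left( ξ^{(1)} - ξ^{(0)} \right)
     + \sum_{l = 1}^{d_{0}} b_{0 l} \left( ω_{l}^{(1)} - ω_{l}^{(0)} \right) \\
  &= B_{0} α + B_{0} Δω \\
  &= B_{0} (Δω + α) .
\end{aligned}
```

Provided $(1_{m}, B_{0})$ has full column rank, projecting this difference onto its column space returns Δω + α rather than Δω, so the estimated mixture difference is offset by exactly α.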

While the focus of this paper is analysis of population data, it is possible to use S₀ to predict the distribution of leukocytes in a single sample having DNA methylation profile Y^∗. Equating the intercept term of B₁ in (1) with Y^∗ and applying (2), we obtain mixing proportion estimates $Γ^{*} = {({\tilde{B}}_{0}^{T} {\tilde{B}}_{0})}^{- 1} {\tilde{B}}_{0}^{T} Y^{*}$. Estimates can be further refined with the use of quadratic programming techniques[20], restricting the components of Γ^∗ to satisfy γ_l^∗ ≥ 0 in minimizing $∥ Y^{*} - {\tilde{B}}_{0} Γ^{*} ∥^{2}$ with respect to Γ^∗. Such individual projections of methylation profiles on the column space spanned by S₀ facilitate the application of the fundamental ideas proposed above to individual, clinically based diagnostic procedures. Note, however, that DNA methylation arrays are typically focused on the comparison of methylated to unmethylated CpG dinucleotides, not quantifying actual amounts of DNA. Therefore, information on cell mixtures from DNA methylation is limited to distributions, not actual counts, as one might obtain from flow cytometry. Finally, we remark that it is possible to model z_1i directly as a function of mixture coefficients Γ^∗ obtained individually via the constraint γ_l^∗ ≥ 0, but the inferential implications are less clear, and we view the proposed approach for populations as more statistically robust.
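The constrained single-sample projection can be sketched in Python/NumPy. Note the substitution: instead of a quadratic programming solver as cited above, this sketch uses projected gradient descent, which targets the same minimizer of the convex least-squares objective under the non-negativity constraint. The function name `constrained_mixture`, the choice to leave the intercept unconstrained, the iteration count, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def constrained_mixture(y, B0, n_iter=5000):
    """Minimize ||y - g0 * 1_m - B0 @ g||^2 subject to g >= 0.

    The intercept g0 is left unconstrained (an assumption of this sketch).
    """
    m, d0 = B0.shape
    A = np.column_stack([np.ones(m), B0])        # (1_m, B0)
    # Step size 1/L, where L is the largest eigenvalue of A^T A.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(d0 + 1)
    for _ in range(n_iter):
        x = x - step * (A.T @ (A @ x - y))       # gradient step
        x[1:] = np.maximum(x[1:], 0.0)           # project weights onto >= 0
    return x[0], x[1:]

# Synthetic check: a noise-free mixture of three hypothetical profiles.
rng = np.random.default_rng(2)
B0 = rng.uniform(0.05, 0.95, size=(100, 3))
w_true = np.array([0.6, 0.3, 0.1])
y_star = 0.01 + B0 @ w_true
g0_hat, w_hat = constrained_mixture(y_star, B0)
```

In the noise-free example the unconstrained solution is already non-negative, so the constrained estimate coincides with it; the constraint only becomes active when noise or unprofiled cell types would otherwise push a coefficient below zero.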

Implementation

We describe several examples using existing methylation data sets as benchmarks for validating the proposed method, in order to demonstrate its clinical or epidemiological utility. First we describe the validation data set S₀ used in all examples. Next we describe a laboratory reconstruction experiment, which validates our fundamental proposition that DNA methylation retains substantial information about cell mixtures. Finally we describe the results of applying our methodology to several different target data sets S₁. For the head and neck cancer and ovarian cancer data sets, from which bead chip data were available, a linear mixed effects model with a random intercept for bead chip was used to estimate the corresponding row of B₁. For the remaining data sets, no bead chip data were available; consequently, ordinary least squares was used. 250 bootstrap iterations were used for each example and each of the two bootstrap methods of standard error estimation.

Validation data

All data analyses involve DNA methylation data obtained by the Infinium HumanMethylation27 Beadchip Microarrays from Illumina, Inc. (San Diego, CA). We used a subset of m = 100 CpG sites on the array, selected as described below. In all of our examples, S₀ consisted of 46 white blood cell samples, de-identified specimens that were not subject to human subjects review by an institutional review board (IRB). The sorted, normal, human, peripheral blood leukocyte subtypes were purchased from AllCells®, LLC (Emeryville, CA) and were isolated from whole blood using a combination of negative and positive selection with highly specific cell surface antibodies conjugated to magnetic beads; materials and protocols were obtained from Miltenyi Biotec, Inc. (Auburn, CA). These 46 samples are summarized in Table 1 and depicted by the clustering heatmap in Figure 1. Note that T lymphocytes that express CD4 or CD8 constitute over 95% of the T cell class, and that the pan-T cell type was further refined to CD4+, CD8+, and “other” Pan-T cell subtypes. In summary, the covariate vector w_0h consisted of indicators for five cell types and another two indicators for CD4+ and CD8+ T cell subtypes. A generalization of the one-way ANOVA parameterization assumed above for w_0h, described in Additional file 1, was necessary to account for the ambiguous status of some Pan-T cells. For each CpG site, a linear mixed effects model with a random intercept for bead chip was used to estimate B₀; 27 additional whole blood control samples (replicates from the same individual) were used to assist in estimating chip effects, since otherwise the data set would have been sufficiently sparse to risk confounding between cell type and chip. These “array controls” were indicated with an additional term in w_0h. For each CpG site, a linear mixed effects model with a random intercept for bead chip was used to estimate the corresponding row of B₀ and B₁.
From S₀, F statistics (described in Additional file 1) were computed and used to order each of the 26,486 autosomal CpGs by decreasing level of informativeness with respect to blood cell types. As described in Additional file 1, we determined that maximum informativeness was provided by the top m = 100 to 300 CpG sites, with m > 300 reflecting diminishing returns from adding additional CpGs. Therefore, we chose a moderately low value in this range, m = 100, consistent with the size of a small custom microarray chip.
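The ranking step can be sketched in Python/NumPy. The F statistics actually used are defined in Additional file 1, which is not reproduced here, so this sketch assumes an ordinary one-way ANOVA F statistic per CpG as a stand-in; the function name `oneway_f`, the array sizes, and the synthetic effects are hypothetical.

```python
import numpy as np

def oneway_f(Y, groups):
    """Per-row one-way ANOVA F statistic. Y is m x n; groups has length n."""
    labels = np.unique(groups)
    n, k = Y.shape[1], len(labels)
    grand = Y.mean(axis=1, keepdims=True)
    ss_between = np.zeros(Y.shape[0])
    ss_within = np.zeros(Y.shape[0])
    for g in labels:
        Yg = Y[:, groups == g]
        mean_g = Yg.mean(axis=1, keepdims=True)
        ss_between += Yg.shape[1] * (mean_g - grand)[:, 0] ** 2
        ss_within += ((Yg - mean_g) ** 2).sum(axis=1)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Synthetic data: 200 CpGs, 3 cell types, 5 samples each;
# only the first 20 CpGs differ between cell types.
rng = np.random.default_rng(6)
groups = np.repeat(np.arange(3), 5)
Y = 0.5 + rng.normal(scale=0.02, size=(200, 15))
Y[:20] += np.array([-0.3, 0.0, 0.3])[groups]

F = oneway_f(Y, groups)
top = np.argsort(-F)[:20]      # indices of the 20 most informative CpGs
```

Selecting the rows of `Y` indexed by `top` plays the role of restricting the array to the m most informative CpG sites before estimating B₀.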

Table 1 Sorted white blood cells in S₀

Cell mixture experiment

Proof of the utility of the proposed methods in predicting leukocyte distributions for individual samples requires extensive, detailed reconstruction experiments beyond the scope of the present paper. However, to provide evidence that such experiments are worthwhile and show promise of positive results, we conducted a simple experiment involving six known mixtures of monocytes and B cells and six known mixtures of granulocytes and T cells. The results of this experiment are described below in Results.

Head and neck cancer

Our first target data set S₁ consisted of arrays applied to whole blood specimens collected in a random subset of individuals involved in an ongoing population-based case-control study[21] of head and neck cancer (HNSCC): 92 cases and 92 age and sex matched controls. The study was approved by Brown University IRB, protocol #0707992334. Blood was drawn at enrollment (prior to treatment in 85% of the cases). Mean age among the subjects arrayed in this study was 60 years, and there were 56 females and 128 males, consistent with the higher incidence of the disease in men. Thus, the covariate vector z consisted of an indicator for case/control status, an indicator for male sex, and age (in decades) centered at the mean. The clustering heatmap in Figure 2 depicts the raw DNA methylation data in S₁.

Ovarian cancer

We next applied our method to an ovarian cancer data set[22]. DNA methylation data for blood samples are available from Gene Expression Omnibus (GEO, http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/geo/, Accession number GSE19711). We used only those cases having blood drawn pre-treatment. After removing 4 arrays with a preponderance of missing values, the data set consisted of 272 controls and 129 cases having blood drawn prior to treatment. A clustering heatmap displaying the DNA methylation data appears in Additional file 1. In this analysis, z consisted of case-control status, age (categorized in 5-year increments), and 2 bisulfite conversion efficiency measures.

Down syndrome

We also applied our method to a trisomy 21 (Down syndrome) data set[23] consisting of 29 total peripheral blood leukocyte samples from Down syndrome cases and 21 controls, as well as 6 T cell samples from cases and 4 T cell samples from controls (GEO Accession number GSE25395). Because of the potential for bias induced by copy number amplification, we excluded 4 CpG sites on Chromosome 21, resulting in m = 96 CpG sites used for analysis. A clustering heatmap displaying the DNA methylation data appears in Additional file 1. In one analysis, we compared cases and controls using the total leukocyte samples only, and in another we compared total leukocytes to T cells, pooling cases and controls. Additional file 1 presents coefficient estimates.

Obesity in African Americans

Finally, we applied our method to an obesity data set[24] consisting of 7 lean African-Americans and 7 obese African-Americans (GEO Accession number GSE25301). A clustering heatmap displaying the DNA methylation data appears in Additional file 1. In this analysis, z consisted of obesity status.

Additional analyses

If the subject population for which z = 0 is sufficiently homogeneous with respect to blood cell distribution to admit sensible characterization of that distribution, then it is possible to recover estimates from $\hat{Γ}$. Additional file 1 reports the results of such an analysis applied to the HNSCC case/control data set. Finally, we conducted an additional analysis where we took S₀ to consist of only samples with pure CD4+ or CD8+ cells and S₁ to consist only of samples having the less purified T-lymphocytes. For such S₁, there were no covariates, so z consisted only of an intercept.

Simulations

We conducted extensive simulation studies in order to verify the finite-sample statistical properties of our proposed methodology. Simulation parameters were obtained from the HNSCC data set, and most simulations assumed no sources of biological bias (DNA methylation changes arising from processes not mediated by the profiled leukocytes, including shifts in distribution within cell types not profiled). In every simulation, we specified S₀ to consist of 5 B-cell samples, 10 granulocyte samples, 5 monocyte samples, 15 NK samples, 5 general “Pan-T” T-cell samples, 8 specific CD4+ T cell samples, and 2 specific CD8+ T cell samples. Estimates from the external validation set S₀, described above, were used for mean methylation profiles among WBC types, using the m = 100 most informative CpG sites.

We specified n₁/2 cases and n₁/2 controls, n₁ ∈ {100,200,500}. Among the controls, methylation profiles were generated by a white blood cell population of 7% B-cells, 62% granulocytes, 6% monocytes, 2% NK cells, and 13% T-cells, of which 65% were CD4+ cells and 35% were CD8+ cells, and the remaining 5% were unspecified (and assumed to have mean methylation equal to that of the unsorted T-lymphocytes). Among cases, we specified one of the following scenarios: a 4% reduction in CD4+ cells, a 2% reduction in CD8+ cells, and an 8% increase in granulocytes (alternative with changes in both CD4+ and CD8+, “Strong Alternative I”); a 6% reduction in CD4+ cells, and an 8% increase in granulocytes (alternative with changes in CD4+ but not CD8+, “Strong Alternative II”); a weaker alternative with half the effects of Strong Alternative I (“Mixed Alternative”, elaborated upon below); and two null scenarios with no changes in cell population, each with a different assumption about δ. Note that these changes reflect absolute changes in percentage points, not relative changes. Note also that these values were actually used to generate Dirichlet-distributed mixture weights for each simulated subject, with Dirichlet parameters equal to a precision parameter (100 corresponding to “precise” and 10 corresponding to “noisy”) times the mean weight described above. Residual effects $ξ_{i}^{(0)}$ for controls were set equal to 0.1 times the intercept estimate ${\hat{μ}}_{1}$ obtained from the HNSCC data set, while residual effects $ξ_{i}^{(1)}$ for cases were set equal to 0.08 or 0.09 times ${\hat{μ}}_{1}$ plus multiples 10θ of the column of $\hat{U}$ corresponding to case.
The constants of proportionality 0.1, 0.08, and 0.09 were chosen to correspond to assumed contributions of ξ to an overall methylation signature presumed to be dominated by profiled populations of white blood cells in specified proportions, with 0.08 used for the strong alternatives and 0.09 used for the Mixed Alternative. The constant 10 was used to amplify the scale of δ so that its effect could be detected in simulation; note that $\hat{U}$ was orthogonal to the white blood cell profiles, by construction. The multiplier θ = 0 was used for the strong alternatives and the “Strong Null” case (i.e. no methylation differences between cases and controls), while θ = 0.5 was used for the Mixed Alternative, and θ = 1 was used for the “Mixed Null” with case/control differences not mediated by cellular population differences. A simple normal error structure for e_0h and e_1i was specified, with no chip effects, but with variance equal to the sum of chip and residual variance estimated (individually for each CpG) for the HNSCC data. For each simulation, 50 bootstraps were used to estimate standard errors. 1000 simulations were run for each scenario.
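The generation of subject-level mixture weights can be sketched in Python/NumPy. The Dirichlet parameterization (precision times mean weight) follows the description above, but the mean weights in `mean_w` are hypothetical round numbers chosen to sum to one, not the control proportions used in the simulations, and the function name `simulate_weights` is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical mean mixture weights for five cell types (sum to 1).
mean_w = np.array([0.10, 0.60, 0.10, 0.05, 0.15])

def simulate_weights(mean_w, precision, n):
    """Draw n Dirichlet weight vectors with parameters precision * mean_w."""
    return rng.dirichlet(precision * mean_w, size=n)

precise = simulate_weights(mean_w, 100, 2000)   # "precise" scenario
noisy = simulate_weights(mean_w, 10, 2000)      # "noisy" scenario
```

Both settings share the same mean weights; the lower precision only increases subject-to-subject variability around them. A simulated methylation profile for one subject would then be the purified profile matrix multiplied by that subject's weight vector, plus the residual term and normal error described above.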