 Research
 Open Access
 Published:
BG2: Bayesian variable selection in generalized linear mixed models with nonlocal priors for nonGaussian GWAS data
BMC Bioinformatics volume 24, Article number: 343 (2023)
Abstract
Background
Genomewide association studies (GWASes) aim to identify single nucleotide polymorphisms (SNPs) associated with a given phenotype. A common approach for the analysis of GWAS is single marker analysis (SMA) based on linear mixed models (LMMs). However, LMMbased SMA usually yields a large number of false discoveries and cannot be directly applied to nonGaussian phenotypes such as count data.
Results
We present a novel Bayesian method to find SNPs associated with nonGaussian phenotypes. To that end, we use generalized linear mixed models (GLMMs) and, thus, call our method Bayesian GLMMs for GWAS (BG2). To deal with the high dimensionality of GWAS analysis, we propose novel nonlocal priors specifically tailored for GLMMs. In addition, we develop related fast approximate Bayesian computations. BG2 uses a twostep procedure: first, BG2 screens for candidate SNPs; second, BG2 performs model selection that considers all screened candidate SNPs as possible regressors. A simulation study shows favorable performance of BG2 when compared to GLMMbased SMA. We illustrate the usefulness and flexibility of BG2 with three case studies on cocaine dependence (binary data), alcohol consumption (count data), and number of rootlike structures in a model plant (count data).
Introduction
Genomewide association studies (GWAS) have uncovered many single nucleotide polymorphisms (SNP) associated to important phenotypes such as plant productivity [1], plant response to salt stress [2], and human diseases [3]. To take into account the correlation among GWAS observations, the most widely used methods for the analysis of GWAS continuous Gaussian data are single marker analysis (SMA) methods based on linear mixed models (LMMs) [4,5,6]. Recently, SMA based on logistic regression with random effects has been proposed for the analysis of GWAS binary data [7]. However, to the best of our knowledge, there are no published methods for the analysis of other types of correlated GWAS nonGaussian data such as count data. One of our contributions is to propose the use of generalized linear mixed models for the analysis of GWAS nonGaussian data. To that end, we use generalized linear mixed models (GLMMs) and, thus, call our method Bayesian GLMMs for GWAS (BG2).
BG2 has two steps: a screening step and a model selection step. The screening step, similarly to SMA methods, fits p GLMMs where each model has just one SNP, and uses Bayesian FDR control [8, 9] to provide a set of candidate SNPs. After that, the model selection step performs a model search through the space of GLMMs that may include any number of screened candidate SNPs as possible regressors. BG2 implements both steps using a pseudolikelihood approach. We note that a similar pseudolikelihood approach can be used to implement SMA methods for nonGaussian GWAS data, and a particular case of such an approach has been proposed for GWAS binary data [7]. However, simulation studies presented in Sect. 4 show that, when compared to such SMA methods for nonGaussian data, BG2 leads to much lower FDR.
The GLMMs for GWAS data considered by BG2 may have two types of random effects: kinship randoms effects and overdispersion random effects. The kinship random effects account for correlation among GWAS observations due to population stratification and hidden relatedness. Similarly to existing literature for Gaussian GWAS data, we assume that the vector of kinship random effects follows a multivariate Gaussian distribution with a mean vector of zeros and a covariance matrix that is the product of a onedimensional unknown variance parameter and a known positive semidefinite kinship matrix [10, 11]. The overdispersion random effects allow for extra variability not accounted for by the model for observations; for example, when assuming a conditional Poisson model for the observations, the overdispersion random effects account for extraPoisson variability.
Both screening and model selection steps in BG2 are based on nonlocal priors. To the best of our knowledge, this is the first time that nonlocal priors are proposed for regression coefficients in GLMMs. Previous literature in Bayesian model selection for GLMMs has assigned for regression coefficients local priors [12]. While local priors have positive density at null parameter values, nonlocal priors have density equal to zero at null parameter values. Nonlocal priors were first proposed by [13, 14] for Gaussian linear models. Nonlocal priors have been successfully developed for many different problems such as model selection in Gaussian directed acyclic graphical models [15], classification with Bayesian probit models [16], variable selection in logistic models [17], Bayesian wavelet analysis [18], and variable selection in generalized linear models [19]. In particular, [20, 21] have proposed methods based on nonlocal priors for variable selection in linear mixed models applied to GWAS data. However, because LMMs applied to binary or count data may lead to meaningless negative predictions and statistically inefficient estimation, LMMs should not be applied to nonGaussian data such as count and binary data, which are the types of data considered by BG2. Nonlocal priors lead to faster accumulation of evidence in favor of a true null hypothesis [13], and have been shown to be advantageous for highdimensional problems [14, 16, 22]. Therefore, BG2 uses nonlocal priors for SNP search in GWAS analysis.
Due to the large number of GLMMs that need to be fitted, BG2 relies on two approximations to speed up computations: a pseudolikelihood approximation; and a Population Parameters Previously Determined (P3D) approximation that may be seen as an empirical Bayes approach. For GLMMs, the integrated likelihood function obtained by integrating out the random effects is not available in closed form. Repeated numerical integration of the random effects for each GLMM fitted for a GWAS analysis is computationally too expensive. Thus, BG2 uses a pseudolikelihood approach [23] to facilitate integrating out the random effects. Such pseudolikelihood approach leads to a Gaussian approximation for adjusted observations that allows analytically integrating out the random effects. In addition, to avoid the computation of matrix inverses for each SNP and, thus, to further speed up computations, we propose a P3D approximation for GLMMs. A P3D approximation was first proposed by [24] for Gaussian linear mixed models (LMMs) and a variation of this approximation is used in the celebrated and widely used method EMMAX for the analysis of GWAS Gaussian data [6]. With our P3D approach, BG2 needs to compute a spectral decomposition only once for each screening step and only once for each model selection step.
In our P3D approach, for each BG2 step (screening and model selection) we fit a baseline GLMM to obtain adjusted observations and estimates of the variance parameters. We then keep the adjusted observations and the variance parameters fixed at the values computed with the baseline GLMM when fitting all other models in that BG2 step. In our P3D approach, the baseline model is different for the screening step and for the model selection step. For the screening step, the baseline model is a GLMM without any SNPs. For the model selection step, the baseline model is a GLMM with all candidate SNPs obtained from the screening step. This choice of baseline GLMM for the model selection step is based on [25], who have suggested for GLMMs the use of adjusted observations based on the full model – the model with all the regressors – when computing BICs for all possible models. Therefore, BG2 with our P3D approximation does not need to compute a spectral decomposition for each SNP. As a result, when compared to a usual pseudolikelihood approach to GLMMs, our P3D approximation greatly reduces the computational time and allows the analysis of nonGaussian GWAS data within a reasonable time frame.
To be technical, in this work we use a hierarchical model and an empirical Bayes approach to estimate the hyperparameters of the prior distribution of the regression coefficients of GLMMs. We then combine this prior distribution with the data through Bayes Theorem to compute the posterior probability of the competing GLMMs. The Bayesian model selection procedure that we propose in this work is similar to that of [26], except that in our current work we are dealing with the problem of Bayesian ultrahigh dimensional variable selection (p two orders of magnitude larger than n) in GLMMs applied to GWAS analysis. To the best of our knowledge, currently there are no published methods for Bayesian ultrahigh dimensional variable selection in GLMMs.
The remainder of this paper is organized as follows. Section 2 describes the GLMMs that we consider for nonGaussian GWAS data. Section 3 describes our BG2 method for the identification of causal SNPs. Section 4 presents the results of two simulation studies for binary data and for count data. Section 5 illustrates our method with applications to three case studies: human cocaine dependence, alcohol consumption, and the number of rootlike structures in the plant A. Thaliana. Section 6 concludes with a discussion and future directions.
GLMMs for GWAS
Consider observations \(y_1, \dots , y_n\) that, given random effects, are conditionally independent and have a distribution from the exponential family of distributions. This flexible family of distributions includes the Bernoulli, binomial, Poisson, and gamma distributions. Thus, this family may be used to model observed GWAS phenotypes such as an indicator of disease presence/absence, number of lateral roots in plants, or survival time. Then, the density function of \(y_i\) is
for \(i=1,\dots ,n\), where \(T(y_i)\) is the sufficient statistics for \(y_i\), B(.) and C(.) are known functions. Further, each observation \(y_i\) has mean \(\mu _i=B'(\eta _i)\) and variance \(v_i = B''(\eta _i)\). Let \(X_s\) be a matrix of SNPs and \(\pmb {\beta }_s\) be the corresponding vector of regression coefficients. In addition, let \(X_c\) be a matrix that contains a column of ones for the intercept and other columns for control covariates (e.g., age, sex, and environmental factors) and \(\pmb {\beta }_c\) be the corresponding vector of regression coefficients. Thus, \(\pmb {\beta }_s\) and \(\pmb {\beta }_c\) are fixed effects. Further, let \(\pmb {\alpha }_1\) be a vector of random effects that accounts for kinship correlation. Specifically, \(\pmb {\alpha }_1\) has a multivariate normal distribution with mean vector \(\pmb {0}\) and covariance matrix \(\kappa _1 \Sigma\), where \(\kappa _1\) is an unknown scalar and \(\Sigma\) is a kinship matrix. Furthermore, let \(\pmb {\alpha }_2\) be a vector of overdispersion random effects following \(N(\pmb {0}, \kappa _2 I)\). Let \(\pmb {y}=(y_1, \dots , y_n)\) be the vector of observed phenotypes. Then, the conditional expectation \(E(\pmb {y}\pmb {\alpha }_1,\pmb {\alpha }_2)\) is linked to the linear predictor \(X_s\pmb {\beta }_s+X_c\pmb {\beta }_c+\pmb {\alpha }_1+\pmb {\alpha }_2\) by the link function g:
The class of GLMMs given by Eqs. (1) and (2) can be expanded to deal with other cases. For example, to account for the experimental design used for data collection, we may add another random effect \(\pmb {\alpha }_3\) following a multivariate normal distribution with mean vector \(\pmb {0}\) and covariance matrix \(\kappa _3 \Sigma _3\), where \(\kappa _3\) is a unknown parameter and \(\Sigma _3\) is a symmetric positive semidefinite matrix that describes the dependence structure among the observations due to the experimental design. Because of the P3D approach, BG2 can include additional random effects and still use the spectral decomposition approach to speed up computations.
BG2: Bayesian SNP selection in GLMMs for GWAS
Our method BG2 consists of two steps: screening and model selection. The BG2 screening step uses a novel Bayesian single marker analysis for nonGaussian data and applies Bayesian false discovery rate control to yield a set of candidate SNPs. After that, the BG2 model selection step performs a search through the model space of all GLMMs that may include any number of SNPs from the set of candidate SNPs. In both steps, BG2 uses a pseudolikelihood approach to fit models. In what follows, Sect. 3.1 presents the pseudolikelihood approach, Sect. 3.2 introduces the BG2 screening step, and Sect. 3.3 presents the BG2 model selection step.
Pseudolikelihood model fitting
In both the screening and the model selection steps, BG2 uses a pseudolikelihood approach. In this subsection, we provide a summary description of the pseudolikelihood approach. In addition, Additional file 1: Section S1 provides a detailed presentation of the pseudolikelihood approach. This is an iterative approach that writes the model for the observations as \(\pmb {y} = \pmb {\mu } + \pmb {\epsilon }\), where \(\pmb {\epsilon }\) is a vector of errors and \(V=Var(\pmb {\epsilon })=Var(\pmb {y}) {= diag(v_1, \ldots , v_n)}\) is a diagonal matrix. Note that for distributions in the exponential family, the variance \(v_i\) depends on the linear predictor \(\eta _i\) and, thus, gets updated in each iteration of the pseudolikelihood algorithm. More details can be found in Additional file 1: Section S1. In addition, the pseudolikelihood approach expands \(\pmb {\mu }=E(\pmb {y}\pmb {\beta }_s,\pmb {\beta }_c, \pmb {\alpha }_1,\pmb {\alpha }_2)\) in a firstorder Taylor expansion around current estimates of \(\pmb {\beta }_s\), \(\pmb {\beta }_c\), \(\pmb {\alpha }_1\), \(\pmb {\alpha }_2\), \(\kappa _1\), and \(\kappa _2\). The resulting equation is rearranged such that the lefthand side depends only on known quantities (observations, current estimates of parameters, regression matrices). Then, this equation is premultiplied by \(V^{1}\). Let \({\hat{V}}\) be the current estimate for V. The lefthand side of the resulting equation, known as the vector of adjusted observations, is \(\pmb {y}^\star = {\widehat{V}}^{1}(\pmb {y}\widehat{\pmb {\mu }})+X_s\widehat{\pmb {\beta }}_s+ X_c\widehat{\pmb {\beta }}_c+ \widehat{\pmb {\alpha }}_1+\widehat{\pmb {\alpha }}_2\). Equating \(\pmb {y}^\star\) to the righthand side of the resulting equation yields
Then, the pseudolikelihood approach approximates the GLMM by an LMM given by Eq. (3) with vectors of random effects \(\pmb {\alpha }_1 \sim N(\pmb {0}, \kappa _1\Sigma )\) and \(\pmb {\alpha }_2 \sim N(\pmb {0}, \kappa _2 I).\) Based on this LMM, new estimates are computed for \(\pmb {\beta }_s\), \(\pmb {\beta }_c\), \(\pmb {\alpha }_1\), \(\pmb {\alpha }_2\), \(\kappa _1\), \(\kappa _2\), and V. The pseudolikelihood algorithm then iterates until convergence of these estimates. More details about the pseudolikelihood method are given in Additional file 1: Section S1.
BG2 screening step
The BG2 screening step uses a P3D approach based on a baseline model that assumes a linear predictor given in Eq. (2) specialized to contain no SNPs, that is, \(g(E(\pmb {y}\pmb {\beta }_c, \pmb {\alpha }_1,\pmb {\alpha }_2)) = X_c\pmb {\beta }_c+\pmb {\alpha }_1+\pmb {\alpha }_2.\)
Our P3D approach keeps \(\pmb {\beta }_c\), \(\kappa _1\), \(\kappa _2\), and V fixed at their pseudolikelihood estimates when performing the Bayesian SMA in the BG2 screening step. Let us denote these estimates by \(\widehat{\pmb {\beta }}_c\), \({\widehat{\kappa }}_1\), \({\widehat{\kappa }}_2\), and \({\widehat{V}}\). In addition, our P3D approach keeps the vector of adjusted observations fixed equal to \(\pmb {y}^\star\) obtained at the last iteration of the pseudolikelihood algorithm for the baseline model. Let \(H= {\widehat{\kappa }}_1\Sigma + {\widehat{\kappa }}_2 I+ {\widehat{V}}^{1}\) be the estimated covariance matrix of the adjusted observations \(\pmb {y}^\star\). Consider the spectral decomposition of the matrix H given by \(H=PDP^T\). The matrix H is kept fixed for all SNPs in the screening step. Thus, the spectral decomposition of H, which has a computational cost of \(O(n^3)\), has to be computed only once at the beginning of the screening step.
Let \(\pmb {x}_s\) be the vector of covariate values for SNP s. Then, the BG2 screening step assumes for each SNP s, \(s=1,\dots ,p\), that the adjusted observations \(\pmb {y}^\star\) can be modeled by the LMM
Then, the adjusted observations \(\pmb {y}^\star\) have an approximate multivariate Gaussian distribution \(N(X_c \widehat{\pmb {\beta }}_c + \pmb {x}_s \beta _s, H).\) Let \(\widetilde{\pmb {y}} = P^T(\pmb {y}^\star X_c\widehat{\pmb {\beta }}_c)\) and \(\widetilde{\pmb {x}}_s = P^T \pmb {x}_s\). Then, an estimator of \(\beta _s\) is \({\widehat{\beta }}_s = (\widetilde{\pmb {x}}_s^T D^{1} \widetilde{\pmb {x}}_s)^{1} \widetilde{\pmb {x}}_s^T D^{1} {\widetilde{y}}\). In addition, the estimator \({\widehat{\beta }}_s\) has approximate distribution \(N(\beta _s, \sigma ^2_s)\), where \(\sigma ^2_s = var( {\widehat{\beta }}_s) = (\widetilde{\pmb {x}}_s^T D^{1} \widetilde{\pmb {x}}_s)^{1}.\)
We assign for \(\beta _s\) a prior that is a mixture of a Dirac delta function and a nonlocal prior, that is,
where \(\pi _0\) is the probability of the null hypothesis that \(\beta _s\) is equal to zero and \(\tau >0\) is a scale parameter. Larger values of \(\tau\) cause the prior to be more spread out and lead BG2 to focus on identifying SNPs with relatively large regression coefficients. Then, the predictive density of \({\widehat{\beta }}_s\) is
Based on this predictive density and assuming that \({\widehat{\beta }}_1, \ldots , {\widehat{\beta }}_p\) are approximately conditionally independent given \(\pi _0\) and \(\tau\), we obtain the approximate likelihood function of \(\tau\) and \(\pi _0\)
Let \(\pi (\tau )\) and \(\pi (\pi _0)\) be the prior densities of \(\tau\) and \(\pi _0\), respectively. Then, by Bayes Theorem an approximate posterior density for \((\tau ,\pi _0)\) is
BG2 estimates \(\tau\) and \(\pi _0\) by maximizing (7) to obtain posterior modes \({\widehat{\tau }}\) and \({\widehat{\pi }}_0\).
We assign a noninformative uniform prior on (0, 1) for \(\pi _0\) and consider two prior specifications for \(\tau\). The first prior specification is a uniform prior for \(\tau\) on \((0,\infty )\). The second prior specification for \(\tau\) is an inverse gamma distribution with shape parameter \(0.55/0.022 + 1\) and rate parameter 0.55, that is \(\tau \sim IG(0.55/0.022+1,0.55)\). This prior specification implies a prior mean for \(\tau\) equal to 0.022, which was the value for a fixed \(\tau\) recommended by [20] for GWAS studies. In addition, we note that values of \(\tau\) that are too small lead to numerical instability. Therefore, our prior specification implies that a priori \(P(\tau >0.01)=0.999\), stochastically keeping \(\tau\) away from 0.
Alternatively, we may fix \(\tau\) at prespecified values [14, 20]. Specifically, in the context of GWAS analysis, [20] suggested fixing \(\tau =0.022\) because GWAS effect sizes are generally very small. When \(\tau =0.022\), the nonlocal product moment prior (pMOM) prior assigns a probability of 0.01 to the event that a standardized effect size falls in the interval (\(\)0.05, 0.05). Thus, in the simulation studies presented in Sect. 4, we also consider fixing \(\tau\) at 0.022.
After estimating \(\tau\) and \(\pi _0\), BG2 takes an Empirical Bayes approach and keep them at their estimates \({\widehat{\tau }}\) and \({\widehat{\pi }}_0\) while using Bayes Theorem to compute the posterior probability that the regression coefficient of SNP s (\(s=1,\ldots ,p\)) in the screening step is different than zero, that is
where \(p( {\widehat{\beta }}_s{\widehat{\tau }},{\widehat{\pi }}_0)\) is the predictive density given in Eq. (5) with \(\tau ={\widehat{\tau }}\) and \(\pi _0={\widehat{\pi }}_0\).
Finally, based on the posterior probabilities computed with Eq. (8), the BG2 screening step uses Bayesian FDR control [8, 9, 27,28,29] to obtain a list of candidate SNPs while keeping the nominal FDR at 5%. Let us denote the number of SNPs contained in this list of candidate SNPs obtained in the screening step by k.
BG2 model selection step
The BG2 model selection step considers GLMMs with any number of SNPs from the list of k candidate SNPs obtained from the BG2 screening step. Thus, the model selection step considers \(S=2^k\) possible models. Let \(M_m\) be the mth model, \(m = 1, \ldots , S\). Let \(X_m\) be the matrix of SNPs in model \(M_m\), \(\pmb {\beta }_m\) be the corresponding vector of regression coefficients, and \(p_m\) be the number of SNPs in model \(M_m\). Let \(X_S\) be the model with all k candidate SNPs.
We assume that the k candidate SNPs may or may not be in a model according to a sequence of exchangeable Bernoulli trials. Specifically, the prior probability of model \(M_m\) is \(P(M_m) = {\widehat{\pi }}_0^{k  p_m} (1 {\widehat{\pi }}_0)^{p_m}\) where \({\widehat{\pi }}_0\) is the estimate of the probability of null hypothesis obtained in the screening step. We do this to ensure that the Bayesian control of false discoveries in the BG2 model selection step is as strict as the control of false discoveries in the BG2 screening step.
The BG2 model selection step uses a P3D approach where the baseline model is the full model \(M_S\) with linear predictor \(g(E(\pmb {y}\pmb {\alpha }_1,\pmb {\alpha }_2)) = X_c\pmb {\beta }_c+X_{S}\pmb {\beta }_{S}+\pmb {\alpha }_1+\pmb {\alpha }_2.\) The pseudolikelihood approach then yields estimates \(\widehat{\pmb {\beta }}_c\), \({\widehat{\kappa }}_1\) and \({\widehat{\kappa }}_2\), \({\widehat{V}}\), and adjusted observations \(\pmb {y}^\star\). We then consider all models \(M_m, m=1,\ldots ,S\), where we keep \(\pmb {\beta }_c\), \(\kappa _1\), \(\kappa _2\), and V fixed at these estimates. Let \(H= {\widehat{\kappa }}_1\Sigma + {\widehat{\kappa }}_2 I+ {\widehat{V}}^{1}\) and consider the spectral decomposition of the matrix H given by \(H=PDP^T\). The matrix H is kept fixed for all nonbaseline models in the model selection step. Thus, even though the spectral decomposition has a computational cost of \(O(n^3)\), this decomposition has to be computed only once at the beginning of the model selection step. In addition, following the recommendation of [25], we keep the adjusted observations for all the S considered models fixed at the adjusted observations \(\pmb {y}^\star\) obtained while fitting the full model.
Therefore, under model \(M_m\) and with the P3D approach, the adjusted observations \(\pmb {y}^\star\) follow the approximate distribution \(N \left( X_c \widehat{\pmb {\beta }}_c + X_m \pmb {\beta }_m, H \right) .\) In addition, let \(\widetilde{\pmb {y}} = P^T(\pmb {y}^\star X_c\widehat{\pmb {\beta }}_c)\) and \({\widetilde{X}}_m = P^TX_m\). Then, we can rewrite the LMM as \(\widetilde{\pmb {y}}\pmb {\beta }_m {\mathop {\sim }\limits ^{a}} N({\widetilde{X}}_m \pmb {\beta }_m, D)\), where \({\mathop {\sim }\limits ^{a}}\) denotes “approximately distributed as.” Because D is a diagonal matrix, computations for this latter model are very fast.
We propose a novel nonlocal prior for GLMMs. Specifically, we propose a prior density that is the product of a multivariate Gaussian density and the product of the square of each element of the vector of regression coefficients \(\beta _m\). In this multivariate Gaussian density, the covariance matrix is \(\tau n (X_m^T H^{1} X_m)^{1}\). Using the spectral decomposition of the matrix H, the prior we propose for \(\pmb {\beta }_m\) is
where \(d_m\) is a normalizing constant.
Let \(C_m = {\widetilde{X}}_m^T D^{1} {\widetilde{X}}_m(1+({\widehat{\tau }} n)^{1})\), \(\widetilde{\pmb {\beta }}_m = C_m^{1}{\widetilde{X}}_m^T D^{1} \widetilde{\pmb {y}}\), and \(R_m \ = \ \widetilde{\pmb {y}}^T D^{1} (D{\widetilde{X}}_m C_m^{1} {\widetilde{X}}_m^T) D^{1}\widetilde{\pmb {y}} \ = \ \widetilde{\pmb {y}}^T D^{1} \widetilde{\pmb {y}} \widetilde{\pmb {y}}^T D^{1} {\widetilde{X}}_m\widetilde{\pmb {\beta }}_m.\) Then, the marginal density of the adjusted observations \(\widetilde{\pmb {y}}\) conditional on model \(M_m\) is
where \(E_1\left( \prod _{i=1}^{ p_m}\beta _{mi}^2 \right)\) is the expected value with respect to \(N(\pmb {0},(1+ {\widehat{\tau }} n) C_m^{1} )\) and \(E_2\left( \prod _{i=1}^{ p_m}\beta _{mi}^2 \right)\) is the expected value with respect to \(N(\widetilde{\pmb {\beta }}_m, C_m^{1})\). To approximate \(E_1\left( \prod _{i=1}^{p_m}\beta _{mi}^2 \right)\) and \(E_2\left( \prod _{i=1}^{p_m}\beta _{mi}^2 \right)\), we simulate 2000 samples from \(N(\widetilde{\pmb {\beta }}_m, C_m^{1})\), denoted as \(\pmb {\beta }_{2m}^{(j)}, \ j=1,\dots , 2000\). We compute \(\sum _{j=1}^{2000}(\prod _{i=1}^{p_m}\beta _{2m i}^{2(j)})/2000\) as an approximation to \(E_2\left( \prod _{i=1}^{p_m}\beta _{mi}^2 \right)\). Let \(\pmb {\beta }_{1\,m}^{(j)} =(1+ {\widehat{\tau }} n)^{\frac{1}{2}}(\pmb {\beta }_{2\,m}^{(j)}\widetilde{\pmb {\beta }}_m), \ j=1,\dots , 2000\). Finally, we compute \(\sum _{j=1}^{2000}(\prod _{i=1}^{p_m}\beta _{1m i }^{2(j)})/2000\) as an approximation to \(E_1\left( \prod _{i=1}^{p_m}\beta _{mi}^2 \right)\).
Then, the posterior probability of model \(M_m\) is
Note that the posterior distribution of the vector of regression coefficients is multimodal. BG2 deals with this multimodality without any difficulties. In the screening step, when \(\beta _s\) is a scalar, we compute the posterior probability of \(\beta _s \ne 0\) using Eqs. (5) and (8). In the model selection step, when \(\pmb {\beta }_m\) is a vector of coefficients, we compute the posterior probability of model \(M_m\) using Eqs. (10) and (11).
If the number of candidate covariates k is small (\(k<16\)), we compute the posterior probabilities for all \(2^{k}\) candidate models and select the highest posterior probability model as the best model. If the number of candidate covariates is large, we use a genetic algorithm from the R package GA [30] to search for the highest posterior probability model.
Simulation studies
We have performed simulation studies to compare our nonlocalpriorbased BG2 method versus SMA for binary data and count data. Specifically, we consider single marker analysis with Bonferroni correction with nominal FDR set to 0.05. To assess the performance of our methods, in these simulation studies we use genotype SNP data from humans and from A. Thaliana. These are the same genotype data used in the case studies we present in Sect. 5. We use four criteria to compare the competing methods: true positives (TP), false positives (FP), false discovery rate (FDR) and F1 score. Within each simulation study, for each method we compute the average TP, FP, FDR and F1 over 100 simulated datasets. We use a buffer to define what is a true positive and a false positive. Following [21], if one or more detected SNPs are adjacent (within 5000 base pairs) to a same causal SNP, that is counted as a true positive. In addition, each detected SNP not adjacent to a causal SNP is counted as a false positive.
Binary data
We simulate binary GWAS data using genotype information from the Study of Addiction: Genetics and Environment (SAGE) which is part of the National Human Genome Research Institute’s Gene Environment Association Study Initiative [Database for Genotypes and Phenotypes (dbGaP) study accession phs000092.v1.p1]. Specifically, we use genotype information from 2,772 European Americans in a total of 800,000 SNPs with minor allele frequency (MAF) larger than 0.01.
From these 800,000 SNPs, we selected 20 evenly spaced SNPs to be the causal SNPs. We set the regression coefficients for 5 of these causal SNPs to 0.2, and for 5 other causal SNPs to \(\)0.2. In addition, the regression coefficients for the other 10 causal SNPs have the same value \(\beta\), but that value varies in six settings: 0.2, 0.3, 0.4, 0.5, 0.7 and 1. Further, we set the intercept at \(\beta _0=0.5\). Furthermore, the variance component \(\kappa\) of the kinship random effects \(\pmb {\alpha }\) is set to 0.15. Thus, the binary phenotype data are simulated from a Bernoulli GLMM with logistic link function and linear predictor \(\beta _0 + \sum _{i=1}^{10} \beta x_{ij} + \sum _{i=11}^{15} 0.2x_{ij} + \sum _{i=16}^{20} (0.2)x_{ij} + \alpha _i,\) with \(\pmb {\alpha } \sim N(\pmb {0}, \kappa \Sigma )\) where \(\Sigma\) is the kinship matrix.
Figure 1 shows for binary data the performance of our BG2 method with three different ways to choose the parameter \(\tau\), as well as the performance of the SMA method. These performances in terms of TP, FP, FDR, and F1 averaged over 100 datasets for each setting are plotted as functions of the varying regression coefficient \(\beta\). In addition, Fig. 1 shows the computational time. Additional file 1: Figure S2 show boxplots of TP, FP, FDR, and F1. Our BG2 methods take twice as long as SMA, which is to be expected since SMA has only a screening step whereas BG2 has a screening step and a model selection step. Among the three ways considered to choose \(\tau\) for BG2, estimating \(\tau\) based on a uniform prior provides higher F1 scores for smaller values of \(\beta\), and provides comparable F1 scores for larger values of \(\beta\). In addition, when compared to SMA, BG2 with uniform prior provides larger average number of true positives TP than when \(\beta\) is small, and a smaller TP when \(\beta\) is large. However, BG2 with uniform prior leads to a much smaller average number of false positives than SMA. As a result, when compared to SMA, for all considered values of the regression coefficient \(\beta\), BG2 with uniform prior has much larger F1.
Finally, we have tested the robustness of BG2 to the case of binary GWAS data with no causal SNPs. Specifically, we have simulated 100 datasets with binary GWAS data from a Bernoulli GLMM with logit link function and linear predictor \(\beta _0 + \alpha _i\). While BG2 with any of the ways to choose \(\tau\) does not yield any false positive for 100 simulated datasets, SMA has an average of 0.06 false positives. Therefore, BG2 performs better than SMA for binary GWAS data and is robust to the case when there are no causal SNPs.
Count data
We simulate count GWAS data using genotype information from The Arabidopsis Information Resource (TAIR9) (https://www.arabidopsis.org/). This simulation study is based on a case study on rootlike structures in A. Thaliana that we present in Sect. 5.3.
Specifically, we use 188,980 SNPs with MAF>0.01 from 152 ecotypes of A. Thaliana. This simulation study assumes 10 causal SNPs evenly located among all available SNPs. Of these 10 causal SNPs, 5 causal SNPs have fixed coefficients equal to 0.2, and the other 5 causal SNPs have the same coefficient \(\beta\) which varies in eight settings: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 and 1. In addition, we set the intercept \(\beta _0\) equal to 1. Further, we assume that there are two random effects: a kinship random effect \(\pmb {\alpha }_1\) with variance component \(\kappa _1\) equal to 1; and an overdispersion random effect \(\pmb {\alpha }_2\) with variance component \(\kappa _2\) equal to 0.3, which is close to the estimate obtained in the case study presented in Sect. 5.3. Let \(r_i\) be the number of replicates of ecotype i. Because in the case study most ecotypes have 12 replicates, in this simulation study we assume that all ecotypes have 12 replicates. In addition, the phenotype \(y_i\) for ecotype i is the total number of rootlike structures of the \(r_i\) replicates. These phenotype count data are sampled from a Poisson GLMM with logarithm link function and linear predictor \(\log (r_i) + \beta _0 + \sum _{i=1}^5 \beta x_{ij} +\sum _{i=6}^{10} 0.2 x_{ij} +\alpha _{1i} +\alpha _{2i}\).
Figure 2 shows for count data the performance of our BG2 method as well as the performance of the SMA method. These performances are averaged over 100 simulated datasets for each setting and plotted as functions of the varying regression coefficient \(\beta\). In addition, Fig. 2 shows the computational time. Additional file 1: Figure S3 show boxplots of TP, FP, FDR, and F1. Our BG2 methods take about eight times longer than SMA, but they still provide results in a feasible amount of time. Among the three ways considered to choose \(\tau\) for BG2, estimating \(\tau\) based on an inverse gamma prior provides larger average number of true positives and about the same FDR level. As a result, when compared to the other ways to choose \(\tau\), estimating \(\tau\) based on an inverse gamma prior has higher F1 scores for most considered values of \(\beta\). In addition, when compared to SMA, BG2 with an inverse gamma prior provides larger average number of true positives TP for most considered values of \(\beta\). Further, BG2 with inverse gamma prior has about the same FDR level as SMA for \(\beta \le 0.5\) and a much smaller FDR level for \(\beta > 0.5\). As a result, while BG2 with an inverse gamma prior has comparable F1 to SMA for small values of \(\beta\), the F1 of BG2 with an inverse gamma prior becomes much larger than the F1 of SMA as \(\beta\) increases.
In addition, we have tested the robustness of BG2 to the case of count GWAS data with no causal SNPs. Specifically, we have simulated 100 datasets with count GWAS data from a Poisson GLMM with logarithm link function and linear predictor \(\beta _0 + \alpha _{1i} +\alpha _{2i}\). The average number of false positives for all considered methods is 0. Thus, both SMA and BG2 methods perform well in the case of count GWAS data with no causal SNPs.
Choice of prior for \(\tau\)
Choice of priors is an important part of the implementation of Bayesian methods. To obtain more information about the impact of prior choice on the results of BG2 implementations, we have expanded our simulation studies presented in Sects. 4.1 and 4.2. Specifically, we have performed two additional simulation studies: one that uses count data simulated with human genome and another one that uses binary data simulated with A. Thaliana genome. Unfortunately, the simulated binary datasets simulated with A. Thaliana genome did not contain enough information for SNPs to be detected by SMA or BG2. Additional file 1: Figure S1 presents the results for count data based on human genome data. Similarly to the results from Sects. 4.1 and 4.2, any of the implementations of BG2 perform much better than SMA in terms of FDR and F1. In addition, for count data based on human genome data, the three implementations of BG2 provide similar results. Therefore, from the results of the simulation studies, there is no prior choice that dominates the other prior choices.
Another important consideration is that the performance of the priors will vary for different GWAS datasets. And, of course, BG2 is to be used by scientists who are not Bayesian statisticians – thus, ideally there should be a default prior that would be safe to use with any GWAS dataset. One such class of priors is the class of noninformative priors [31,32,33,34] that impart little or no information in the analysis. The uniform prior is not concentrated around any particular value of \(\tau\) and, in this sense, is noninformative. In addition, in the simulation studies considered here, BG2 with the uniform prior for \(\tau\) performed similarly or better than BG2 with other prior choices for \(\tau\). Therefore, in the implementation of BG2 we recommend the uniform prior as a default choice for \(\tau\).
Case studies
To illustrate the usefulness and flexibility of the nonlocalpriorbased BG2 method, this section presents three case studies on cocaine dependence, alcohol consumption, and number of rootlike structures in A. Thaliana.
Maximum number of alcoholic drinks
The Collaborative Study on the Genetics of Alcoholism (COGA) [35] was a largescale family study that had as primary objective to identify genes related to alcohol dependence. Here, we consider as the response variable the maximum number of alcoholic drinks consumed in 24 h. We analyze data on 2759 European Americans considering 846,076 SNPs with MAF>0.01 and with less than 5% missing. To perform this analysis, we use the Poisson GLMM for count data considered in Sect. 4.2. In our analysis, the 846,076 SNPs are possible regressors. Our Poisson GLMM accounts for genetic structure among 2759 subjects by including a vector of kinship random effects, and allows for extraPoisson variability with a vector of overdispersion random effects.
While SMA detected 10 SNPs, BG2 detected only one SNP. More specifically, the screening step of BG2 identified 10 candidate SNPs which were then given to the BG2 model selection step. The BG2 model selection step then identified one SNP. Likelihood ratio tests indicate that the identified SNPs do not violate the hypothesis of HardyWeinberg equilibrium. While we cannot be sure about which of these SNPs are false positives, the simulation studies in Sect. 4 show that SMA tends to have a much higher FDR than BG2. Thus, in this case study the nonlocalpriorbased BG2 method provides a list of SNPs for further investigation that is much more focused. The SNP detected by BG2 is located in the proteincoding gene PTGER4 on chromosome 5. The protein encoded by PTGER4 is a receptor for prostaglandin E2 (PGE2). An increase in PGE2 is part of the inflammatory response to alcohol consumption, and the use of the PGE2inhibitor tolfenamic acid significantly reduces the severity of several hangover symptoms [36].
Cocaine dependence
In this case study, we analyze the association between cocaine dependence and single nucleotide polymorphisms (SNPs). We analyze data from the Family Study of Cocaine Dependence (FSCD) [37], which was part of the Study of Addiction: Genetics and Environment. Specifically, we analyze data on 2,767 European Americans considering 846,076 SNPs with MAF>0.01 and with less than 5% missing. The response variable is whether or not the subject is addicted to cocaine. To perform this analysis, we use the model for binary data considered in Sect. 4.1. Because males and females seem to have different behaviors with respect to cocaine use, we include sex as a control covariate. All 846,076 SNPs are possible regressors. In addition, to account for the genetic structure among the 2767 subjects, our Bernoulli GLMM has a vector of kinship random effects.
BG2 detects one SNP, which is located in the proteincoding gene ABCC8 on chromosome 11. In this case study, the screening step of BG2 identified 1 candidate SNP which was then selected in the model selection step. For this dataset, SMA only detects the same SNP. A likelihood ratio test indicates that the identified SNP does not violate the hypothesis of HardyWeinberg equilibrium. The protein encoded by this gene is a member of the superfamily of ATPbinding cassette (ABC) proteins which transport various molecules across extracellular and intracellular membranes. In addition, a quantitative transcriptomics analysis (RNASeq) has shown that this gene is overexpressed in the brain [38]. Further, cocaine increases expression of ABCC1 (another gene that encodes an ABC protein) in mice [39]. Finally, ABCC1siRNA (a silencer of ABCC1) blocks cocaineinduced place preference in mice [39].
Rootlike structures in A. Thaliana
To illustrate the application of our method to count data, we analyze data from a study of plant regeneration from root explants of the selfing species A. Thaliana [40]. Specifically, we consider as response variable the number of rootlike structures. We note that [40] applied a square root transformation to analyze this count phenotype variable. In contrast, we use the Poisson GLMM with overdispersion considered in Sect. 4.2 to analyze the original count data. Our model contains a vector of kinship random effects to account for the correlation among individuals and a vector of overdispersion random effects. We focus on the number of rootlike structures after 21 days in which seedlings are under warm white light at 21\(^o\)C following a 14/10 h light/dark regime. There are 188,980 SNPs for 152 ecotypes, with 12 biological replicates per ecotype, from TAIR9 with MAF>0.01.
BG2 detects 3 SNPs. More specifically, the screening step of BG2 screened 5 candidate SNPs and then the BG2 model selection step identified 3 of these SNPs. For this dataset, SMA detects the same 3 SNPs. These 3 SNPs are expressed in the root and are located in proteincoding genes AT1G20090, AT1G20100 and AT1G20720. Specifically, AT1G20100 encodes a DNA ligaselike protein involved in the regulation of metabolic processes. In addition, gene AT1G20720 encodes a RAD3like DNA binding helicase protein that acts in the repair of doublestrand breaks in DNA, and in nucleotideexcision repair. Finally, AT1G20090 encodes a ROP2 protein which is known to effect root hair initiation and tip growth [41].
Discussion
We have proposed BG2, a twostage Bayesian SNP detection method for nonGaussian GWAS data. BG2 uses a GLMM framework that includes kinship random effects and overdispersion random effects. BG2 has two steps: a screening step and a model selection step. The screening step performs a Bayesian SMA that selects a set of candidate SNPs. The model selection step then considers all possible GLMMs based on this set of candidate SNPs. To speed up computations, we develop a pseudo likelihood approach combined with P3D. Further, we develop a novel class of nonlocal priors for the regression coefficients specially tailored for GLMMs. Simulation studies show that, for both binary and count GWAS data, BG2 is much better than SMA in terms of FDR and F1.
The simulation studies show that, when compared to SMA, BG2 has a much lower FDR. Of course, there are some combinations of parameters for which SMA and BG2 provide similar results, and that is what seems to have happened in Sects. 5.2 and 5.3. However, in some applications BG2 provides a much smaller number of false discoveries than SMA, and that is what seems to have happened in the case study presented in Sect. 5.1. Therefore, when compared to SMA, BG2 is more robust and precise.
A relevant question is how sensitive to the choice of prior is the performance of BG2. We considered here implementations of BG2 with three different choices of priors. The simulation studies presented in this paper and in the additional file have shown that BG2 implementations with each of these three different choices of priors have similar performance. As a matter of fact, BG2 with the different choices of priors considered here provide the same results for the three case studies. Therefore, the performance of BG2 seems to be relatively robust to the choice of priors.
While we have chosen to implement BG2 with a pseudolikelihood approach and a genetic algorithm to explore the model space for the analysis of nonGaussian GWAS data, we acknowledge that other approaches may be possible. For example, instead of the pseudolikelihood approach, researchers may consider variational inference approaches [42, 43]. In addition, instead of combining the pseudolikelihood approach and a genetic algorithm, researchers may consider implementing a parallel tempering approach [44] to perform estimation and model selection at the same time. However, we think that such parallel tempering approach may not be computationally feasible for Bayesian ultrahigh dimensional variable selection in GLMMs applied to nonGaussian GWAS.
There are several possible avenues for future research. One promising research direction is to adapt BG2 for application to biobank scale data. Another possible research direction is to implement BG2 with an iterative procedure that would allow smaller effect sizes to be detected. Finally, another possible research avenue is to develop BG2 for GWAS analysis when the phenotype is survival time.
Conclusion
We propose BG2, a novel twostage Bayesian approach for nonGaussian GWAS data. Compared to SMA, BG2 provides a much lower FDR, is more precise and robust. BG2 is implemented in the R package BG2 that is available on BioConductor at https://bioconductor.org/packages/release/bioc/html/BG2.html.
Availability of data and materials
BG2 is implemented in the R package BG2 that is available on BioConductor at https://bioconductor.org/packages/release/bioc/html/BG2.html. In addition, the case study with A. Thaliana can be reproduced using the information and R code available at https://marfatvt.github.io/BG2CaseStudy.html. The A. Thaliana phenotype data and genotype data are available from the following sources: A. Thaliana phenotype data available at https://arapheno.1001genomes.org; A. Thaliana genotype dataset available from R package qtcat.data (https://rdrr.io/github/QTCAT/qtcat.data/). Genotype and phenotype data for alcohol and cocaine use disorder in humans is available from the NIH dbGap website: https://0wwwncbinlmnihgov.brum.beds.ac.uk/gap/, the accession number is phs000092.v1.p1.
Abbreviations
 BG2:

Bayesian GLMMs for GWAS
 LMM:

Linear mixed model
 GLMM:

Generalized linear mixed model
 SMA:

Single marker analysis
 GWAS:

Genomewide association studies
 SNP:

Single nucleotide polymorphism
 TP:

True positive
 FP:

False positive
 FDR:

False discovery rate
 GA:

Genetic algorithm
 MAF:

Minor allele frequency
 P3D:

Population parameters previously determined
 MCMC:

Markov chain Monte Carlo
References
Zhang X, Ding W, Xue D, Li X, Zhou Y, Shen J, Feng J, Guo N, Qiu L, Xing H, et al. Genomewide association studies of plant architecturerelated traits and 100seed weight in soybean landraces. BMC Genomic Data. 2021;22(1):1–14.
Julkowska MM, Koevoets IT, Mol S, Hoefsloot H, Feron R, Tester MA, Keurentjes JJ, Korte A, Haring MA, de Boer GJ, et al. Genetic components of root architecture remodeling in response to salt stress. Plant Cell. 2017;29(12):3198–213.
Zhang H, Ahearn TU, Lecarpentier J, Barnes D, Beesley J, Qi G, Jiang X, O’Mara TA, Zhao N, Bolla MK, et al. Genomewide association study identifies 32 novel breast cancer susceptibility loci from overall and subtypespecific analyses. Nat Genet. 2020;52(6):572–81.
Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, et al. A unified mixedmodel method for association mapping that accounts for multiple levels of relatedness. Nat Genet. 2006;38(2):203–8.
Kang HM, Zaitlen NA, Wade CM, Kirby A, Heckerman D, Daly MJ, Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008;178(3):1709–23.
Kang H.M, Sul J.H, Service S.K, Zaitlen N.A, Kong S.Y, Freimer N.B, Sabatti C, Eskin E. Variance component model to account for sample structure in genomewide association studies. Nat Genet. 2010;42(4):348–54.
Jiang L, Zheng Z, Fang H, Yang J. A generalized linear mixed model association tool for biobankscale data. Nat Genet. 2021;53(11):1616–21.
Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5(2):155–76.
Muller P, Parmigiani G, Rice K. FDR and Bayesian multiple comparisons rules. In: Bernardo, JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M (eds) Bayesian Statistics 8, pp. 349–370. Oxford Univ. Press, Oxford; 2007.
Poland J, Endelman J, Dawson J, Rutkoski J, Wu S, Manes Y, Dreisigacker S, Crossa J, SánchezVilleda H, Sorrells M, et al. Genomic selection in wheat breeding using genotypingbysequencing. Plant Genome. 2012;5(3).
Müller D, Technow F, Melchinger AE. Shrinkage estimation of the genomic relationship matrix can improve genomic estimated breeding values in the training set. Theor Appl Genet. 2015;128(4):693–703.
Baragatti M, Pommeret D. A study of variable selection using gprior distribution with ridge parameter. Comput Stat Data Anal. 2012;56(6):1920–34.
Johnson VE, Rossell D. On the use of nonlocal prior densities in Bayesian hypothesis tests. J R Stat Soc: Ser B. 2010;72(2):143–70.
Johnson VE, Rossell D. Bayesian model selection in highdimensional settings. J Am Stat Assoc. 2012;107(498):649–60.
Altomare D, Consonni G, La Rocca L. Objective Bayesian search of Gaussian directed acyclic graphical models for ordered variables with nonlocal priors. Biometrics. 2013;69(2):478–87.
Rossell D, Telesca D, Johnson VE. Highdimensional Bayesian classifiers using nonlocal priors. In: Giudici P, Ingrassia S, Vichi M, editors. Statistical models for data analysis. London: Springer; 2013. p. 305–13.
Nikooienejad A, Wang W, Johnson VE. Bayesian variable selection for binary outcomes in highdimensional genomic studies using nonlocal priors. Bioinformatics. 2016;32(9):1338–45.
Sanyal N, Ferreira MAR. Bayesian wavelet analysis using nonlocal priors with an application to FMRI analysis. Sankhya B. 2017;79(2):361–88.
Wu HH, Ferreira MAR, Elkhouly M, Ji T. Hyper nonlocal priors for variable selection in generalized linear models. Sankhya A. 2020;82(1):147–85.
Sanyal N, Lo MT, Kauppi K, Djurovic S, Andreassen OA, Johnson VE, Chen CH. GWASinlps: nonlocal prior based iterative SNP selection tool for genomewide association studies. Bioinformatics. 2019;35(1):1–11.
Williams J, Xu S, Ferreira MAR. BGWAS: Bayesian variable selection in linear mixed models with nonlocal priors for genomewide association studies. BMC Bioinform. 2023;24:194.
Rossell D, Telesca D. Nonlocal priors for highdimensional estimation. J Am Stat Assoc. 2017;112(517):254–65.
Wolfinger R, O’Connell M. Generalized linear mixed models: a pseudolikelihood approach. J Stat Comput Simul. 1993;48(3–4):233–43.
Zhang Z, Ersoz E, Lai CQ, Todhunter RJ, Tiwari HK, Gore MA, Bradbury PJ, Yu J, Arnett DK, Ordovas JM, et al. Mixed linear model approach adapted for genomewide association studies. Nat Genet. 2010;42(4):355–60.
Ten Eyck P, Cavanaugh JE. An alternate approach to pseudolikelihood model selection in the generalized linear mixed modeling framework. Sankhya B. 2018;80(1):98–122.
Xu S, Ferreira MAR, Porter EM, Franck C. Bayesian model selection for generalized linear mixed models. Biometrics. 2023.
Cui S, Guha S, Ferreira MAR, Tegge AN. hmmseq: a hidden Markov model for detecting differentially expressed genes from RNAseq data. Ann Appl Stat. 2015;9(2):901–25.
Xie J, Ji T, Ferreira MAR, Li Y, Patel BN, Rivera RM. Modeling allelespecific expression at the gene and SNP levels simultaneously by a Bayesian logistic mixed regression model. BMC Bioinform. 2019;20(1):1–13.
Williams J, Ferreira MAR, Ji T. BICOSS: Bayesian iterative conditional stochastic search for GWAS. BMC Bioinform. 2022;23:475.
Scrucca L. GA: a package for genetic algorithms in R. J Stat Softw. 2013;53(4):1–37.
Berger JO, De Oliveira V, Sansó B. Objective Bayesian analysis of spatially correlated data. J Am Stat Assoc. 2001;96(456):1361–74.
Fonseca TCO, Ferreira MAR, Migon HS. Objective Bayesian analysis for the studentt regression model. Biometrika. 2008;95(2):325–33.
Ferreira MAR, Salazar E. Bayesian reference analysis for exponential power regression models. J Stat Distrib Appl. 2014;1(1):1–12.
Keefe MJ, Ferreira MAR, Franck CT. Objective Bayesian analysis for Gaussian hierarchical models with intrinsic conditional autoregressive priors. Bayesian Anal. 2019;14:181–209.
Begleiter H, Reich T, Hesselbrock V, Porjesz B, Li TK, Schuckit MA, Edenberg HJ, Rice JP, et al. The collaborative study on the genetics of alcoholism. Alcohol Health Res World. 1995;19:228–228.
van de Loo AJ, Mackus M, Kwon O, Krishnakumar IM, Garssen J, Kraneveld AD, Scholey A, Verster JC. The inflammatory response to alcohol consumption and its role in the pathology of alcohol hangover. J Clin Med. 2020;9(7):2081.
Bierut LJ, Strickland JR, Thompson JR, Afful SE, Cottler LB. Drug use and dependence in cocaine dependent subjects, communitybased individuals, and their siblings. Drug Alcohol Depend. 2008;95(1–2):14–22.
Fagerberg L, Hallström BM, Oksvold P, Kampf C, Djureinovic D, Odeberg J, Habuka M, Tahmasebpoor S, Danielsson A, Edlund K, et al. Analysis of the human tissuespecific expression by genomewide integration of transcriptomics and antibodybased proteomics. Mol Cell Proteomics. 2014;13(2):397–406.
Chen L, Chen H, Xing Y, Li J. ABCC1 regulates cocaineassociated memory, spine plasticity and GluA1 and GluA2 surface expression. NeuroReport. 2021;32(10):833–9.
Lardon R, Wijnker E, Keurentjes J, Geelen D. The genetic framework of shoot regeneration in Arabidopsis comprises master regulators and conditional finetuning factors. Commun Biol. 2020;3(1):1–13.
Jones MA, Shen JJ, Fu Y, Li H, Yang Z, Grierson CS. The Arabidopsis Rop2 GTPase is a positive regulator of both root hair initiation and tip growth. Plant Cell. 2002;14(4):763–76.
Tan LSL, Nott DJ. Variational inference for generalized linear mixed models using partially noncentered parametrizations. Stat Sci. 2013;28:168–88.
Salimans T, Knowles DA. Fixedform variational posterior approximation through stochastic linear regression. Bayesian Anal. 2013;8:837–82.
Syed S, BouchardCôté A, Deligiannidis G, Doucet A. Nonreversible parallel tempering: a scalable highly parallel MCMC scheme. J R Stat Soc Ser B Stat Methodol. 2022;84(2):321–50.
Acknowledgements
Computations for this manuscript have been performed on supercomputers of Advanced Research Computing at Virginia Tech. The authors thank Dr. Allison Tegge for helpful discussions. In addition, the authors thank two anonymous reviewers whose comments and suggestion led to a substantially improved manuscript.Funding support for the Study of Addiction: Genetics and Environment (SAGE) was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01 HG004422). SAGE is one of the genomewide association studies funded as part of the Gene Environment Association Studies (GENEVA) under GEI. Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by the GENEVA Coordinating Center (U01 HG004446). Assistance with data cleaning was provided by the National Center for Biotechnology Information. Support for collection of datasets and samples was provided by the Collaborative Study on the Genetics of Alcoholism (COGA; U10 AA008401), the Collaborative Genetic Study of Nicotine Dependence (COGEND; P01 CA089392), and the Family Study of Cocaine Dependence (FSCD; R01 DA013423). Funding support for genotyping, which was performed at the Johns Hopkins University Center for Inherited Disease Research, was provided by the NIH GEI (U01HG004438), the National Institute on Alcohol Abuse and Alcoholism, the National Institute on Drug Abuse, and the NIH contract “High throughput genotyping for studying the genetic contributions to human disease” (HHSN268200782096C). The datasets used for the analyses described in this manuscript were obtained from dbGaP at http://0wwwncbinlmnihgov.brum.beds.ac.uk/projects/gap/cgibin/study.cgi?study_id=phs000092.v1.p1 through dbGaP accession number phs000092.v1.p.
Funding
This work was supported in part by National Science Foundation awards DMS 1853549 and DMS 2054173.
Author information
Authors and Affiliations
Contributions
SX, JW, and MARF conceived the study. SX, JW, and MARF developed the methodology and simulation experiments. SX implemented the simulation experiments. SX implemented the methodology and analyzed the results supervised by MARF. MARF acquired the funding. SX and MARF wrote the manuscript. SX, JW, and MARF reviewed the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. Supplementary Material for BG2: Bayesian variable selection in generalized linear mixed models with nonlocal priors for nonGaussian GWAS data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Xu, S., Williams, J. & Ferreira, M.A.R. BG2: Bayesian variable selection in generalized linear mixed models with nonlocal priors for nonGaussian GWAS data. BMC Bioinformatics 24, 343 (2023). https://0doiorg.brum.beds.ac.uk/10.1186/s1285902305468w
Received:
Accepted:
Published:
DOI: https://0doiorg.brum.beds.ac.uk/10.1186/s1285902305468w
Keywords
 Bayesian statistics
 GLMM
 GWAS
 Nonlocal prior
 Variable selection