Skip to main content

Spatial rank-based multifactor dimensionality reduction to detect gene–gene interactions for multivariate phenotypes

Abstract

Background

Identifying interaction effects between genes is one of the main tasks of genome-wide association studies aiming to shed light on the biological mechanisms underlying complex diseases. Multifactor dimensionality reduction (MDR) is a popular approach for detecting gene–gene interactions that has been extended in various forms to handle binary and continuous phenotypes. However, only few multivariate MDR methods are available for multiple related phenotypes. Current approaches use Hotelling’s T2 statistic to evaluate interaction models, but it is well known that Hotelling’s T2 statistic is highly sensitive to heavily skewed distributions and outliers.

Results

We propose a robust approach based on nonparametric statistics such as spatial signs and ranks. The new multivariate rank-based MDR (MR-MDR) is mainly suitable for analyzing multiple continuous phenotypes and is less sensitive to skewed distributions and outliers. MR-MDR utilizes fuzzy k-means clustering and classifies multi-locus genotypes into two groups. Then, MR-MDR calculates a spatial rank-sum statistic as an evaluation measure and selects the best interaction model with the largest statistic. Our novel idea lies in adopting nonparametric statistics as an evaluation measure for robust inference. We adopt tenfold cross-validation to avoid overfitting. Intensive simulation studies were conducted to compare the performance of MR-MDR with current methods. Application of MR-MDR to a real dataset from a Korean genome-wide association study demonstrated that it successfully identified genetic interactions associated with four phenotypes related to kidney function. The R code for conducting MR-MDR is available at https://github.com/statpark/MR-MDR.

Conclusions

Intensive simulation studies comparing MR-MDR with several current methods showed that the performance of MR-MDR was outstanding for skewed distributions. Additionally, for symmetric distributions, MR-MDR showed comparable power. Therefore, we conclude that MR-MDR is a useful multivariate non-parametric approach that can be used regardless of the phenotype distribution, the correlations between phenotypes, and sample size.

Peer Review reports

Background

Many attempts have been made to identify susceptible genes that influence the risk of complex diseases such as autism, hypertension, and diabetes [1,2,3]. Analyzing a single locus is not enough to understand the pathophysiology of complex diseases and results in the so-called missing heritability problem. To overcome this problem, several studies have sought to identify gene–gene interactions (GGIs) or gene-environmental interactions [4,5,6].

As a non-parametric model-free approach, multifactor dimensionality reduction (MDR) has been widely applied for detecting GGIs [5]. For binary phenotypes, such as those analyzed in case–control studies, MDR divides high-dimensional genotype combinations into a one-dimensional variable with two groups (high-risk and low-risk), according to whether the ratio of cases to controls exceeds a threshold. Then it finds the interaction model that best predicts the disease status. Balanced accuracy can be used for an evaluation measure [7]. To prevent overfitting, k-fold cross-validation (CV) can be applied. Cross-validation consistency (CVC), or the number of times each single-nucleotide polymorphism (SNP) combination is chosen as best, is obtained during the k-fold CV process. The SNP combination with the highest CVC is defined as the final best interaction model [8]. MDR has several advantages: i) the dimensions of the data are effectively reduced, ii) no specific genetic model is assumed, and iii) high-order interactions can be identified, even if there are no significant main effects [9, 10].

Since its introduction, many studies have been conducted to broaden the scope of application of MDR. According to Gola et al. [4], about 800 MDR-related studies were found as of February 2014 when searching PubMed and Google Scholar. For discrete traits, log-linear models MDR and robust MDR have been proposed to handle data with empty or sparse cells [11, 12]. Odds ratio MDR was proposed, replacing the naïve classifier with the odds ratio [9] and optimal MDR replaced the fixed threshold with a data-driven threshold using an ordered combinatorial partitioning algorithm [13]. As a method of dealing with continuous traits, generalized MDR (GMDR) was proposed. GMDR can handle both dichotomous and continuous phenotypes and can adjust for covariates [14]. Quantitative MDR (QMDR) for continuous traits uses the sample mean of each genotype combination as a classifier, reducing the computing time with comparable performance [15]. Recently, cluster-based MDR (CL-MDR) has been proposed as a method that is less sensitive to outliers and distributional assumptions [16]. For survival time with censored data, Surv-MDR was proposed, which uses the log-rank test statistic to classify the cells of a multifactor combination [17]. Cox-MDR and accelerated failure time MDR are extended versions of GMDR for the survival phenotype based on Cox regression and the accelerated failure time model, respectively [18, 19]. Recently, Kaplan–Meier MDR was also developed, which uses the median Kaplan–Meier survival time as a classifier [20].

As described above, many studies have been conducted to identify genetic interactions associated with single phenotype, but only a few studies have been done on methodologies to treat multiple phenotypes. For complex diseases, several correlated traits are often measured together. For example, weight, the waist-hip ratio, and the body mass index (BMI) can be jointly measured as obesity-related traits. The power to detect associations between genes and these traits is expected to increase if the multivariate approach is adopted [21]. Therefore, more research on multivariate methods detecting GGIs is needed. Recently, multivariate generalized MDR (GEE-GMDR) extended GMDR to the multivariate case by constructing generalized estimation equation models [22]. GEE-GMDR provides stable results, but it does not always show higher power than GMDR [23]. Multivariate QMDR (multi-QMDR) extended QMDR using principal component analysis scores and Hotelling’s T2 statistic as a classifier and an evaluation measure instead [23]. Multi-QMDR gives a high CVC and stable results, but it is prone to outliers or influencing points. More recently, multivariate cluster-based MDR (multi-CMDR) has been proposed [24]. Multi-CMDR applies fuzzy k-means clustering to discriminate between high- and low-risk groups and uses Hotelling’s T2 statistic for evaluation. Multi-CMDR is less sensitive to outliers and non-symmetric distributions.

While MDR is a nonparametric approach, all these methods use parametric test statistics as evaluation measures based on a multivariate normal distribution or exponential family distribution. Instead of using parametric approach, this study considers non-parametric evaluation measures for testing the equality of two multivariate populations. Various methods based on multivariate ranks or distances between the pairs of individual observations have been studied [25]. Note that signs and ranks in a univariate case are based on the ordering of the data. Unfortunately, however, there is no natural ordering of the data for a multivariate case. To extend the concept of rank to a multivariate case, several principles have been considered. First, the methods using interdirection were introduced [26, 27]. Interdirection is a measure based on the angular distance between two observation vectors relative to the rest of the data [28]. Second, the tests based on data depth were proposed [29,30,31]. A data depth measures how deep a multivariate sample lies in the data cloud [32]. Any function which provides a reasonable central-outward ordering of points in multidimensional space can be considered as a depth function [33]. Third, multivariate extensions using spatial signs and ranks were also studied [34, 35]. Affine invariant methods based on spatial sign and rank vectors for various multivariate problems were proposed [34]. More recently, for high-dimensional data, a nonparametric multivariate test using spatial signs [36] and a spatial ranks test for two samples were proposed [37]. Among various approaches and measures, we chose the spatial rank-sum statistic as the non-parametric evaluation measure in this study, because it is one of the most widely statistics and implemented with R program.

We propose a new non-parametric multivariate approach for identifying genetic interactions. We call the proposed method multivariate rank-based MDR (MR-MDR). During classification process, MR-MDR utilizes the fuzzy k-means clustering analysis with a noise cluster as in multi-CMDR. For the evaluation process, MR-MDR calculates the spatial rank-sum statistic as an evaluation measure and selects the best interaction model with the largest statistic. The tenfold CV method is adopted and the final best interaction model is determined by maximum CV consistency.

This manuscript is organized as follows. We first start with an introduction of the spatial rank statistic. The algorithm of the proposed MR-MDR method is then described in detail. We then present the results of intensive simulation studies to investigate the performance of the proposed method. Our method is compared with multi-QMDR and multi-CMDR. We applied the proposed MR-MDR method to data on four multivariate phenotypes related to kidney function obtained from the Korean Genome and Epidemiology Study (KoGES): blood urea nitrogen (BUN), serum creatinine, urinary albumin, and urinary red blood cell (RBC) levels. MR-MDR successfully identified genetic interactions associated with these four phenotypes.

Methods

Nonparametric multivariate rank test

We first introduce the multivariate non-parametric test used for evaluation. To detect a two-sample location shift in univariate analysis, the two-sample t-test is popularly used when the response variable is normally distributed. The Mann–Whitney test based on the rank sum is well known as a nonparametric counterpart of the two-sample univariate t-test. Various robust univariate non-parametric tests have been developed for the two-sample location problem [38].

For multivariate analysis, a classical approach such as Hotelling’s T2 is a popular parametric approach. T2 has optimal power under the assumption of a multivariate normal distribution. However, it performs poorly in the case of heavy-tailed distributions and is highly sensitive to outliers [39]. As an alternative, we consider a nonparametric approach based on spatial signs and ranks. We start with the definition of spatial sign and spatial rank.

Let \({\mathbf{Y}} = ({\mathbf{y}}_{{\mathbf{1}}} ,...,{\mathbf{y}}_{{\mathbf{n}}} )^{\prime }\) be an n × p data matrix with n observations and p variables. The spatial sign function and spatial rank function are defined as follows [28].

$$\begin{aligned} {\mathbf{S}}({\mathbf{y}}) & = \left\{ {\begin{array}{*{20}c} {||{\mathbf{y}}||^{ - 1} {\mathbf{y}},} & {{\mathbf{y}} \ne {\mathbf{0}}} \\ {{\mathbf{0}},} & {{\mathbf{y}} = {\mathbf{0}}} \\ \end{array} } \right., \\ {\mathbf{R}}({\mathbf{y}}) & = ave_{j} \{ {\mathbf{S}}{(}{\mathbf{y}} - {\mathbf{y}}_{{\text{j}}} {)}\} = \frac{1}{n}\sum\limits_{j = 1}^{n} {\{ {\mathbf{S}}{(}{\mathbf{y}} - {\mathbf{y}}_{{\text{j}}} } )\} \\ \end{aligned}$$

where avej means the average taken over all observations for j = 1,…n, ||y|| is the Euclidean distance of y from 0, and y and yj are p-variate vectors [28]. The observed spatial signs si and observed centered spatial ranks ri are defined as

$${\mathbf{s}}_{i} = {\mathbf{S}}({\mathbf{y}}_{i} ),$$

and

$${\mathbf{r}}_{i} = ave_{j} \{ {\mathbf{s}}_{ij} \} = \frac{1}{n}\sum\limits_{j = 1}^{n} {\{ {\mathbf{S}}({\mathbf{y}}_{i} - {\mathbf{y}}_{j} )\} } ,$$

respectively for i, j = 1, …, n. Here, \({\mathbf{s}}_{ij} = {\mathbf{S}}({\mathbf{y}}_{i} - {\mathbf{y}}_{j} )\), ave{ri} = 0.

To make an affine-invariant test statistic, we can apply the spatial sign function to the transformed data points. The test statistic \(T({\mathbf{y}}_{1} ,...,{\mathbf{y}}_{n} )\) is said to be affine-invariant if \(T({\mathbf{y}}_{1} ,...,{\mathbf{y}}_{n} ) = T(D{\mathbf{y}}_{1} ,...,D{\mathbf{y}}_{n} )\) for every p × p nonsingular matrix D and for every p-variate dataset \({\mathbf{y}}_{1} ,...,{\mathbf{y}}_{n}\) [34]. Affine-invariant spatial signs and ranks are obtained by transforming yi to Ayyi,

$$\begin{aligned} {\mathbf{s}}_{i}^{*} & = {\mathbf{S}}(A_{y} {\mathbf{y}}_{i} ), \\ {\mathbf{r}}_{i}^{*} & = ave_{j} \{ {\mathbf{s}}_{ij}^{*} \} = \frac{1}{n}\sum\limits_{j = 1}^{n} {\{ {\mathbf{S}}(A_{y} ({\mathbf{y}}_{i} - {\mathbf{y}}_{j} )\} } \\ \end{aligned}$$

where Ay is Tyler’s transformation, which makes the spatial sign covariance matrix proportional to the identity matrix, that is, \(ave\{ {\mathbf{r}}_{i}^{{*{\prime }}} {\mathbf{r}}_{i}^{*} \} = [c_{y}^{2} /p]I_{p}\), where \(c_{y}^{2} = ave\{ ||{\mathbf{r}}_{i}^{*} ||^{2} \}\). Ay can be obtained during the iterative process and chosen so that \(trace(A_{y}^{\prime } A_{y} ) = p\) [34, 40, 41]. The ranks \({\mathbf{r}}_{i}^{*}\) lie in the unit p-sphere. The direction of \({\mathbf{r}}_{i}^{*}\) roughly points outward from the center of the data cloud and its length tells how far away this point is from the center of the data cloud [42].

Next, for the two-samples location problem, let \({\mathbf{Y}}_{1} = ({\mathbf{y}}_{1} ,...,{\mathbf{y}}_{{{\mathbf{n}}_{1} }} )^{\prime }\) and \({\mathbf{Y}}_{2} = ({\mathbf{y}}_{{{\mathbf{n}}_{1} + 1}} ,...,{\mathbf{y}}_{{{\mathbf{n}}_{1} + {\mathbf{n}}_{2} }} )^{\prime }\) be two independent samples with p variables that have the cumulative distribution functions \(F({\mathbf{x}} - {{\varvec{\upmu}}})\) and \(F({\mathbf{x}} - {{\varvec{\upmu}}} - {{\varvec{\Delta}}})\), respectively. To test the null hypothesis of no differences in location, H0:Δ:0, the multivariate version of Mann–Whitney test statistic U2 can be used. For the combined sample Y = [Y1:Y2], the affine-transformed spatial signs of the transformed differences \({\mathbf{s}}_{ij}^{*} = {\mathbf{S}}(A_{y} ({\mathbf{y}}_{i} - {\mathbf{y}}_{j} ))\) and spatial ranks \({\mathbf{r}}_{i}^{*} = ave_{j} \{ {\mathbf{S}}_{ij}^{*} \}\) are obtained. The multivariate Mann–Whitney test statistic U2 is given by

$$U^{2} = \frac{p}{{c_{y}^{2} }}\sum\limits_{i = 1}^{2} {n_{i} ||}^{{}} {\overline{\mathbf{r}}}_{i}^{*} ||^{2}$$

where \({\overline{\mathbf{r}}}_{i}^{*}\) is the sample-wise mean vector of the spatial ranks \({\mathbf{r}}_{i}^{*}\) and \(c_{y}^{2} = ave\{ ||{\mathbf{r}}_{i}^{*} ||^{2} \}\) [34]. The p-value is obtained by \(E_{\eta } (I(U_{\gamma }^{2} \ge U^{2} ))\), where \(E_{\eta } ( \cdot )\) represents expectation value where \(\eta = (\eta_{1} ,...,\eta_{N} )\) is uniformly distributed over N! permutations of \(\left( {1, \ldots ,N} \right)\) and \(U_{\gamma }^{2}\) is the value of the test statistic of a permuted sample. As the dimension p increases and the distribution becomes heavier-tailed, the performance of U2 improves relative to Hotelling’s T2 statistic [34].

We investigated the effect of Tyler’s transformation on yi through an empirical study using a toy example. Two multivariate distributions of the response variables were considered: a bivariate normal distribution and a bivariate gamma distribution. The correlations between two response variables were set to 0.4 and 0.8. The original data were transformed to spatial signs, and then spatial centered ranks were obtained by averaging the spatial signs of differences. Figure 1 shows the spatial signs and ranks with and without Tyler’s transformation. Spatial signs are represented as directions from the origin with unit length, and thus all the spatial signs lie on a circle of radius 1. The spatial signs with Tyler's transformation tend to be more evenly arranged around the circumference than the spatial signs without Tyler's transformation. The spatial ranks tend to spread evenly, as if they are uniformly distributed within a circle, for the Tyler’s transformation case. Note that the spatial ranks before and after Tyler’s transformation appear different when the correlation is high (r = 0.8); however, the change due to the transformation is negligible if the correlation is moderate (r = 0.4).

Fig. 1
figure1

Examples of spatial signs and ranks for two bivariate distributions

MR-MDR procedure

There are many variations of MDR methods for finding GGIs. However, most MDR approaches have two core procedures: how to classify the cells into two groups and how to evaluate the interaction models. The proposed MR-MDR adopts fuzzy clustering technique in the classification process as in multi-CMDR [24]. For the evaluation process, MR-MDR uses a multivariate spatial rank test statistic. Figure 2 shows the flow of the MR-MDR procedure. The detailed algorithm of MR-MDR is as follows.

Fig. 2
figure2

Overview of the MR-MDR algorithm for tenfold cross-validation and second-order interactions

Step 0. Data

  • Suppose there are n* samples, with p SNP data-points and q continuous phenotypes.

  • Standardize all the phenotypes to have a mean of zero and a unit variance. Let \({\mathbf{Y}}_{i} = (y_{i1} ,y_{i2} , \ldots ,y_{iq} )^{T}\) be the standardized phenotype vector and \({\mathbf{X}}_{i} = (x_{i1} ,x_{i2} , \ldots ,x_{ip} )^{T}\) be the genotype vector for the ith subject, respectively.

Step 1. Global process: clustering

  • Perform fuzzy k-means clustering with k = 2 using phenotype information. We add a noise cluster such that noisy data points may be assigned to the noise class [43]. Remove all the samples in the noise cluster. The remaining samples have membership degrees for each of the two clusters. Denote these two clusters as C1 and C2.

  • Define the global ratio θ as

    $$\theta = {{\sum\limits_{i = 1}^{n} {D_{i1} } } \mathord{\left/ {\vphantom {{\sum\limits_{i = 1}^{n} {D_{i1} } } {\sum\limits_{i = 1}^{n} {D_{i2} } }}} \right. \kern-\nulldelimiterspace} {\sum\limits_{i = 1}^{n} {D_{i2} } }},$$

    where n is the number of remaining samples after deleting noise samples and Dik is the membership degree of the ith subject in cluster Ck, (k = 1, 2) [24].

  • For cross-validation (CV), split the samples randomly into 10 subgroups of equal size. Let nine sets of samples be the training dataset and the remaining dataset be the test dataset used for evaluating the model.

Step 2. Local process: classification

  • To find mth-order GGIs, select a set of m SNPs from a pool of SNPs.

  • Calculate the local ratio θj for the jth genotype combination in the training set,

    $$\theta_{j} = {{\sum\limits_{i = 1}^{{n_{j} }} {D_{ij1} } } \mathord{\left/ {\vphantom {{\sum\limits_{i = 1}^{{n_{j} }} {D_{ij1} } } {\sum\limits_{i = 1}^{{n_{j} }} {D_{ij2} } }}} \right. \kern-\nulldelimiterspace} {\sum\limits_{i = 1}^{{n_{j} }} {D_{ij2} } }},\quad \left( {j = 1, \ldots ,3^{m} } \right),$$

    where Dijk is the membership degree of the ith subject with the jth genotype combination in cluster Ck.

  • Label each genotype combination either “H” if θj ≥ θ, or “L” if θj < θ.

Step 3. Local process: evaluation

  • Obtain spatial ranks of Yi for the combined samples from two clusters for i = 1, 2, …, n.

    $$\begin{aligned} r_{i}^{*} & = ave_{j} \{ {\mathbf{S}}(A_{y} ({\mathbf{y}}_{i} - {\mathbf{y}}_{j} )\} \\ & = \frac{1}{n}\sum\limits_{j = 1}^{n} {\frac{{A_{y} ({\mathbf{y}}_{i} - {\mathbf{y}}_{j} )}}{{\sqrt {(A_{y} ({\mathbf{y}}_{i} - {\mathbf{y}}_{j} ))^{2} } }}} \\ \end{aligned}$$

    where S(·)is a sign function and Ay is Tyler’s transformation.

  • Calculate the multivariate Mann–Whitney test statistic:

    $$U^{2} = \frac{p}{{c_{y}^{2} }}\sum\limits_{k = 1}^{2} {n_{k} ||}^{{_{{}} }} {\overline{\mathbf{R}}}_{k} ||^{2}$$

    where p is the number of variables, nk is the number of samples in cluster Ck, \({\overline{\mathbf{R}}}_{k}\) is the mean vectors of the spatial ranks for cluster Ck, and \(c_{y}^{2} = ave\{ ||{\mathbf{R}}_{k} ||^{2} \}\).

  • Obtain U2 for every m SNP combination. Choose the SNP combination with the largest U2 statistic as the best mth-order interaction model in the training data.

Step 4. Selection process: best interaction model

  • Repeat step 2 and 3 for each fold and count the number of specific SNP combinations chosen for the best model. We call this cross-validation consistency (CVC).

  • Select the model with the largest CVC as best final interaction model.

  • Derive the final rank sum statistic for the best model by averaging all statistics for the test data.

  • Calculate the empirical p-value by a permutation test.

Results

Simulation analysis

To compare the performance of the proposed MR-MDR with other existing methods, we conducted simulations for various situations. We considered 20 SNPs, including two-way disease-causal SNPs. For each of the combinations of seven different heritability values (0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.4) and two minor allele frequency (MAF) values (0.2, 0.4), five different interaction models (M1-M5) were considered. Typically, penetrance rate is defined as the proportion of individuals with a given genotype who exhibit the phenotype associated with that genotype. However, it is not appropriate for continuous phenotypes and there is no clear definition of continuous phenotype. For QMDR, the penetrance for continuous phenotypes was defined as a function of mean [15]. Similarly, we set the penetrance rate as a function of mean of the response variable in each genotype. Thus, a total of 70 epistatic models with various penetrance functions were generated, as done by Velez et al. [7]. All the models had little marginal effect. For the phenotype distribution, we considered a bivariate normal distribution and a bivariate gamma distribution. The correlations of the bivariate phenotypes also varied (0, 0.25, 0.5). Sample sizes of 100, 200, and 400 were considered. Finally, a total of 1000 replicated data sets were generated for 1260 (= 70 × 2 × 3 × 3) combinations.

We used both a Tyler’s-transformed version (MR-MDR_transformed) and an untransformed version (MRMDR_ untransformed). For the purpose of comparison, multi-CMDR and multi-QMDR methods were also used. All three summary statistics of the multi-QMDR—the first principal component (FPC), weighted sum of principal components (WPC), and weighted squared sum of principal components (WSPC)—were included. To compare the data with the univariate approach, QMDR was also performed for each phenotype separately. Ten-fold CV was considered. A summary of the simulation scheme is shown in Table 1.

Table 1 Summary of the simulation scheme

Since the epistatic models given by Velez et al. [7] were devised only for the discrete phenotype, we modified them to handle continuous phenotypes. Let \({\mathrm{SNP}}_{1}\) and \({\mathrm{SNP}}_{2}\) be the two true causal SNPs, \({\mathbf{Y}} = (Y_{1} ,Y_{2} )^{T}\) the continuous bivariate phenotypes, and fij the penetrance function for the ith genotype of \({\mathrm{SNP}}_{1}\) and jth genotype of \({\mathrm{SNP}}_{2}\) in [7]. For the bivariate normal distribution, \({\mathbf{Y}} = (Y_{1} ,Y_{2} )^{T}\) was generated by

$${\mathbf{Y}}|({\text{SNP}}_{{1}} = i,{\text{ SNP}}_{2} = j)\sim MN({{\varvec{\upmu}}}_{ij} ,{{\varvec{\Sigma}}}),$$

where \({{\varvec{\upmu}}}_{ij} = \left( {\begin{array}{*{20}c} {f_{ij} } \\ {f_{ij} } \\ \end{array} } \right)\) and \({{\varvec{\Sigma}}} = \left( {\begin{array}{*{20}c} 1 & \rho \\ \rho & 1 \\ \end{array} } \right)\). We used the mvrnorm() function of the MASS package in R. Three different values of ρ (0, 0.025, and 0.5) were considered, as mentioned above.

For the skewed asymmetrically distributed phenotypes, we used the copula-based multivariate gamma model [44]. A copula-based bivariate gamma distribution is given by

$$f(y_{1} ,y_{2} ) = c(u,{{\varvec{\Sigma}}})\prod\limits_{k = 1}^{2} {f_{k} (y_{k} )} ,$$

where \(c(u,{{\varvec{\Sigma}}}) = |{{\varvec{\Sigma}}}|^{\frac{1}{2}} \exp \left[ { - \frac{{{\tilde{\mathbf{u}}}^{{\prime }} ({{\varvec{\Sigma}}}^{ - 1} - {\rm I}){\tilde{\mathbf{u}}}}}{2}} \right],\tilde{u} = (\Phi^{ - 1} (u_{1} ), \, \Phi^{ - 1} (u_{2} ))^{{\prime }}\), and \(u_{k} = {\mathbf{F}}_{k} (y_{k} )\) for k = 1, 2. Here fk and Fk are the marginal probability density function and cumulative distribution function of the kth phenotype, respectively, and Φ−1 is the inverse of the cumulative distribution of the standard normal distribution function. In this simulation, we set

$$y_{k} |({\text{SNP}}_{1} = i,{\text{ SNP}}_{2} = j)\sim Gamma(2f_{ij} ,0.5)$$

for k = 1, 2. We used the mvdc(), rMvdc(), normalCopula() functions in the copula package in R.

The power was estimated by the hit ratio, which is the proportion of times that each method correctly chose the causal SNP pairs (\({\mathrm{SNP}}_{1}\) and \({\mathrm{SNP}}_{2}\)) as the best model among all possible two-way interaction models out of each set of 1000 datasets. Figure 3 shows the hit ratios of the eight different methods for the bivariate normal distribution. The power of the multi-QMDR using FPC (multi-QMDR_FPC) was slightly higher than that of the proposed MR-MDR when there was no correlation between phenotypes. However, the difference between multi-QMDR_FPC and MR-MDR decreased as the correlation became stronger. The performances of the transformed one (MR-MDR_transformed) and untransformed (MR-MDR_untransformed) one were almost the same. Multi-QMDR with WSPC (multi-QMDR_WSPC) showed lower power even than the QMDR method.

Fig. 3
figure3

Hit ratios for a bivariate normal distribution

Figure 4 shows the hit ratios for a bivariate gamma distribution. The proposed MR-MDR outperformed the other methods for all ranges of heritability. There was little difference between the performance of the two versions of MR-MDR, and the differences between them were less than 0.01 in all situations. The power of multi-CMDR was the next highest. It should be noted that multi-CMDR uses the fuzzy clustering approach to split data as in MR-MDR. The gap between MR-MDR and other methods became larger as the sample size decreased or the correlation became stronger. Multi-QMDR-FPC and multi-QMDR using the WPC (multi-QMDR-WPC) showed lower power than MR-MDR and multi-CMDR, but higher power than QMDR. The performance of multi-QMDR-WSPC was still poor, although the difference was less than in bivariate normal distribution.

Fig. 4
figure4

Hit ratios for a bivariate gamma distribution

Through these simulation studies, we demonstrated that the proposed MR-MDR outperformed the other existing methods for all ranges of heritability when the phenotypes were asymmetrically distributed. However, when the phenotypes are symmetrically distributed, as in a normal distribution, all methods yielded similar performance.

Three additional simulations were conducted to find out the further properties of the proposed method. The robustness of fuzzy k-means clustering, the effect of noise cluster size, and the effect of outliers were investigated. First, to explore the robustness of the fuzzy k-means clustering in MR-MDR algorithm, we performed a comparison study to investigate the effect of log-transformation on phenotypes which were generated from the bivariate gamma distribution. The power of MR-MDR after log transformation was obtained for each seven heritabilities. We set the correlation coefficient between two phenotypes 0.25 and the sample size 200. The average power of MR-MDR for five interaction model (M1-M5) for 1000 random samples are given in Table 2. The power slightly reduced after log-transformation. Therefore, we can conclude that the fuzzy k-means clustering is robust to the skewed distribution and does not require any further transformation of original data.

Table 2 Hit ratios of MR-MDR according to log transformation for a bivariate gamma distribution (r = 0.25, n = 200)

Secondly, we investigate the efficiency according to the size of noise cluster since the large size of the noise cluster can be a source of loss of efficiency. The size of noise cluster depends on the noise distance δ, which needs to be chosen in advance. If δ is too large, outliers are not removed and are classified as a real cluster. On the other hand, if δ is too small, many observations can be misplaced into the noise cluster. Still, the estimation of the optimal value of δ is an open-problem [45]. In our approach, we used an iterative procedure to determine the value of δ optimally, which is the default of FKM.noise procedure in R.

To check the efficiency, we compared the performance with various values of δ. The results are shown in Table 3. For both bivariate normal and bivariate gamma distribution, the average power of MR-MDR for five interaction models (M1-M5) are given in Table 3. This new simulation result shows that the power using the iterative δ yielded the highest hit ratios for all heritabilities. Note that in this setting outliers were not generated. When there were outliers, a smaller noise cluster would have been created.

Table 3 Hit ratios of MR-MDR according to the noise distance δ (r = 0.25, n = 200)

Thirdly, we have conducted an additional simulation to investigate the effects of outliers. The power of MR-MDR for the datasets with or without outliers was obtained for each seven heritabilities. We set the correlation coefficient between two phenotypes 0.25 and the sample size 200 as in Table 3. The power for five interaction models (M1-M5) for 1000 random samples were obtained. For the datasets with outliers, about 5% of the data were replaced by outliers. For both phenotypes, outliers were generated by adding three times of IQR for the randomly chosen 5% samples. The results are summarized in Table 4. In the presence of outliers, the power tends to decrease for all methods. The differences were the smallest in MR-MDR, indicating the minimum power loss of MR-MDR. As a result, MR-MDR showed much higher power than other methods in the presence of outlying observations.

Table 4 Hit ratios of MR-MDR according to the presence or absence of outliers (r = 0.25, n = 200)

Real data analysis

We illustrate the proposed MR-MDR method by analyzing data from the KoGES [46]. The data were collected from two recruitment areas. Each region is a cohort representing city (Ansan) and rural areas (Anseong). After standard quality control procedures for the subjects and SNPs, a total of 8842 participants and 327,872 SNPs were used for this analysis.

We considered four phenotypes related to kidney function: BUN, serum creatinine, urinary albumin levels, and urinary RBC levels. Traditionally, BUN and serum creatinine levels have been used as surrogate markers of kidney function deterioration [47]. The amounts of albumin and RBC in urine also could be good indicators of kidney disease. Although there have been some studies on associations between genes and kidney-related phenotypes, few studies have performed GGI analyses for these phenotypes [48, 49]. In the case of albumin, the urine test is known to be more accurate than in the case of blood test, so the urine test result is used here. However, urine tests are conducted only on a relatively small number of subjects, which resulted in missing observations in phenotypes. For this high rate of missing data, imputation for phenotypes is not appropriate. We removed observations with at least one missing phenotype value, and 3267 samples remained.

A linear model was separately fit to each phenotype with covariate adjustments for sex, age and recruitment area. Finally, 205 SNPs were selected that had significant marginal effects (p < 1 × 10−7). Residuals were used for the analysis to control for covariate effects. The largest correlation coefficient was 0.32, which was the correlation between BUN and creatinine. The distributions of the phenotypes were heavily skewed. Figure 5 shows scatter plots and box plots of the phenotypes.

Fig. 5
figure5

Scatterplot and boxplot of four phenotypes after adjustment by sex, age and recruitment area. The numbers in the scatter plot are correlation coefficients

We applied the proposed MR-MDR method to identify the best first- and second-order interaction models. Table 5 shows the center of the global clusters from fuzzy k-means clustering during step 1. Cluster centers were determined by the means of each phenotype across samples belonging to the cluster. The clusters differed greatly for BUN and serum creatinine. There seemed to be no difference in RBC levels between the two clusters. Since the higher values of BUN and creatinine indicate abnormal kidney function, we can interpret the first cluster as a low-risk group and the second cluster as a high-risk group.

Table 5 Average of phenotypes for global clusters

We selected the SNP pair with the largest CVC as the best interaction model for each order. The test score was the average of spatial rank sum statistics calculated from the test data set. Empirical p-values were obtained by comparing the test score with the statistic from the permuted dataset generated by shuffle phenotype vectors. If the score calculated with the permuted data exceeded our score, that case was counted. Then p-value was calculated as a/b, where a is the number of cases with a permuted score higher than the original score and b is the total number of trials.

List of the interaction models that had the highest training score at least once during the tenfold CV process by MR-MDR is shown in Table 6. For the first-order interaction, rs41476549 had the highest CVC and was selected as the most relevant to the four phenotypes. rs790410 was selected as the best once during 10 CV processes but showed the highest score. All SNPs except rs17168600 had a p-value lower than 0.05. For the second-order interaction model, the pair of rs41476549 and rs1117105 showed the highest CVC, while the pair of rs790410 and rs961413 yielded the highest test score. However, the CVC value was low in most cases. Among the selected SNPs, rs16862782 on chromosome 3 has been reported to be related to BUN in Korean and to serum urea measurement in European [47, 50]. rs4686914 on chromosome 3 is also known to be related to the kidney function in European and East Asian [47, 51]. Both SNPs are mapped to LINC01991 gene. rs17586294 maps to TUBBP11 gene. To the best of our knowledge, there are no studies that have analyzed kidney-related GGI in a multivariate manner. Therefore, none of the interactions of the selected pair of SNPs have ever been reported.

Table 6 Best interaction models identified by MR-MDR, multi-CMDR and multi-QMDR

We also applied multi-CMDR and multi-QMDR methods for comparison. Only multi-QMDR using FPC was considered for comparison, because it showed the highest power among three types of multi-QMDRs. There are some similarities between the results of MR-MDR and multi-CMDR. However, the results are totally different for multi-QMDR, which are expected to be caused by some outlying observations. For example, the pair of rs17014894 and rs10517358 was chosen as the best interaction model. However, this pair suffers from sparsity and outliers. In particular, there are four cells with zero counts and one cell with one count having an outlying observation.

Figure 6 shows the box plots of the phenotypes after removing the noise cluster for the genotype combinations of the best model selected by MR-MDR. The distribution of each phenotype was different depending on the genotype combination, suggesting that there was an interaction effect.

Fig. 6
figure6

Box plots of four phenotypes after removing the noise cluster for the best SNP combination identified by MR-MDR ((i, j): ith genotype for rs1117105 and jth genotype for rs41476549, s creatinine, ALBU albmin)

To evaluate pure interaction effects for continuous phenotypes, we adopted the classical linear model. For the selected SNP combinations, we fit multivariate analysis of variance (MANOVA) model including two main effects and the interaction effect. The SNP effects were tested for the interaction effect only (H01:β12 = 0) and for the total effects including both main and interaction effects (H02:β1 = β12=0 and H03:β2 = β12=0). The results nine SNP pairs are summarized in Table 7. None of the selected SNP pairs showed significant interaction effects (p > 0.05), while some pairs showed significant total effects. This is because most MDR methods tend to choose a model with a large total effect ignoring pure interaction effects. However, our further empirical study showed that the introduction of noise cluster by fuzzy k-means increased the power of detecting interaction effects. The same MANOVA model were fit to the trimmed data after removing the noise cluster by MR-MDR. As expected, several significant interaction effects were successfully identified after trimming. Although MANOVA requires a Gaussian assumption, the process removing the noise cluster in the proposed method had the effect of yielding more robust and reliable MANOVA results. Among the nine SNP pairs with non-significant interactions, four pairs showed significant interaction effects. The results are summarized in the last three columns of Table 7.

Table 7 p-values from MANOVA test for the SNP combination selected by MR-MDR

Discussion

The proposed MR-MDR method is a non-parametric approach assuming no particular genetic model. To assign samples to two risk groups, MR-MDR uses the fuzzy clustering technique, as in the multi-CMDR method. MR-MDR uses the spatial rank sum statistic as evaluation measure for comparing two groups whereas Hotelling’s T2 statistic is used in multi-CMDR and multi-QMDR. Therefore, robust results can be expected in MR-MDR for skewed distributions or those with outliers.

When calculating the spatial rank statistic, a data-driven transformation matrix was needed to make the statistic invariant. It is known that an affine-invariant test performs better than noninvariant angle coordinate-wise sign tests when there is significant deviation from spherical symmetry caused by the presence of correlations among observed variables. Moreover, the affine-invariant test outperforms Hotelling’s T2-test for heavy-tailed non-normal multivariate distributions [52]. As can be seen in our toy example, the invariant statistic differed from the untransformed statistic, especially in the presence of a high correlation between phenotypes.

The problem with using Tyler's transformation statistic is that it takes much longer to calculate than the untransformed statistic. However, the change of the spatial ranks due to the Tyler’s transformation is trivial even when the correlation is moderate (r = 0.4), as seen in the toy example. Moreover, the simulation results showed that there was little improvement in performance compared to untransformed versions of MR-MDR. This phenomenon was also observed in the case of negative correlation. Therefore, we recommend using the untransformed MR-MDR version if the absolute value of correlation between phenotypes is not too high (e.g., |r|< 0.5).

To apply the proposed method for GWAS data, we considered a filtering procedure to reduce the number of SNPs to be investigated. We selected SNPs with significant marginal effects and investigated the interactions. Since MDR is an exhaustive method, this kind of filtering is needed. However, this filtering process can lead to ignoring gene–gene interactions for the SNPs with weak marginal effects.

Conclusions

In this paper, we proposed the MR-MDR method to detect the best interaction model for multivariate quantitative traits. MR-MDR is based on the fuzzy clustering technique and spatial rank-sum statistic. Intensive simulation studies comparing MR-MDR with several current methods showed that the performance of MR-MDR was outstanding for skewed distributions. The difference in power between the MR-MDR and other methods increased as the sample size became smaller and the correlation became stronger. Additionally, for symmetric distributions, MR-MDR showed comparable power. Therefore, we conclude that MR-MDR is a useful multivariate non-parametric approach that can be used regardless of the phenotype distribution, the correlations between phenotypes, and sample size.

Availability of data and materials

The Korea Association Resource (KARE) project data will be publicly distributed by the Distribution Desk of Korea Biobank Network (https://koreabiobank.re.kr/). The data request should be made directly to Distribution Desk of Korea Biobank Network. Any inquiries should be sent to admin@koreabiobank.re.kr.

Abbreviations

GGI:

Gene–gene interaction

MDR:

Multifactor dimensionality reduction

CV:

Cross validation

CVC:

Cross validation consistency

SNP:

Single nucleotide polymorphism

BMI:

Body mass index

BUN:

Blood urea nitrogen

RBC:

Red blood cell

MAF:

Minor allele frequency

FPC:

First principal component

WPC:

Weight sum of principal components

WSPC:

Weighted squared sum of principal components

MANOVA:

Multivariate analysis of variance

References

  1. 1.

    Grove J, Ripke S, Als TD, Mattheisen M, Walters RK, Won H, Pallesen J, Agerbo E, Andreassen OA, Anney R, et al. Identification of common genetic risk variants for autism spectrum disorder. Nat Genet. 2019;51(3):431–44.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  2. 2.

    McCarthy MI, Zeggini E. Genome-wide association studies in type 2 diabetes. Curr Diab Rep. 2009;9(2):164–71.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  3. 3.

    Levy D, Ehret GB, Rice K, Verwoert GC, Launer LJ, Dehghan A, Glazer NL, Morrison AC, Johnson AD, Aspelund T, et al. Genome-wide association study of blood pressure and hypertension. Nat Genet. 2009;41(6):677–87.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  4. 4.

    Gola D, Mahachie John JM, van Steen K, Konig IR. A roadmap to multifactor dimensionality reduction methods. Brief Bioinform. 2016;17(2):293–308.

    PubMed  Article  PubMed Central  Google Scholar 

  5. 5.

    Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. 2001.

  6. 6.

    Ritchie MD, Van Steen K. The search for gene–gene interactions in genome-wide association studies: challenges in abundance of methods, practical considerations, and biological interpretation. Ann Transl Med. 2018;6(8):157.

    PubMed  PubMed Central  Article  Google Scholar 

  7. 7.

    Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol. 2007;31(4):306–15.

    PubMed  Article  PubMed Central  Google Scholar 

  8. 8.

    Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006;241(2):252–61.

    PubMed  Article  PubMed Central  Google Scholar 

  9. 9.

    Chung Y, Lee SY, Elston RC, Park T. Odds ratio based multifactor-dimensionality reduction method for detecting gene–gene interactions. Bioinformatics. 2007;23(1):71–6.

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  10. 10.

    Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene–gene and gene-environment interactions. Bioinformatics. 2003;19(3):376–82.

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  11. 11.

    Lee SY, Chung Y, Elston RC, Kim Y, Park T. Log-linear model-based multifactor dimensionality reduction method to detect gene gene interactions. Bioinformatics. 2007;23(19):2589–95.

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  12. 12.

    Gui J, Andrew AS, Andrews P, Nelson HM, Kelsey KT, Karagas MR, Moore JH. A robust multifactor dimensionality reduction method for detecting gene–gene interactions with application to the genetic analysis of bladder cancer susceptibility. Ann Hum Genet. 2011;75(1):20–8.

    PubMed  Article  PubMed Central  Google Scholar 

  13. 13.

    Hua X, Zhang H, Zhang H, Yang Y, Kuk AYC. Testing multiple gene interactions by the ordered combinatorial partitioning method in case–control studies. Bioinformatics. 2010;26(15):1871–8.

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  14. 14.

    Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, Elston RC, Li MD. A generalized combinatorial approach for detecting gene-by-gene and gene-by-environment interactions with application to nicotine dependence. Am J Hum Genet. 2007;80(6):1125–37.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  15. 15.

    Gui J, Moore JH, Williams SM, Andrews P, Hillege HL, van der Harst P, Navis G, Van Gilst WH, Asselbergs FW, Gilbert-Diamond D. A simple and computationally efficient approach to multifactor dimensionality reduction analysis of gene–gene interactions for quantitative traits. PLoS ONE. 2013;8(6):e66545–e66545.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  16. 16.

    Lee Y, Park M, Park T, Kim H. Gene–gene interaction analysis for quantitative trait using cluster-based multifactor dimensionality reduction method. Int J Data Min Bioinform. 2018;20(1):66.

    Article  Google Scholar 

  17. 17.

    Gui J, Moore JH, Kelsey KT, Marsit CJ, Karagas MR, Andrew AS. A novel survival multifactor dimensionality reduction method for detecting gene–gene interactions with application to bladder cancer prognosis. Hum Genet. 2011;129(1):101–10.

    PubMed  Article  PubMed Central  Google Scholar 

  18. 18.

    Lee S, Kwon MS, Oh JM, Park T. Gene–gene interaction analysis for the survival phenotype based on the Cox model. Bioinformatics. 2012;28(18):i582–8.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  19. 19.

    Oh S, Lee S. An extension ofmultifactor dimensionality reduction method for detecting gene–gene interactions with the survival time. J Korean Data Inf Sci Soc. 2014;25(5):1057–67.

    Google Scholar 

  20. 20.

    Park M, Lee JW, Park T, Lee S. Gene–gene interaction analysis for the survival phenotype based on the Kaplan–Meier median estimate. Biomed Res Int. 2020;2020:5282345.

    PubMed  PubMed Central  Google Scholar 

  21. 21.

    Oh S, Huh I, Lee SY, Park T. Analysis of multiple related phenotypes in genome-wide association studies. J Bioinform Comput Biol. 2016;14(5):1644005.

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  22. 22.

    Choi J, Park T. Multivariate generalized multifactor dimensionality reduction to detect gene–gene interaction. BMC Syst Biol. 2013;6:66.

    Google Scholar 

  23. 23.

    Yu W, Kwon MS, Park T. Multivariate quantitative multifactor dimensionality reduction for detecting gene–gene interactions. Hum Hered. 2015;79(3–4):168–81.

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  24. 24.

    Kim H, Jeong H-B, Jung H-Y, Park T, Park M. Multivariate cluster-based multifactor dimensionality reduction to identify genetic interactions for multiple quantitative phenotypes. Biomed Res Int. 2019;2019:4578983.

    PubMed  PubMed Central  Google Scholar 

  25. 25.

    Anderson MJ. A new method for non-parametric multivariate analysis of variance. Austral Ecol. 2001;26:32–46.

    Google Scholar 

  26. 26.

    Randles RH, Peters D. Multivariate rank tests for the two-sample location problem. Commun Stat. 1990;19(11):4225–38.

    Article  Google Scholar 

  27. 27.

    Dawn Peters RHR. A multivariate signed-rank test for the one-sample location problem. J Am Stat Assoc. 1990;85(410):552–7.

    Article  Google Scholar 

  28. 28.

    Sirkiä S, Taskinena S, Oja H. Symmetrised M-estimators of multivariate scatter. J Multivar Anal. 2007;98(8):1611–29.

    Article  Google Scholar 

  29. 29.

    Liu RY, Kesar S. A quality index based on data depth and multivariate rank tests. J Am Stat Assoc. 1993;88:252–60.

    Google Scholar 

  30. 30.

    Liu RY, Kesar S. Ordering directional data: concepts of data depth on circles and spheres. Ann Stat. 1992;20(3):1468–84.

    Article  Google Scholar 

  31. 31.

    Yijun Zuo XH. On the limiting distributions of multivariate depth-based rank sum statistics and related tests. Ann Stat. 2006;34(6):2879–96.

    Google Scholar 

  32. 32.

    Liu RY, Parelius JM, Kesar S. Multivariate analysis by data depth: descriptive statistics, graphics and inference, (with discussion and a rejoinder by Liu andSingh). Ann Stat. 1999;27(3):783–858.

    Article  Google Scholar 

  33. 33.

    Vencálek O. Concept of data depth and its applications. Mathematica. 2001;50(2):111–9.

    Google Scholar 

  34. 34.

    Oja H, Randles RH. Multivariate nonparametric tests. Stat Sci. 2004;19(4):598–605.

    Article  Google Scholar 

  35. 35.

    Choi K, Marden J. An approach to multivariate rank tests in multivariate analysis of variance. J Am Stat Assoc. 1997;92(440):1581–90.

    Article  Google Scholar 

  36. 36.

    LanWang BP, Li R. A high-dimensional nonparametric multivariate test for mean vector. J Am Stat Assoc. 2015;110(512):1658–69.

    Article  CAS  Google Scholar 

  37. 37.

    Chakraborty A, Chaudhuri P. Tests for high-dimensional data based on means, spatial signs and spatial ranks. Ann Stat. 2017;45(2):771–99.

    Article  Google Scholar 

  38. 38.

    Fried R, Dehling H. Robust nonparametric tests for the two-sample location problem. Stat Methods Appl. 2011;20(4):409–22.

    Article  Google Scholar 

  39. 39.

    Sirkiä S, Taskinen S, Nevalainen J, Oja H. Multivariate nonparametrical methods based on spatial signs and ranks: the R package SpatialNP. J Stat Softw. 2007;6:66.

    Google Scholar 

  40. 40.

    Tyler D. A distribution-free m-estimator of multivariate scatter. Ann Stat. 1987;15:66.

    Article  Google Scholar 

  41. 41.

    Oja H. Multivariate nonparametric methods with R: an approach based on spatial signs and ranks. Springer; 2010.

  42. 42.

    Liu RY. Data depth: robust multivariate analysis, computational geometry, and applications, vol. 72. American Mathematical Society; 2006.

  43. 43.

    Dave RN. Characterization and detection of noise in clustering. Pattern Recogn Lett. 1991;12(11):657–64.

    Article  Google Scholar 

  44. 44.

    Stitou Y, Lasmar N-E, Berthoumieu Y. Copulas based multivariate Gamma modeling for texture classification. 2012.

  45. 45.

    Cimino MGCA, Frosini G, Lazzerini B, Marcelloni F. On the noise distance in robust fuzzy C-means. In: Proceedings of world academy of science, engineering and technology; 2005. p. 1. ISSN 1307-6884.

  46. 46.

    Kim Y, Han B-G. group tK: Cohort Profile: the Korean Genome and Epidemiology Study (KoGES) Consortium. Int J Epidemiol. 2016;46(2):e20–e20.

    PubMed Central  Article  Google Scholar 

  47. 47.

    Lee J, Lee Y, Park B, Won S, Han JS, Heo NJ. Genome-wide association analysis identifies multiple loci associated with kidney disease-related traits in Korean populations. PLoS ONE. 2018;13(3):e0194044.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  48. 48.

    Freedman BI, Skorecki K. Gene–gene and gene-environment interactions in apolipoprotein L1 gene-associated nephropathy. Clin J Am Soc Nephrol. 2014;9(11):2006–13.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  49. 49.

    Tin A, Köttgen A. Genome-wide association studies of CKD and related traits. Clin J Am Soc Nephrol. 2020;6:66.

    Google Scholar 

  50. 50.

    Sinnott-Armstrong NTY, Amar D, Mars N, Benner C, Aguirre M, Venkataraman GR, Wainberg M, Ollila HM, Kiiskinen T, Havulinna AS, Pirruccello JP, Qian J, Shcherbina A, FinnGen F, Rodriguez F, Assimes TL, Agarwala V, Tibshirani R, Hastie T, Ripatti S, Pritchard JK, Daly MJ, Rivas MA. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat Genet. 2021;53(2):185–94.

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  51. 51.

    Thio CHL RA, van der Most PJ, Kamali Z, Vaez A, Smit JH, Penninx BWJH, Haller T, Mihailov E, Metspalu A, Damman J, de Borst MH, van der Harst P, Verweij N, Navis GJ, Gansevoort RT, Nolte IM, Snieder H; Lifelines Cohort Study group. Genome-wide association scan of serum urea in European populations identifies two novel loci. Am J Nephrol. 2019;ss49(3):193–202.

  52. 52.

    Chakraborty B, Chaudhuri P, Oja H. Operating transformation retransformation on spatial median and angle test. Stat Sin. 1998;8(3):767–84.

    Google Scholar 

Download references

Acknowledgements

Not applicable.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (2013M3A9C4078158, NRF-2021R1A2C1007788).

Author information

Affiliations

Authors

Contributions

Conceptualization, M.P.; methodology, M.P. and T.P.; software, H.J. and J.L.; formal analysis, H.J. and J.L.; writing—original draft preparation, M.P.; writing—review and editing, T.P. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Taesung Park.

Ethics declarations

Ethics approval and consent to participate

The study was reviewed and approved by the Institutional Review Board of Seoul National University (IRB No. E1908/001004). The patients/participants provided their written informed consent to participate in this study. All methods were carried out in accordance with relevant guidelines and regulations (Declaration of Helsinki).

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Park, M., Jeong, HB., Lee, JH. et al. Spatial rank-based multifactor dimensionality reduction to detect gene–gene interactions for multivariate phenotypes. BMC Bioinformatics 22, 480 (2021). https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-021-04395-y

Download citation

Keywords

  • Fuzzy clustering
  • Gene–gene interaction
  • Multifactor dimensionality reduction
  • Spatial rank statistic