Skip to main content
  • Research article
  • Open access
  • Published:

A fast least-squares algorithm for population inference

Abstract

Background

Population inference is an important problem in genetics used to remove population stratification in genome-wide association studies and to detect migration patterns or shared ancestry. An individual’s genotype can be modeled as a probabilistic function of ancestral population memberships, Q, and the allele frequencies in those populations, P. The parameters, P and Q, of this binomial likelihood model can be inferred using slow sampling methods such as Markov Chain Monte Carlo methods or faster gradient based approaches such as sequential quadratic programming. This paper proposes a least-squares simplification of the binomial likelihood model motivated by a Euclidean interpretation of the genotype feature space. This results in a faster algorithm that easily incorporates the degree of admixture within the sample of individuals and improves estimates without requiring trial-and-error tuning.

Results

We show that the expected value of the least-squares solution across all possible genotype datasets is equal to the true solution when part of the problem has been solved, and that the variance of the solution approaches zero as its size increases. The Least-squares algorithm performs nearly as well as Admixture for these theoretical scenarios. We compare least-squares, Admixture, and FRAPPE for a variety of problem sizes and difficulties. For particularly hard problems with a large number of populations, small number of samples, or greater degree of admixture, least-squares performs better than the other methods. On simulated mixtures of real population allele frequencies from the HapMap project, Admixture estimates sparsely mixed individuals better than Least-squares. The least-squares approach, however, performs within 1.5% of the Admixture error. On individual genotypes from the HapMap project, Admixture and least-squares perform qualitatively similarly and within 1.2% of each other. Significantly, the least-squares approach nearly always converges 1.5- to 6-times faster.

Conclusions

The computational advantage of the least-squares approach along with its good estimation performance warrants further research, especially for very large datasets. As problem sizes increase, the difference in estimation performance between all algorithms decreases. In addition, when prior information is known, the least-squares approach easily incorporates the expected degree of admixture to improve the estimate.

Background

The inference of population structure from the genotypes of admixed individuals poses a significant problem in population genetics. For example, genome wide association studies (GWAS) compare the genetic makeup of different individuals in order to extract differences in the genome that may contribute to the development or suppression of disease. Of particular interest are single nucleotide polymorphisms (SNPs) that reveal genetic changes at a single nucleotide in the DNA chain. When a particular SNP variant is associated with a disease, this may indicate that the gene plays a role in the disease pathway, or that the gene was simply inherited from a population that is more (or less) predisposed to the disease. Determining the inherent population structure within a sample removes confounding factors before further analysis and reveals migration patterns and ancestry [1]. This paper deals with the problem of inferring the proportion of an individual’s genome originating from multiple ancestral populations and the allele frequencies in these ancestral populations from genotype data.

Methods for revealing population structure are divided into fast multivariate analysis techniques and slower discrete admixture models [2]. Fast multivariate techniques such as principal components analysis (PCA) [2-8] reveal subspaces in the genome where large differences between individuals are observed. For case-control studies, the largest differences commonly due to ancestry are removed to reduce false positives [4]. Although PCA provides a fast solution, it does not directly infer the variables of interest: the population allele frequencies and individual admixture proportions. On the other hand, discrete admixture models that estimate these variables typically require much more computation time. Following a recent trend toward faster gradient-based methods, we propose a faster simpler least-squares algorithm for estimating both the population allele frequencies and individual admixture proportions.

Pritchard et al. [9] originally propose a discrete admixture likelihood model based on the random union of gametes for the purpose of population inference. In particular, their model assumes Hardy-Weinberg equilibrium within the ancestral populations (i.e., allele frequencies are constant) and linkage equilibrium between markers within each population (i.e., markers are independent). Each individual in the current sample is modeled as having some fraction of their genome originating from each of the ancestral populations. The goal of population inference is to estimate the ancestral population allele frequencies, P, and the admixture of each individual, Q, from the observed genotypes, G. If the population of origin for every allele, Z, is known, then the population allele frequencies and the admixture for each individual have a Dirichlet distribution. If, on the other hand, P and Q are known, the population of origin for each individual allele has a multinomial distribution. Pritchard et al. infer populations by alternately sampling Z from a multinomial distribution based on P and Q; and P and Q from Dirichlet distributions based on Z. Ideally, this Markov Chain Monte Carlo sampling method produces independent identically distributed samples (P,Q) from the posterior distribution P(P,Q|G). The inferred parameters are taken as the mean of the posterior. This algorithm is implemented in an open-source software tool called Structure[9].

The binomial likelihood model proposed by Pritchard et al. was originally used for datasets of tens or hundreds of loci. However, as datasets become larger, especially considering genome-wide association studies with thousands or millions of loci, two problems emerge. For one, linkage disequilibrium introduces correlations between markers. Although Falush et al. [10] extended Structure to incorporate loose linkage between loci, larger datasets also pose a computational challenge that has not been met by these sampling-based approaches. This has led to a series of more efficient optimization algorithms for the same likelihood model with uncorrelated loci. This paper focuses on improving computational performance, leaving the treatment of correlated loci to future research.

Tang et al. [11] proposed a more efficient expectation maximization (EM) approach. Instead of randomly sampling from the posterior distribution, the FRAPPE EM algorithm [11] starts with a randomly initialized Z, then alternates between updating the values of P and Q for fixed Z, and maximizing the likelihood of Z for fixed P and Q. Their approach achieves similar accuracy to Structure and requires much less computation time. Wu et al. [12] specialized the EM algorithm in FRAPPE to accommodate the model without admixture, and generalized it to have different mixing proportions at each locus. However, these EM algorithms estimate an unnecessary and unobservable variable Z, something that more efficient algorithms could avoid.

Alexander et al. [13] proposed an even faster approach for inferring P and Q using the same binomial likelihood model but bypassing the unobservable variable Z. Their close-source software, Admixture, starts at a random feasible solution for P and Q and then alternates between maximizing the likelihood function with respect to P and then maximizing it with respect to Q. The likelihood is guaranteed not to decrease at each step eventually converging at a local maximum or saddle point. For a moderate problem of approximately 10000 loci, Admixture achieves comparable accuracy to Structure and requires only minutes to execute compared to hours for Structure[13].

Another feature of Structure’s binomial likelihood model is that it allowed the user to input prior knowledge about the degree of admixture. The prior distribution for Q takes the form of a Dirichlet distribution with a degree of admixture parameter, α, for every population. For α = 0, all of an individual’s alleles originate from the same ancestral population; for α > 0, individuals contain a mixture of alleles from different populations; for α = 1, every assignment of alleles to populations is equally likely (i.e., the non-informative prior); and for α → ∞, all individuals have equal contributions from every ancestral population. Alexander et al. replace the population degree of admixture parameter in Structure with two parameters, λ and γ, that when increased also decrease the level of admixture of the resulting individuals. However, the authors admit that tuning these parameters is non-trivial [14].

This paper contributes to population inference research by (1) proposing a novel least-squares simplification of the binomial likelihood model that results in a faster algorithm, and (2) directly incorporating the prior parameter α that improves estimates without requiring trial-and-error tuning. Specifically, we utilize a two block coordinate descent method [15] to alternately minimize the criterion for P and then for Q. We adapt a fast non-negative least-squares algorithm [16] to additionally include a sum-to-one constraint for Q and an upper-bound for P. We show that the expected value for the estimates of P (or Q) across all possible genotype datasets are equal to the true values when Q (or P) are known and that the variance of this estimate approaches zero as the problem size increases. Compared to Admixture, the least-squares approach provides a slightly worse estimate of P or Q when the other is known. However, when estimating P and Q from only the genotype data, the least-squares approach sometimes provides better estimates, particularly with a large number of populations, small number of samples, or more admixed individuals. The least-squares approximation provides a simpler and faster algorithm, and we provide it as Matlab scripts on our website.

Results

First, we motivate a least-squares simplification of the binomial likelihood model by deriving the expected value and covariance of the least-squares estimate across all possible genotype matrices for partially solved problems. Second, we compare least-squares to sequential quadratic programming (Admixture’s optimization algorithm) for these cases. Third, we compare Admixture, FRAPPE, and least-squares using simulated datasets with a factorial design varying dataset properties in G. Fourth, we compare Admixture and least-squares using real population allele frequencies from the HapMap Phase 3 project. Finally, we compare the results of applying Admixture and least-squares to real data from the HapMap Phase 3 project where the true population structure is unknown.

The algorithms we discuss accept as input the number of populations, K, and the genotypes, g li ∈{0,1,2}, representing the number of copies of the reference allele at locus l for individual i. Then, the algorithms attempt to infer the population allele frequencies, p lk = [0,1], for locus l and population k, as well as the individual admixture proportions, q ki = [0,1] where ∑ k q ki = 1. In all cases, 1 ≤ l ≤ M, 1 ≤ i ≤ N, and 1 ≤ k ≤ K. Table 1 summarizes the matrix notation.

Table 1 Matrix notation

Empirical estimate and upper bound on total variance

To validate our derived bounds on the total variance (Equations 13, 17, 18 and 19), we generate simulated genotypes from a known target for p = [0.1, 0.7]T. We simulate N individual genotypes using the full matrix Q with each column drawn from a Dirichlet distribution with shape parameter α. We repeat the experiment 10000 times producing an independent and identically distributed genotype each time. Each trial produces one estimate for p. We then compute the mean and covariance of the estimates of p and compare them to those predicted in the bounds. For α = 1 and N = 100,

mean ( p ^ ) = 0.0999 0.7002 cov ( p ^ ) = 0.0027 − 0.0015 − 0.0015 0.0046 trace ( cov p ^ ) = 0.0073
(1)

The bound using the sample covariance of q in Equation 13 provides the following:

Q Q T = 36.62 16.20 16.20 30.99 trace ( cov p ^ ) ≤ 0.0097
(2)

The bound using the properties of the Dirichlet distribution in Equation 17 provides a bound of 0.01. As the number of samples increases, the difference between the bound and the asymptotic bound for the Dirichlet distributed q will approach zero.

Figure 1 plots the total variance (trace of the covariance) matrix for a variety of values for N and α using the same target value for p. Because the expected value of the estimate is equal to the true value of p, the total variance is analogous to the sum of the squared error (SSE) between the true p and its estimate. Clearly, the total variance decreases with N. For N = 10000, the root mean squared error falls below 1%.

Figure 1
figure 1

Bound on total variance. Solid and dashed lines correspond to the empirical estimate of the total variance and the upper bound for total variance, respectively.

Intuitively, the error in the least-squares estimate for P and Q decreases as the number of individuals and the number of loci increases, respectively. Figure 1 supports this notion, suggesting that on very large problems for which the gradient based and expectation maximization algorithms were designed, the error in the least-squares estimate approaches zero.

Comparing least-squares approximation to binomial likelihood model

Given estimates of the population allele frequencies, early research focused on estimating the individual admixture [17]. We also note that the number of iterations and convergence properties confound the comparison of iterative algorithms. To avoid these problems and emulate a practical research scenario, we compare least-squares to sequential quadratic programming (used in Admixture) when P or Q are known a priori. In this scenario, each algorithm converges in exactly one step making it possible to compare the underlying updates for P and Q independently. For N = 100, 1000, and 10000; and α = 0.1, 1, and 2; we consider a grid of two-dimensional points for p, where p i = {0.05, 0.15, …, 0.95}. For each trial, we first generate a random Q such that every column is drawn from a Dirichlet distribution with shape parameter, α. Then, we randomly generate a genotype using Equation 11. We compute the least-squares solution using Equation 27 and use Matlab’s built-in function ‘fmincon’ to minimize the negative of the log-likelihood in Equation 7, similar to Admixture’s approach. We repeat the process for 1000 trials and aggregate the results.

Figure 2 illustrates the root mean squared error in estimating p given the true value of Q. Both algorithms present the same pattern of performance as a function of p = [p1, p2]. Values of p near 0.5 present the most difficult scenarios. Positively correlated values (e.g., p1 = p2) present slightly less error than negatively correlated values (e.g., p1 = 1 - p2). Table 2 summarizes the performance over all values of p for varying N and α. In all cases, fmincon performs slightly better than least-squares and both algorithms approach zero error as N increases. We repeat this analysis for known values for P and estimate q using the two approaches. Figure 3 illustrates the difference in performance for the two algorithms as we vary q1 between 0.05 and 0.95 with q2 =1 - q1. Again, fmincon performs slightly better in all cases but both approach zero as M increases. In the next section we show that the additional error introduced by the least-squares approximation to the objective function remains small relative to the error introduced by the characteristics of the genotype data.

Figure 2
figure 2

Precision of best-case scenario for estimating P. Root mean squared error for different values of p using (a) Admixture’s Sequential Quadratic Programming or (b) the least-squares approximation.

Figure 3
figure 3

Precision of best-case scenario for estimating Q. Solid and dashed lines correspond to Admixture’s Sequential Quadratic Programming optimization and the least-squares approximation, respectively.

Table 2 Root mean squared error in P for known Q and K= 2

Simulated experiments to compare least-squares to Admixture and FRAPPE

In the previous sections, we consider the best-case scenario where the true value of P or Q is known. In a realistic scenario, the algorithms must estimate both P and Q from only the genotype information. Table 3 summarizes the results of a four-way analysis of variance with 2-way interactions among experimental factors. By far the factor with the most impact on performance is the number of individuals, N. The degree of admixture, α, and the number of populations, K, accounts for the second and third most variation, respectively. These three factors and two-way interactions between them account for the vast majority of variation. In particular, the choice of algorithm accounts for less than about 1% of the variation in estimation performance. That is, when estimating population structure from genotype data, the number of samples, the number of populations, and the degree of admixture play a much more important role than the choice between least-squares, Admixture, and FRAPPE and least-squares. However, as shown in Figure 4, when considering the computation time required by the algorithm, the choice of algorithm contributes about 40% of the variation including interactions. Therefore, for the range of population inference problems described in this study, the choice of algorithm plays a very small role in the estimation of P and Q but a larger role in computation time.

Figure 4
figure 4

Computational timing comparison. Box plots show the median (red line) and inter-quartile range (blue box) for computation time on a logarithmic scale using (a) N=1000, α=0.5, and varying K; (b) K=4, α=0.5, and varying N; and (c) K=4, N=1000, and varying α.

Table 3 Sources of variation in root mean squared error

Further exploration reveals that the preferred algorithm depends on K, N, and α. Table 4 lists the root mean squared error for the estimation of Q for all combinations of parameters across n = 50 trials. Out of the 36 scenarios, Admixture, least-squares, and FRAPPE perform significantly better than their peers 13, six, and zero times, respectively; they perform insignificantly worse than the best algorithm 30, 17, and 10 times, respectively. The least-squares algorithm appears to perform well on the more difficult problems with combinations of large K, small N, or large α. Table 5 lists the root mean squared error for estimating P. For N = 100, the algorithms do not perform significantly differently. For N = 10000, all algorithms perform with less than 2.5% root mean squared error (RMSE). In all, Admixture performs significantly better than its peers 11 times out of 36. However, Admixture never performs significantly worse than its peers. Least-squares and FRAPPE perform insignificantly worse than Admixture 17 and 20 times out of 36, respectively. Table 6 summarizes the timing results. Least square converges significantly faster 34 out of 36 times with an insignificant difference for the remaining two scenarios. FRAPPE converges significantly slower in all scenarios. With two exceptions, the least-squares algorithm provides a 1.5- to 5-times speedup.

Table 4 Root mean squared error for Q
Table 5 Root mean squared error for P
Table 6 Computation time

Comparison on admixtures derived from the HapMap3 dataset

Tables 7 and 8 lists the performance and computation time for the least-squares approach and Admixture using a convergence threshold of ε = 1.0e-4 and ε = 1.4e-3, respectively. Each marker in the illustrations represents one individual. A short black line emanating from each marker indicates the offset from the original (correct) position. For all simulations, the least-squares algorithms perform within 0.1% of Admixture for estimating the true population allele frequencies in P. For well-mixed populations in Simulation 1 and 2, the least-squares algorithms perform comparably well or even better than Admixture. However, for less admixed data in Simulations 3 - 6, Admixture provides better estimates of the true population proportions depicted in the scatter plots. In all cases, the least-squares algorithms perform within 1.5% of Admixture and between about 2- and 3-times faster than Admixture.

Table 7 Simulation experiments (1-3) using realistic population allele frequencies from the HapMap phase 3 project
Table 8 Simulation experiments (4-6) using realistic population allele frequencies from the HapMap phase 3 project

The apparent advantage of Admixture involves individuals on the periphery of the unit simplex defining the space of Q. In Table 7, this corresponds to individuals on the boundary of the right triangle defined by the x-axis, y-axis, and y = 1 - x diagonal line. For Simulation 1, the original Q contains very few individuals on the boundary, Admixture estimates far more on the boundary, and the least-squares was closer to the ground truth. For Simulation 2 - 6, the ground truth contains more individuals on the boundary, Admixture correctly estimates these boundary points but the least-squares algorithms predict fewer points on the boundary. Simulation 6 provides the most obvious example where Admixture estimates individuals exactly on the boundary and least-squares contains a jumble of individuals near but not exactly on the line.

Real dataset from the HapMap phase 3 project

Over 20 repeated trials, Admixture converged in an average of 42.1 seconds with standard deviation of 9.1 seconds, and the least-squares approach converged in 33.6 seconds with a standard deviation of 9.8 seconds. Figure 5 illustrates the inferred population proportions for one run. The relative placement of individuals from each known population is qualitatively similar. The two methods differ at extreme points such as those values of q1, q2, or 1 - q1 - q2 that are near zero. The Admixture solution has more individuals on the boundary and the least-squares approach has fewer. Although we cannot estimate the error of these estimates because the real world data has no ground truth, we can compare their results quantitatively. The Admixture and the least-squares solution differed by an average of 1.2% root mean squared difference across the 20 trials. We estimate α = 0.12 from the Admixture solution’s total variance using Equation 31. This roughly corresponds to the simulated experiment with three populations, 100 samples, and a degree of admixture of 0.1. In that case, Admixture and least-squares exhibited a very small root mean squared error of 0.62% and 0.74%, respectively (Table 4).

Figure 5
figure 5

Comparison on HapMap Phase 3 dataset. Inferred population membership proportions using (a) Admixture and (b) least-squares with α=1. Each point represents a different individual among the four populations: ASW, CEU, MEX, and YRI. The axes represent the proportion of each individual’s genome originating from each inferred population. The proportion belonging to the third inferred population is given by q3 = 1 - q1 - q2.

Discussion

This work contributes to the population inference literature by providing a novel simplification of the binomial likelihood model that improves the computational efficiency of discrete admixture inference. This approximation results in an inference algorithm based on minimizing the squared distance between the genotype matrix G and twice the product of the population allele frequencies and individual admixture proportions: 2PQ. This Euclidean distance-based interpretation aligns with previous results employing multivariate statistics. For example, researchers have found success using principal component analysis to reveal and remove stratification [2-4] or even to reveal clusters of individuals in subpopulations [5-7]. Recently, McVean [5] proposed a genealogical interpretation of principal component analysis and uses it to reveal information about migration, geographic isolation, and admixture. In particular, given two populations, individuals cluster along the first principal component. Admixture proportion is the fractional distance between the two population centers. However, these cluster centers must known or inferred in order to estimate ancestral population allele frequencies. The least-squares approach infers these estimates efficiently and directly.

Typically, discrete admixture models employ a binomial likelihood function rather than a Euclidean distance-based one. Pritchard et al. detail one such model and use a slow sampling based approach to infer the admixed ancestral populations for individuals in a sample [9]. Recognizing the performance advantage of maximizing the likelihood rather than sampling the posterior, Tang et al. proposed an expectation maximization algorithm and Alexander et al. [13] proposed a sequential quadratic programming (SQP) approach using the same likelihood function [9]. We take this approach a step further by simplifying the model proposed by Pritchard et al. to introduce a least-squares criterion. By justifying the least-squares simplification, we connect the fast and practical multivariate statistical approaches to the theoretically grounded binomial likelihood model. We validate our approach on a variety of simulated and real datasets.

First, we show that if the true value of P (or Q) is known, the expected value of the least squares solution for Q (or P) across all possible genotype matrices is equal to the true value, and the variance of this estimate decreases with M (or N). In this best-case scenario, we show that SQP provides a slightly better estimate than the least-squares solution for a variety of problem sizes and difficulty. For more common scenarios where the algorithms must estimate P and Q using only the genotype information in G, we show that for particularly difficult problems with small N, large K, or large α, the least-squares approach often performs better than its peers. For about one-third of the parameter sets, Admixture performs significantly better than least-squares and FRAPPE but all algorithms approach zero error as N becomes very large. In addition, the error introduced by the choice of algorithms was relatively small compared to other characteristics of the experiment such as sample size, number of populations, and the degree of admixture in the sample. That is, improving accuracy has more to do with improving the dataset than with selecting the algorithm, suggesting that algorithm selection may depend on other criteria such as its speed. In nearly all cases, the least-squares method computes its solution faster, typically a 1.5- to 5-times faster. At the current problem size involving about 10000 loci, this speed improvement may justify the use of least-squares algorithms. For a single point estimate, researchers may prefer a slightly more accurate algorithm at the cost of seconds or minutes. For researchers testing several values of K and α and using multiple runs to gauge the fitness of each parameter set, or those estimating standard errors [13], the speed improvement could be the difference between hours and days of computation. As the number of loci increase to hundreds of thousands or even millions, speed may be more important. The least-squares approach offers an alternative simpler and faster algorithm for population inference that provides qualitatively similar results.

The key speed advantage of the least-squares approach comes from a single nonnegative least-squares update that minimizes a quadratic criterion for P and then for Q per iteration. Admixture, on the other hand, minimizes several quadratic criteria sequentially as it fits the true binomial model. Although the least-squares algorithm completes each update in less time and is guaranteed to converge to a local minimum or straddle point, predicting the number of iterations to convergence presents a challenge. We provide empirical timing results and note that selecting a suitable stopping criterion for these iterative methods can change the timing and accuracy results. For comparison, we use the same stopping criterion with published thresholds for Admixture and FRAPPE[13], and a threshold of MN×10-10 for least-squares.

This work is motivated in part by the desire to analyze larger genotype datasets. In this paper, we focus on the computational challenges of analyzing very large numbers of markers and individuals. However, linkage disequilibrium introduces correlations between loci that cannot be avoided in very large datasets. Large datasets can be pruned to diminish the correlation between loci. For example, Alexander et al. prune the HapMap phase 3 dataset from millions of SNPs down to around 10000 to avoid correlations. In this study, we assume linkage equilibrium and therefore uncorrelated markers and limit our analysis to datasets less than about 10000 SNPs. Incorporating linkage disequilibrium in gradient-based optimizations of the binomial likelihood model remains an open problem.

Estimating the number of populations K from the admixed samples continues to pose a difficult challenge for clustering algorithms in general and population inference in particular. In practice, experiments can be designed to include individual samples that are expected to be distributed close to their ancestors. For example, Tang et al. [11] suggested using domain knowledge to collect an appropriate number of pseudo-ancestors that reveal allele frequencies of the ancestral populations. The number of groups considered provides a convenient starting point for K. Lacking domain knowledge, computational approaches can be used to try multiple reasonable values for K and evaluating their fitness. For example, Pritchard et al. [9] estimated the posterior distribution of K and select the most probable K. Another approach is to evaluate the consistency of inference for different values of K. If the same value of K leads to very different inferences of P and Q from different random starting points, the inference can be considered inconsistent. Brunet et al. [18] proposed this method of model selection called consensus clustering.

For realistic population allele frequencies, P, from the HapMap Phase 3 dataset and very little admixture in Q, Admixture provides better estimates of Q. The key advantage of Admixture appears to be for individuals containing nearly zero contribution from one or more inferred populations, whereas the least-squares approach performs better when the individuals are well-mixed. Visually, both approaches reveal population structure. Using the two approaches to infer three ancestral populations from four HapMap Phase 3 sampling populations reveals qualitatively similar results.

We believe the computational advantage of the least-squares approach along with its good estimation performance warrants further research especially for very large datasets. For example, we plan to adapt and apply the least-squares approach to datasets utilizing microsatellite data rather than SNPs and consider the case of more than two alleles per locus. Researchers have incorporated geospatial information into sampling-based [19] and PCA-based [8] approaches. Multiple other extensions to sampling-based or PCA-based algorithms have yet to be incorporated into faster gradient-based approaches.

Conclusion

This paper explores the utility of a least-squares approach for the inference of population structure in genotype datasets. Whereas previous Euclidean distance-based approaches received little theoretical justification, we show that a least-squares approach is the result of a first-order approximation of the negative log-likelihood function for the binomial generative model. In addition, we show that the error in this approximation approaches zero as the number of samples (individuals and loci) increases. We compare our algorithm to state-of-the-art algorithms, Admixture and FRAPPE, for optimizing the binomial likelihood model, and show that our approach requires less time and performs comparably well. We provide both quantitative and visual comparisons that illustrate the advantage of Admixture at estimating individuals with little admixture, and show that our approach infers qualitatively similar results. Finally, we incorporate a degree of admixture parameter that improves estimates for known levels of admixture without requiring additional parameter tuning as is the case for Admixture.

Methods

The algorithms we discuss accept the number of populations, K, and an M × N genotype matrix, G as input:

G = g 11 g 12 ⋯ g 1 N g 21 g 22 ⋯ g 2 N ⋮ ⋮ ⋱ ⋮ g M 1 g M 2 ⋯ g MN
(3)

where g li ∈ {0,1,2} representing the number of copies of the reference allele at the l th locus for the i th individual, M is the number of markers (loci), and N is the number of individuals. Given the genotype matrix, G, the algorithms attempt to infer the population allele frequencies and the individual admixture proportions. The matrix P contains the population allele frequencies:

P = p 11 p 12 ⋯ p 1 K p 21 p 22 ⋯ p 2 K ⋮ ⋮ ⋱ ⋮ p M 1 p M 2 ⋯ p MK
(4)

where 0 ≤ p lk ≤ 1 representing the fraction of reference alleles out of all alleles at the l th locus in the k th population. The matrix Q contains the individual admixture proportions:

Q = q 11 q 12 ⋯ q 1 N q 21 q 22 ⋯ q 2 N ⋮ ⋮ ⋱ ⋮ q K 1 q K 2 ⋯ q KN
(5)

where 0 ≤ q ik ≤ 1 represents the fraction of the i th individual’s genome originating from the k th population and for all i, ∑ k q ki = 1. Table 1 summarizes the matrix notation we use.

Likelihood function

Alexander et al. model the genotype (i.e., the number of reference alleles at a particular locus) as the result of two draws from a binomial distribution [13]. In the generative model, each allele copy for one individual at one locus has an equal chance, m li , of receiving the reference allele:

m li = Σ k = 1 K p 1 k q k 1
(6)

The log-likelihood of the parameters P and Q from the original Structure binomial model and ignoring an additive constant is the following [13]:

L M = Σ l = 1 M Σ i = 1 N g li 1 n m li + 2 − g li 1 n 1 − m li
(7)

To see the effect on gradient-based optimization, we also present the derivative of the likelihood with respect to a particular m li :

∂ ∂ m li L M = g li − 2 m li m li 1 − m li ≈ 4 g li − 2 m li
(8)

In order to achieve a least-squares criterion, we must approximate this derivative with a line. Figure 6 plots this derivative with respect to m li for the three possible values of g li (0, 1, or 2). To avoid biasing the approximation to high or low values of m li , we approximate the derivative with its first-order Taylor approximation in the neighborhood of m li = 1/2. More complex optimizations might update the neighborhood of the Taylor approximation during the optimization. In the interest of simplicity, we select one neighborhood for all iterations, genotypes, individuals, and loci. The following least-squares objective function has the approximated derivative in the above equation:

− L M ≈ Σ l = 1 L Σ i = 1 N 2 m li − g li 2 = 2 M − G 2 2
(9)
Figure 6
figure 6

First-order approximation for slope of log-likelihood of m . Solid and dashed lines correspond to the true and approximated slope, respectively. The red, green, and blue lines correspond to g = 0, g = 1, and g = 2, respectively.

The right-hand-side of Equation 9 provides the least-squares criterion. Figure 6 shows the deviation between the linear approximation and the true slope. Values match closely for 0.35 ≤ m li ≤ 0.65 but as m li approaches zero or one the true slope diverges for two of the three genotypes. Therefore, we have the following least-squares optimization problem:

arg min P , Q 2 PQ − G 2 2 , such that { 0 ≤ P ≤ 1 Q ≥ 0 Σ k = 1 K q ki = 1
(10)

Bounded error for the least-squares approach

We justify the least-squares approach by showing that the expected value across all genotypes is equal to the true value in the binomial likelihood model, and that the covariance approaches zero as the size of the data increases. In order to analyze the least squares performance across all possible genotype matrices, we consider the generative model for G. Given the true ancestral population allele frequencies, P, and the proportion of each individual’s alleles originating from each population, Q, the genotype at locus l for individual i is a binomial random variable, g li :

g li ∼ Binomial ( 2 , m li ) m li = Σ k = 1 K p 1 k q ki
(11)

If M was directly observable, we could solve for P or Q given the other using P = MQ# or Q = P#M, where # is the Moore-Penrose pseudo-inverse. However, we only observe the elements of G which is only partially informative of M. First we consider the uncertainty in estimating P. Each g li is an independent random variable with the following mean and bound on the variance:

E [ g li ] = 2 m li var [ g li ] 1 2
(12)

Mean and total variance of the estimate of p

For ease of notation, we focus on one locus at index l in one row of P,

p ^ = p ^ l 1 , p ^ 12 , … , p ^ 1 K T

, one row of G, g = [gl 1, g12, …, g1N]T, and estimate the mean, covariance, and provide a bound on the total variance of its estimate:

p ^ = 1 2 Q T g E [ p ^ ] = p cov [ p ^ ] = 1 4 Q T cov [ g ] Q t r a c e ( cov [ p ^ ] ) ≤ 1 8 t r a c e Q Q T − 1
(13)

Intuitively, QQT scales linearly with N and we expect the bound on the trace to decrease linearly with N. If the columns, q, of Q are independent and identically distributed, QQT approaches N×E[qqT], resulting in a bound that decreases linearly with N:

trace cov p ^ ≤ 1 8 N trace E q q T − 1
(14)

To put this bound in more familiar terms we consider q drawn from a Dirichlet distribution with shape parameter α, resulting in the following:

E q q T = 1 4 α + 2 α + 1 α α α + 1
(15)

Asymptotically, QQT approaches N×E[qqT] and (QQT)-1 approaches:

2 N α + 1 − α − α α + 1
(16)

resulting in the following asymptotic bound on the total variance:

trace cov p ^ ≤ 1 4 N α + 2
(17)

Mean and total variance of the estimate for q

The same analysis can be repeated for one individual at index i in one column of Q,

q ^ = q ^ li , q ^ 2 i , … , q ^ ki T

and one column of G, g = [g li , g2i, …, g Li ]T:

q ^ = 1 2 P g E [ q ^ ] = q c o v [ q ^ ] = 1 4 P cov [ g ] P T trace ( cov [ q ^ ] ) ≤ 1 8 trace P T P − 1
(18)

Intuitively, PTP increases linearly with M, and we expect the bound on the total variance to decrease linearly with M. Similarly, if the rows, p, of P are independent and identically distributed, PTP approaches M×E[ppT], resulting in an asymptotic bound that decreases linearly with M:

trace cov q ^ ≤ 1 8 M trace E p T p − 1
(19)

Incorporating degree of admixture, α

Pritchard et al. [13] use a prior distribution to bias their solution toward those with a desired level of admixture. This prior on the columns of Q takes the form of a Dirichlet distribution:

q ∼ D α , α , … , α
(20)

Because all the shape parameters (α) are equal, this prior assumes that all ancestral populations are equally represented in the current sample. The log of this prior probability is the following ignoring an additive constant:

In P q = α − 1 Σ k = 1 K ln q k , where q k = 1 − Σ k = 1 K − 1 q k
(21)

The derivative of the log prior with respect to q and its first-order approximation at the mean of q k = 1/K is the following:

∂ ∂ q k ln P q = − α − 1 q k − q K q k q K ≈ − 2 K 2 α − 1 q k − 1 K
(22)

The following penalty function combines the columns of Q into a single negative log-likelihood function with the approximated derivative in the above equation:

− ln p Q ≈ K 2 α − 1 Σ i = l N Σ k = 1 K q ki − 1 K 2 = K 2 α − 1 Q − 1 K 2 2
(23)

The right-hand-side of Equation 23 acts as a penalty term for the least-squares criterion in Equation 9. Figure 7 shows the difference between the real and approximated slope. For q near its mean of 1/K, the approximation fits closely but for extreme values of q the true slope diverges. Combining the terms in Equations 9 and 23 and including problem constraints, we have the following least-squares optimization problem:

arg min P , Q 2 PQ − G 2 2 + K 2 α − 1 Q − 1 K 2 2 , such that 0 ≤ P ≤ 1 Q ≥ 0 Σ k = 1 K q ki = 1
(24)
Figure 7
figure 7

First-order approximation for slope of log-likelihood of q . Solid and dashed lines correspond to the true and approximated slope, respectively, for K = 2. The blue, green, red, and orange lines correspond to α = 0.1, α = 0.5, α = 1, and α = 2, respectively.

Optimization algorithm

The non-convex optimization problem in Equation 10 can be approached as a two-block coordinate descent problem [15, 20]. We initialize Q with nonnegative values such that each column sums to one. Then, we alternate between minimizing the criterion function with respect to P with fixed Q:

arg min 0 ≤ P ≤ 2 PQ − G 2 2
(25)

and then minimizing with respect to Q with fixed P:

arg min Q ≥ 0 Σ k = 1 K q ki = 1 2 PQ − G 2 2 + K 2 α − 1 Q − 1 K 2 2
(26)

This process is repeated until the change in the criterion function is less than ε at which point we consider the algorithm to have converged. The Admixture algorithm suggests a threshold of ε = 1e-4 but we have found that a larger threshold often suffices. Unless otherwise stated, we use a threshold that depends on the size of the problem: ε = MN×10-10, corresponding to 1e-4 when M = 10000 and N = 100.

Least-squares solution for P

Van Benthem and Keenan [16] propose a fast nonnegatively constrained active/passive set algorithm that avoids redundant calculations for problems with multiple right-hand-sides. Without considering the constraints on P, Equation 25 can be classically solved using the pseudo-inverse of Q:

P ^ = 1 2 G Q T Q Q T − 1
(27)

However, some of the elements of P may be less than zero. In the active/passive set approach, if elements of P are negative, they are clamped at zero and added to the active set. The unconstrained solution is then applied to the remaining passive elements of P. If the solution happens to be nonnegative, the algorithm finishes. If not, negative elements are added to the active set and elements in the active set with a negative gradient (will decrease the criterion by increasing) are added back to the passive set. The process is repeated until the passive set is non-negative and the active set contains only elements with a positive gradient at zero. We extend the approach of Van Benthem and Keenan to include an upper bound at one. Therefore, we maintain two active sets: those clamped at zero and those clamped at one and update both after the unconstrained optimization of the passive set at each iteration. We provide Matlab source code that implements this algorithm on our website.

Least-squares solution for Q

When solving for Q it is convenient to reformulate Equation 26 into simpler terms:

arg min Q ≥ 0 Σ k = 1 K q ki = 1 P ¯ Q − G ¯ 2 2 P ¯ = 2 P K α − 1 1 2 I K G ¯ = G α − 1 1 2 1 KxN
(28)

The unconstrained solution for this equation is the following:

Q ^ = 4 P T P + K 2 α − 1 I − 1 2 P T G + K α − 1 = P ¯ T P ¯ − 1 P ¯ T G ¯
(29)

When prior information is known about the sparseness, we use α in the equations above. When no prior information is known, we use α = 1 corresponding to the uninformative prior and resulting in the ordinary pseudo-inverse solution. In order to incorporate the sum-to-one constraint on the columns of Q, we employ the method of Lagrange multipliers using Equation 11 in the work of Settle and Drake substituting the identity matrix for the noise matrix, N[21]. For completeness, we include the solution below:

Q = a U j + ( U − a U J U ) P ¯ T G ¯ U = P ¯ T P ¯ − 1 a = Σ i = 1 K Σ j = 1 K u ij − 1 j = 1 , 1 , … , 1 T J = j j T
(30)

As before, some elements of Q may be negative. In that case, we utilize the active set method to clamp elements of Q at zero and update active and passive sets at each iteration until convergence as described above. We adapt the Matlab script by Van Benthem and Keenan so that the unconstrained solution uses Equation 30 instead of the standard pseudo-inverse and provide it on our website.

Simulated experiments to compare the proposed approach to Admixture and FRAPPE

We generate simulated genotype data for a variety of problems using M = 10000 markers, and varying N between 100, 1000, and 10000; K between 2, 3, and 4; and α between 0.1, 0.5, 1, and 2, for a total of 36 parameter sets. For each combination of N, K, and α, we generate the ground truth P from a uniform distribution, and Q from a Dirichlet distribution parameterized by α. Then, we draw a random genotype for each individual using the binomial distribution in Equation 11. We estimate P and Q using only the genotype information and the true number of populations, K. We repeat the experiment 50 times drawing new, P, Q, and G matrices each time. Finally, we record the performance of Admixture using the published tight convergence threshold of ε = 1e-4[13] and a loose convergence threshold of ε = MN×10-4; the least-squares algorithm using an uninformative prior (α = 1) and ε = MN×10-4, and the FRAPPE EM algorithm using the published threshold of ε = 1. For reference, we also include the least-squares algorithm with informative prior (known α) with convergence threshold of ε = MN×10-4. In all experiments, Admixture’s performances with the two convergence thresholds were nearly identical and we only report the results for ε = MN×10-4, resulting in shorter computation times. We used a four-way analysis of variance (ANOVA) with a fixed effects model to reveal which factors (including algorithm) contribute more or less to the estimation error and computation time.

Statistical significance of root mean squared error and computation time

For each combination of K, N, and α, we perform a Kruskal-Wallis test to determine if Admixture, Least-Squares, and FRAPPE perform significantly differently at a Bonferroni adjusted significance level of 0.05/(36 parameter sets) = 0.0014. If there is no significant difference, we consider their performances equal. If there is a significant difference, we perform pair-wise Mann-Whitney U-tests to determine significant differences between specific algorithms. We use a Bonferroni adjusted significance level of 0.05/(36 parameter sets)/(3 pair-wise comparisons) = 4.6e-4. The ‘Summary’ columns contain the order of performance among the algorithms such that every algorithm to the left of a ‘<’ symbol performs better than every algorithm to the right. An ‘=’ symbol indicates that the adjacent algorithms do not perform significantly differently.

Comparison on admixtures derived from the HapMap3 dataset

In the original Admixture paper [13], the authors simulate admixed genotypes from population allele frequencies derived from the HapMap Phase 3 dataset [22]. We follow their example to compare the algorithms with more realistic population allele frequencies. Rather than drawing P from a uniform distribution, we estimate the population allele frequencies for unrelated individuals in the HapMap Phase 3 dataset using individuals from the following groups: Han Chinese in Beijing, China (CHB), Utah residents with ancestry from Northern and Western Europe (CEU), and Yoruba individuals in Ibadan, Nigeria (YRI) [22]. We use the same 13928 SNPs provided in the sample data on the Admixture webpage [23]. We randomly simulate 1000 admixed individuals: q ~ Dirichlet(α 1 , α 2 , α 3 ). When the Dirichlet parameters are not equal, we use the degree of admixture, α, for LSα that results in the same total variance as the combination of α 1 , α 2 , and α 3 :

α = K − 1 K 2 v − 1 K , where the total variance , v = Σ k = 1 K α k α 0 − α k α 0 2 α 0 + 1 , and α 0 = Σ k = 1 K α k
(31)

Real dataset from the HapMap phase 3 project

In the original Admixture paper [13], the authors use Admixture to infer three hypothetical ancestral populations from four known populations in the HapMap Phase 3 dataset, including individuals with African ancestry in the American Southwest (ASW), individuals with Mexican ancestry in Los Angeles (MEX), and the same CEU CEU and YRI individuals from the previous example. We ran each algorithm 20 times on the dataset using a convergence threshold of ε = 1e-4, recording the convergence times for each trial.

References

  1. Beaumont M, Barratt EM, Gottelli D, Kitchener AC, Daniels MJ, Pritchard JK, Bruford MW: Genetic diversity and introgression in the Scottish wildcat. Mol Ecol 2001, 10: 319-336. 10.1046/j.1365-294x.2001.01196.x

    Article  CAS  PubMed  Google Scholar 

  2. Novembre J, Ramachandran S: Perspectives on human population structure at the cusp of the sequencing era. Annu Rev Genomics Hum Genet 2011., 12:

    Google Scholar 

  3. Menozzi P, Piazza A, Cavalli-Sforza L: Synthetic maps of human gene frequencies in Europeans. Science 1978, 201: 786-792. 10.1126/science.356262

    Article  CAS  PubMed  Google Scholar 

  4. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 2006, 38: 904-909. 10.1038/ng1847

    Article  CAS  PubMed  Google Scholar 

  5. McVean G: A genealogical interpretation of principal components analysis. PLoS Genet 2009, 5: e1000686. 10.1371/journal.pgen.1000686

    Article  PubMed Central  PubMed  Google Scholar 

  6. Patterson N, Price AL, Reich D: Population structure and eigenanalysis. PLoS Genet 2006, 2: e190. 10.1371/journal.pgen.0020190

    Article  PubMed Central  PubMed  Google Scholar 

  7. Lee C, Abdool A, Huang CH: PCA-based population structure inference with generic clustering algorithms. BMC Bioinforma 2009., 10:

    Google Scholar 

  8. Novembre J, Stephens M: Interpreting principal component analyses of spatial population genetic variation. Nat Genet 2008, 40: 646-649. 10.1038/ng.139

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  9. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 2000, 155: 945-959.

    PubMed Central  CAS  PubMed  Google Scholar 

  10. Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data linked loci and correlated allele frequencies. Genetics 2003, 164: 1567-1587.

    PubMed Central  CAS  PubMed  Google Scholar 

  11. Tang H, Peng J, Wang P, Risch NJ: Estimation of individual admixture: Analytical and study design considerations. Genet Epidemiol 2005, 28: 289-301. 10.1002/gepi.20064

    Article  PubMed  Google Scholar 

  12. Wu B, Liu N, Zhao H: PSMIX: an R package for population structure inference via maximum likelihood method. BMC Bioinforma 2006, 7: 317. 10.1186/1471-2105-7-317

    Article  Google Scholar 

  13. Alexander DH, Novembre J, Lange K: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 2009, 19: 1655. 10.1101/gr.094052.109

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  14. Alexander D, Lange K: Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinforma 2011, 12: 246. 10.1186/1471-2105-12-246

    Article  Google Scholar 

  15. Kim H, Park H: Non-negative matrix factorization based on alternating non-negativity constrained least squares and active set method. SIAM Journal in Matrix Analysis and Applications 2008, 30: 713-730. 10.1137/07069239X

    Article  Google Scholar 

  16. Van Benthem MH, Keenan MR: Fast algorithm for the solution of large-scale non-negativity-constrained least squares problems. J Chemom 2004, 18: 441-450. 10.1002/cem.889

    Article  CAS  Google Scholar 

  17. Hanis CL, Chakraborty R, Ferrell RE, Schull WJ: Individual admixture estimates: disease associations and individual risk of diabetes and gallbladder disease among Mexican Americans in Starr County, Texas. Am J Phys Anthropol 1986, 70: 433-441. 10.1002/ajpa.1330700404

    Article  CAS  PubMed  Google Scholar 

  18. Brunet JP, Tamayo P, Golub TR, Mesirov JP: Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A 2004, 101: 4164. 10.1073/pnas.0308531101

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  19. Guillot G, Estoup A, Mortier F, Cosson JF: A spatial statistical model for landscape genetics. Genetics 2005, 170: 1261-1280. 10.1534/genetics.104.033803

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  20. Bertsekas DP: Nonlinear programming. Belmont, Mass.: Athena Scientific; 1995.

    Google Scholar 

  21. Settle JJ, Drake NA: Linear mixing and the estimation of ground cover proportions. Int J Remote Sens 1993, 14: 1159-1177. 10.1080/01431169308904402

    Article  Google Scholar 

  22. Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P, Gibbs RA, Belmont JW, Boudreau A, Leal SM: A haplotype map of the human genome. Nature 2005, 437: 1299-1320. 10.1038/nature04226

    Article  Google Scholar 

  23. ADMIXTURE: fast ancestry estimation. [http://www.genetics.ucla.edu/software/admixture/download.html] []

Download references

Acknowledgements

This work was supported in part by grants from Microsoft Research, National Institutes of Health (Bioengineering Research Partnership R01CA108468, P20GM072069, Center for Cancer Nanotechnology Excellence U54CA119338, and 1RC2CA148265), and Georgia Cancer Coalition (Distinguished Cancer Scholar Award to Professor M. D. Wang).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to May D Wang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

RMP conceived of the least-squares approach to inferring population structure, designed the study, and drafted the document. MDW initiated the SNP data analysis project, acquired funding to sponsor this effort, and directed the project and publication. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Parry, R.M., Wang, M.D. A fast least-squares algorithm for population inference. BMC Bioinformatics 14, 28 (2013). https://0-doi-org.brum.beds.ac.uk/10.1186/1471-2105-14-28

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/1471-2105-14-28

Keywords