In this section, we propose a novel approach called PPLasso (Predictive Prognostic Lasso), which casts the identification of predictive and prognostic biomarkers as a variable selection problem in an ANCOVA (Analysis of Covariance) type model, mentioned for instance in [12].
Statistical modeling
Let \({\textbf{y}}\) be a continuous response or endpoint and \(t_1\), \(t_2\) two treatments. Let also \({\textbf{X}}_{1}\) (resp. \({\textbf{X}}_{2}\)) denote the design matrix for the \(n_1\) (resp. \(n_2\)) patients with treatment \(t_1\) (resp. \(t_{2}\)), each containing measurements on p candidate biomarkers:
$$\begin{aligned} {\textbf{X}}_1= \begin{bmatrix} X_{11}^{1} & X_{11}^{2} & \ldots & X_{11}^{p} \\ X_{12}^{1} & X_{12}^{2} & \ldots & X_{12}^{p} \\ \vdots & \vdots & & \vdots \\ X_{1n_{1}}^{1} & X_{1n_{1}}^{2} & \ldots & X_{1n_{1}}^{p} \end{bmatrix},\quad {\textbf{X}}_2=\begin{bmatrix} X_{21}^{1} & X_{21}^{2} & \ldots & X_{21}^{p} \\ X_{22}^{1} & X_{22}^{2} & \ldots & X_{22}^{p} \\ \vdots & \vdots & & \vdots \\ X_{2n_{2}}^{1} & X_{2n_{2}}^{2} & \ldots & X_{2n_{2}}^{p} \end{bmatrix}. \end{aligned}$$
(1)
To take into account the correlation that may exist between the biomarkers in the different treatment groups, we shall assume that the rows of \({\textbf{X}}_{1}\) (resp. \({\textbf{X}}_{2}\)) are independent centered Gaussian random vectors with covariance matrix \(\varvec{\Sigma }_1\) (resp. \(\varvec{\Sigma }_2\)).
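To fix ideas, the following sketch simulates such design matrices in R. The dimensions \(n_1\), \(n_2\), \(p\) and the AR(1)-type covariance structure are illustrative assumptions, not values taken from this paper.

```r
library(MASS)  # for mvrnorm

set.seed(1)
n1 <- 50; n2 <- 50; p <- 10

# Assumed AR(1)-type covariance matrix, identical in both arms for simplicity
Sigma <- outer(1:p, 1:p, function(i, j) 0.5^abs(i - j))

# Rows of X1 and X2: independent centered Gaussian random vectors
X1 <- mvrnorm(n1, mu = rep(0, p), Sigma = Sigma)
X2 <- mvrnorm(n2, mu = rep(0, p), Sigma = Sigma)
```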
To model the link between \({\textbf{y}}\) and the different types of biomarkers, we propose using the following model:
$$\begin{aligned} {\textbf{y}}=\begin{pmatrix} y_{11} \\ y_{12}\\ \vdots \\ y_{1n_{1}}\\ y_{21} \\ y_{22}\\ \vdots \\ y_{2n_{2}} \end{pmatrix} ={\textbf{X}}\begin{pmatrix} \alpha _1 \\ \alpha _2 \\ \beta _{11} \\ \beta _{12}\\ \vdots \\ \beta _{1p} \\ \beta _{21} \\ \beta _{22}\\ \vdots \\ \beta _{2p} \end{pmatrix} + \begin{pmatrix} \epsilon _{11} \\ \epsilon _{12}\\ \vdots \\ \epsilon _{1n_{1}}\\ \epsilon _{21} \\ \epsilon _{22}\\ \vdots \\ \epsilon _{2n_{2}} \end{pmatrix}, \end{aligned}$$
(2)
where \((y_{i1},\dots ,y_{in_i})\) corresponds to the response of patients with treatment \(t_i\), i being equal to 1 or 2,
$$\begin{aligned} {\textbf{X}}=\begin{bmatrix} 1 & 0 & X_{11}^{1} & X_{11}^{2} & \ldots & X_{11}^{p} & 0 & 0 & \ldots & 0 \\ 1 & 0 & X_{12}^{1} & X_{12}^{2} & \ldots & X_{12}^{p} & 0 & 0 & \ldots & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & X_{1n_{1}}^{1} & X_{1n_{1}}^{2} & \ldots & X_{1n_{1}}^{p} & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & 0 & \ldots & 0 & X_{21}^{1} & X_{21}^{2} & \ldots & X_{21}^{p} \\ 0 & 1 & 0 & 0 & \ldots & 0 & X_{22}^{1} & X_{22}^{2} & \ldots & X_{22}^{p} \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & 1 & 0 & 0 & \ldots & 0 & X_{2n_{2}}^{1} & X_{2n_{2}}^{2} & \ldots & X_{2n_{2}}^{p} \end{bmatrix}, \end{aligned}$$
with \(\alpha _1\) (resp. \(\alpha _2\)) corresponding to the effect of treatment \(t_1\) (resp. \(t_2\)). Moreover, \(\varvec{\beta }_1=(\beta _{11}, \beta _{12}, \ldots , \beta _{1p})'\) (resp. \(\varvec{\beta }_2=(\beta _{21}, \beta _{22}, \ldots , \beta _{2p})'\)) are the coefficients associated with each of the p biomarkers in the treatment \(t_1\) (resp. \(t_2\)) group, \('\) denoting matrix transposition, and \(\epsilon _{11},\dots ,\epsilon _{2n_2}\) are independent standard Gaussian random variables, independent of \({\textbf{X}}_{1}\) and \({\textbf{X}}_{2}\). When \(t_1\) stands for the standard treatment or placebo, prognostic biomarkers are defined as those having non-zero coefficients in \(\varvec{\beta }_{1}\): according to their definition, the effect of prognostic biomarkers should indeed be demonstrated in the absence of therapy or with a standard therapy that patients are likely to receive. On the other hand, predictive biomarkers are defined as those having non-zero coefficients in \(\varvec{\beta }_{2}-\varvec{\beta }_{1}\), since they capture a differential effect between the two treatments.
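Continuing the simulated example above, the design matrix of Model (2) can be assembled as follows; the coefficient values are purely illustrative assumptions (biomarkers 1 and 2 are prognostic, biomarker 3 is predictive).

```r
# Block design matrix of Model (2): treatment indicator columns followed by
# the biomarker measurements of each arm
X <- rbind(
  cbind(1, 0, X1, matrix(0, n1, p)),
  cbind(0, 1, matrix(0, n2, p), X2)
)

alpha <- c(1, 2)                    # assumed treatment effects
beta1 <- c(1, 1, rep(0, p - 2))     # prognostic: non-zero in beta1
beta2 <- c(1, 1, 1, rep(0, p - 3))  # predictive: beta2 - beta1 non-zero for biomarker 3
gamma <- c(alpha, beta1, beta2)

y <- as.vector(X %*% gamma) + rnorm(n1 + n2)  # standard Gaussian noise
```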
Model (2) can be written as:
$$\begin{aligned} {\textbf{y}}={\textbf{X}}\varvec{\gamma }+\varvec{\epsilon }, \end{aligned}$$
(3)
with \(\varvec{\gamma }=(\alpha _1,\alpha _2, \varvec{\beta }_{1}', \varvec{\beta }_{2}')'\). The Lasso is a well-known approach for estimating coefficients under a sparsity-enforcing constraint, which performs variable selection by setting some coefficients exactly to zero. It consists in minimizing the following penalized least-squares criterion [30]:
$$\begin{aligned} \frac{1}{2}\left\Vert {\textbf{y}}-{\textbf{X}}\varvec{\gamma }\right\Vert _{2}^{2}+\lambda \left\Vert \varvec{\gamma }\right\Vert _{1}, \end{aligned}$$
(4)
where \(\left\Vert {{\textbf {u}}}\right\Vert _{2}^2=\sum _{i=1}^n u_i^2\) and \(\left\Vert {{\textbf {u}}}\right\Vert _{1}=\sum _{i=1}^n |u_i|\) for \({{\textbf {u}}}=(u_1,\dots ,u_n)\). To allow different sparsity levels for \(\varvec{\beta }_1\) and \(\varvec{\beta }_{2}-\varvec{\beta }_{1}\), we apply a separate sparsity constraint to each of them. Hence, we propose replacing the penalty \(\lambda \left\Vert \varvec{\gamma }\right\Vert _{1}\) in (4) by
$$\begin{aligned} \lambda _{1}\left\Vert \varvec{\beta }_{1}\right\Vert _{1}+\lambda _{2}\left\Vert \varvec{\beta }_{2}-\varvec{\beta }_{1}\right\Vert _{1}. \end{aligned}$$
(5)
Thus, a first estimator of \(\varvec{\gamma }\) could be found by minimizing the following criterion with respect to \(\varvec{\gamma }\):
$$\begin{aligned} \frac{1}{2}\left\Vert {\textbf{y}}-{\textbf{X}}\varvec{\gamma }\right\Vert _{2}^{2}+\lambda _{1}\left\Vert \begin{bmatrix} {\textbf{0}}_{p,1} & {\textbf{0}}_{p,1} & D_{1}\\ {\textbf{0}}_{p,1} & {\textbf{0}}_{p,1} & \frac{\lambda _{2}}{\lambda _{1}}D_{2}\end{bmatrix}\varvec{\gamma }\right\Vert _{1}, \end{aligned}$$
(6)
where \(D_1=[{\text {Id}}_{p}, {\textbf{0}}_{p,p}]\) and \(D_2=[-{\text {Id}}_{p},{\text {Id}}_{p}]\), with \({\text {Id}}_{p}\) denoting the identity matrix of size p and \({\textbf{0}}_{i,j}\) denoting a matrix with i rows and j columns containing only zeros. However, since the inconsistency of Lasso biomarker selection originates from the correlations between the biomarkers, we propose to remove these correlations by “whitening” the matrix \({\textbf{X}}\). More precisely, we consider \({\widetilde{{\textbf{X}}}}={\textbf{X}}\varvec{\Sigma }^{-1/2}\), where
$$\begin{aligned} \varvec{\Sigma }= \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \\ 0 & 0 & \varvec{\Sigma }_{1} & 0 \\ 0 & 0 & 0 & \varvec{\Sigma }_{2} \end{bmatrix} \end{aligned}$$
(7)
and define \(\varvec{\Sigma }^{-1/2}\) by replacing \(\varvec{\Sigma }_{i}\) in (7) by \(\varvec{\Sigma }_{i}^{-1/2}\), where \(\varvec{\Sigma }_{i}^{-1/2}={\textbf{U}}_{i}{\textbf{D}}_{i}^{-1/2}{\textbf{U}}_{i}'\), \({\textbf{U}}_{i}\) and \({\textbf{D}}_{i}\) being the matrices involved in the spectral decomposition of \(\varvec{\Sigma }_{i}\) for \(i=1\) or 2. With such a transformation, the columns of \({\widetilde{{\textbf{X}}}}\) are decorrelated and Model (3) can be rewritten as follows:
$$\begin{aligned} {\textbf{y}}= {\widetilde{{\textbf{X}}}}{\widetilde{\varvec{\gamma }}}+\varvec{\epsilon }\end{aligned}$$
(8)
where \({\widetilde{\varvec{\gamma }}}=\varvec{\Sigma }^{1/2}\varvec{\gamma }\). The objective function (6) thus becomes:
$$\begin{aligned} L_{\lambda _1,\lambda _2}^{\text {PPLasso}}({\widetilde{\varvec{\gamma }}})=\frac{1}{2}\left\Vert {\textbf{y}}-{\widetilde{{\textbf{X}}}}{\widetilde{\varvec{\gamma }}}\right\Vert _{2}^{2}+\lambda _{1}\left\Vert \begin{bmatrix} {\textbf{0}}_{p,1} & {\textbf{0}}_{p,1} & D_{1}\\ {\textbf{0}}_{p,1} & {\textbf{0}}_{p,1} & \frac{\lambda _{2}}{\lambda _{1}}D_{2}\end{bmatrix}\varvec{\Sigma }^{-1/2}{\widetilde{\varvec{\gamma }}}\right\Vert _{1}. \end{aligned}$$
(9)
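In R, the whitening step can be sketched as follows, continuing the example above and using the true \(\varvec{\Sigma }\) of the simulation (in practice, an estimate is plugged in; see “Estimation of \(\varvec{\Sigma }_1\) and \(\varvec{\Sigma }_2\)” section).

```r
# Sigma_i^{-1/2} via the spectral decomposition of one covariance block
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

# Penalty matrices of Criterion (6): D1 = [Id_p, 0_{p,p}], D2 = [-Id_p, Id_p]
D1 <- cbind(diag(p), matrix(0, p, p))
D2 <- cbind(-diag(p), diag(p))

# Block-diagonal Sigma^{-1/2} of (7) (here Sigma_1 = Sigma_2 = Sigma),
# then the whitened design matrix X_tilde = X Sigma^{-1/2}
Sigma_inv_sqrt <- as.matrix(
  Matrix::bdiag(diag(2), inv_sqrt(Sigma), inv_sqrt(Sigma))
)
X_tilde <- X %*% Sigma_inv_sqrt
```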
Estimation of \({\widetilde{\varvec{\gamma }}}\)
Let us define a first estimator of \({\widetilde{\varvec{\gamma }}}=({\widetilde{\alpha }}_1,{\widetilde{\alpha }}_2,{\widetilde{\varvec{\beta }}}_1',{\widetilde{\varvec{\beta }}}_2')'\) as follows:
$$\begin{aligned} \widehat{{\widetilde{\varvec{\gamma }}}}_{0}(\lambda _1,\lambda _2)=(\widehat{{\widetilde{\alpha }}}_{1},\widehat{{\widetilde{\alpha }}}_{2}, \widehat{{\widetilde{\varvec{\beta }}}}_{10}', \widehat{{\widetilde{\varvec{\beta }}}}_{20}')=\mathop {\textrm{Argmin}}\nolimits _{{\widetilde{\varvec{\gamma }}}} L_{\lambda _1,\lambda _2}^{\text {PPLasso}}({\widetilde{\varvec{\gamma }}}), \end{aligned}$$
(10)
for each fixed \(\lambda _1\) and \(\lambda _2\). To better estimate \({\widetilde{\varvec{\beta }}}_1\) and \({\widetilde{\varvec{\beta }}}_2\), a thresholding is applied to \(\widehat{{\widetilde{\varvec{\beta }}}}_{0}(\lambda _1,\lambda _2)=(\widehat{{\widetilde{\varvec{\beta }}}}_{10}(\lambda _1,\lambda _2)',\widehat{{\widetilde{\varvec{\beta }}}}_{20}(\lambda _1,\lambda _2)')'\). For \(K_1\) (resp. \(K_2\)) in \(\{1,\ldots ,p\}\), let \(\text {Top}_{K_1}\) (resp. \(\text {Top}_{K_2}\)) be the set of indices corresponding to the \(K_1\) (resp. \(K_2\)) largest components of \(|\widehat{{\widetilde{\varvec{\beta }}}}_{10}(\lambda _1,\lambda _2)|\) (resp. \(|\widehat{{\widetilde{\varvec{\beta }}}}_{20}(\lambda _1,\lambda _2)|\)). The estimator of \({\widetilde{\varvec{\beta }}}=({\widetilde{\varvec{\beta }}}_1',{\widetilde{\varvec{\beta }}}_2')'\) after this correction is then denoted by \(\widehat{{\widetilde{\varvec{\beta }}}}(\lambda _1,\lambda _2)=(\widehat{{\widetilde{\varvec{\beta }}}}_{1}^{({\widehat{K}}_1)}(\lambda _1,\lambda _2), \widehat{{\widetilde{\varvec{\beta }}}}_{2}^{({\widehat{K}}_2)}(\lambda _1,\lambda _2))\), where the jth component of \(\widehat{{\widetilde{\varvec{\beta }}}}_{i}^{(K_i)}(\lambda _1,\lambda _2)\), for \(i=1\) or 2, is defined by:
$$\begin{aligned} \widehat{{\widetilde{\varvec{\beta }}}}_{ij}^{(K_i)}(\lambda _{1}, \lambda _{2})= {\left\{ \begin{array}{ll} \widehat{{\widetilde{\varvec{\beta }}}}_{i0j}(\lambda _{1}, \lambda _{2}), & j \in \text {Top}_{K_i} \\ K_i\text {th largest value of } |\widehat{{\widetilde{\varvec{\beta }}}}_{i0j}(\lambda _{1}, \lambda _{2})|, & j \not \in \text {Top}_{K_i}. \end{array}\right. } \end{aligned}$$
(11)
Note that the corrections are only performed on \(\widehat{{\widetilde{\varvec{\beta }}}}_0\); the estimators \(\widehat{{\widetilde{\alpha }}}_{1}\) and \(\widehat{{\widetilde{\alpha }}}_{2}\) are not modified. The choice of \(K_1\) and \(K_2\) is explained in “Choice of the parameters \(K_1, K_2\), \(M_1\) and \(M_2\)” section.
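In R, the correction (11) can be implemented by the following helper (the function name is ours):

```r
# Correction (11): keep the Top_K components of beta unchanged and
# replace all the others by the K-th largest absolute value
threshold_topK <- function(beta, K) {
  ord <- order(abs(beta), decreasing = TRUE)
  out <- rep(abs(beta[ord[K]]), length(beta))    # K-th largest |beta_j|
  out[ord[seq_len(K)]] <- beta[ord[seq_len(K)]]  # Top_K components kept as-is
  out
}
```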
To illustrate the interest of this thresholding step, we generated a dataset based on Model (3) with the parameters described in “Simulation setting” section and \(p=500\). Moreover, to simplify the graphical illustrations, we focus on the case where \(\lambda _{1}=\lambda _{2}=\lambda\). Figure 1 displays the estimation error associated with the estimators of \({\widetilde{\varvec{\beta }}}(\lambda )\) before and after thresholding. We can see from this figure that the estimation of \({\widetilde{\varvec{\beta }}}(\lambda )\) is less biased after the correction. Moreover, we observed that this thresholding strongly improves the final estimation of \(\varvec{\gamma }\) and the variable selection performance of our method.
Estimation of \(\varvec{\gamma }\)
With \(\widehat{{\widetilde{\varvec{\beta }}}} = (\widehat{{\widetilde{\varvec{\beta }}}}_{1}', \widehat{{\widetilde{\varvec{\beta }}}}_{2}')\), the estimators of \(\varvec{\beta }_1\) and \(\varvec{\beta }_2-\varvec{\beta }_1\) can be obtained by \({\widehat{\varvec{\beta }}}_{10}=\varvec{\Sigma }_{1}^{-1/2}\widehat{{\widetilde{\varvec{\beta }}}}_{1}\) and \(({\widehat{\varvec{\beta }}}_{20}-{\widehat{\varvec{\beta }}}_{10})=\varvec{\Sigma }_{2}^{-1/2}\widehat{{\widetilde{\varvec{\beta }}}}_{2}-\varvec{\Sigma }_{1}^{-1/2}\widehat{{\widetilde{\varvec{\beta }}}}_{1}\). As previously, another thresholding is applied to \({\widehat{\varvec{\beta }}}_{10}\) and \({\widehat{\varvec{\beta }}}_{20}\): for \(i=1\) or 2,
$$\begin{aligned} {\widehat{\varvec{\beta }}}_{ij}^{(M_i)}(\lambda _{1}, \lambda _{2})= {\left\{ \begin{array}{ll} {\widehat{\varvec{\beta }}}_{i0j}(\lambda _{1}, \lambda _{2}), & j \in \text {Top}_{M_i} \\ 0, & j \not \in \text {Top}_{M_i}, \end{array}\right. } \end{aligned}$$
(12)
for each fixed \(\lambda _1\) and \(\lambda _2\). The biomarkers with non-zero coefficients in \({\widehat{\varvec{\beta }}}_{1}={\widehat{\varvec{\beta }}}_{1}^{(M_1)}\) (resp. \({\widehat{\varvec{\beta }}}_{2}^{(M_2)}-{\widehat{\varvec{\beta }}}_{1}^{(M_1)}\)) are considered as prognostic (resp. predictive) biomarkers, where the choice of \(M_1\) and \(M_2\) is explained in “Choice of the parameters \(K_1, K_2\), \(M_1\) and \(M_2\)” section.
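This second thresholding (12) amounts to a hard thresholding. A sketch in R, where the helper name is ours and the commented quantities are assumed to come from the previous steps:

```r
# Thresholding (12): components outside Top_M are set exactly to zero
threshold_hard <- function(beta, M) {
  ord <- order(abs(beta), decreasing = TRUE)
  out <- rep(0, length(beta))
  out[ord[seq_len(M)]] <- beta[ord[seq_len(M)]]
  out
}

# Back-transformation and selection, assuming beta_tilde_1 and beta_tilde_2
# are the corrected estimators of the previous step and M1, M2 are given:
# beta10_hat <- inv_sqrt(Sigma) %*% beta_tilde_1  # Sigma_1^{-1/2} beta_tilde_1
# beta20_hat <- inv_sqrt(Sigma) %*% beta_tilde_2  # Sigma_2^{-1/2} beta_tilde_2
# beta1_hat  <- threshold_hard(beta10_hat, M1)
# prognostic <- which(beta1_hat != 0)
# predictive <- which(threshold_hard(beta20_hat, M2) - beta1_hat != 0)
```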
To illustrate the benefits of this additional thresholding step, we used the dataset described in “Estimation of \({\widetilde{\varvec{\gamma }}}\)” section and, to simplify the graphical illustrations, we again focus on the case where \(\lambda _{1}=\lambda _{2}=\lambda\). Additional file 1: Figure S1 displays the numbers of true positives (TP) and false positives (FP) in prognostic and predictive biomarker identification with and without the second thresholding. We can see from this figure that this thresholding stage limits the number of false positives. Note that \(\alpha _1\) and \(\alpha _2\) are estimated by \(\widehat{{\widetilde{\alpha }}}_{1}\) and \(\widehat{{\widetilde{\alpha }}}_{2}\) defined in (10).
Choice of the parameters \(K_1, K_2\), \(M_1\) and \(M_2\)
For each \((\lambda _1, \lambda _2)\) and each \((K_1, K_2)\), we computed:
$$\begin{aligned} \widetilde{\text {MSE}}_{K_1, K_2}(\lambda _1, \lambda _2)=\Vert {\textbf{y}}-{\widetilde{{\textbf{X}}}}\widehat{{\widetilde{\varvec{\gamma }}}}^{(K_1, K_2)}(\lambda _1, \lambda _2)\Vert _2^2, \end{aligned}$$
(13)
where \(\widehat{{\widetilde{\varvec{\gamma }}}}^{(K_1, K_2)}(\lambda _1, \lambda _2)=(\widehat{{\widetilde{\alpha }}}_1,\widehat{{\widetilde{\alpha }}}_2, \widehat{{\widetilde{\varvec{\beta }}}}_1^{(K_1)'},\widehat{{\widetilde{\varvec{\beta }}}}_2^{(K_2)'})\) is defined from (10) and (11). These values are displayed in the left part of Fig. 2.
For each \((\lambda _{1}, \lambda _{2})\), each \(K_1\), and a given \(\delta \in (0,1)\), the parameter \(\widehat{K_2}\) is then chosen as follows:
$$\begin{aligned} \widehat{K_2}(\lambda _{1}, \lambda _{2})=\min \left\{ K_2\ge 1 \text { s.t. }\frac{\widetilde{\text {MSE}}_{(K_1, K_2+1)}(\lambda _{1}, \lambda _{2})}{\widetilde{\text {MSE}}_{(K_1,K_2)}(\lambda _{1}, \lambda _{2})} \ge \delta \right\} . \end{aligned}$$
The values of \({\widehat{K}}_2\) associated with each \(K_1\) are displayed with ‘*’ in the left part of Fig. 2. Then, \(\widehat{K_1}\) is chosen using a similar criterion:
$$\begin{aligned} \widehat{K_1}(\lambda _{1}, \lambda _{2})=\min \left\{ K_1\ge 1 \text { s.t. }\frac{\widetilde{\text {MSE}}_{(K_1+1, {\widehat{K}}_2)}(\lambda _{1}, \lambda _{2})}{\widetilde{\text {MSE}}_{(K_1,{\widehat{K}}_2)}(\lambda _{1}, \lambda _{2})} \ge \delta \right\} . \end{aligned}$$
The values of \(\widetilde{\text {MSE}}_{(K_1,{\widehat{K}}_2)}(\lambda _{1}, \lambda _{2})\) are displayed in the right part of Fig. 2 in the particular case where \(\lambda _{1}=\lambda _{2}=\lambda\), \(\delta =0.95\), and with the same dataset as the one used in “Estimation of \({\widetilde{\varvec{\gamma }}}\)” section; \({\widehat{K}}_1\) is displayed with a red star. This value of \(\delta\) is used in the following sections; choosing \(\delta\) in the range (0.9, 0.99) does not have a strong impact on the variable selection performance of our approach.
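This rule amounts to stopping as soon as adding one more component no longer decreases the MSE by a factor of at least \(\delta\). A sketch in R, where the helper name is ours; the same helper applies verbatim to the choice of \({\widehat{M}}_1\) and \({\widehat{M}}_2\) described next, with the MSE defined below:

```r
# Smallest K such that MSE_{K+1} / MSE_K >= delta, i.e. the first K after
# which adding components brings (almost) no further decrease of the MSE.
# mse: vector of MSE values for K = 1, ..., K_max (the other index fixed)
select_K <- function(mse, delta = 0.95) {
  ratios <- mse[-1] / mse[-length(mse)]
  hits <- which(ratios >= delta)
  if (length(hits) == 0) length(mse) else min(hits)
}
```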
The parameters \({\widehat{M}}_1\) and \({\widehat{M}}_2\) are chosen in a similar way except that \(\widetilde{\text {MSE}}_{K_1, K_2}(\lambda _1, \lambda _2)\) is replaced by \(\widehat{\text {MSE}}_{M_1,M_2}(\lambda _1, \lambda _2)\) where:
$$\begin{aligned} \widehat{\text {MSE}}_{M_1,M_2}(\lambda _1, \lambda _2)=\Vert {\textbf{y}}-{\textbf{X}}{\widehat{\varvec{\gamma }}}^{(M_1,M_2)}(\lambda _1, \lambda _2)\Vert _2^2, \end{aligned}$$
with \({\widehat{\varvec{\gamma }}}^{(M_1,M_2)}(\lambda _1, \lambda _2)=(\widehat{{\widetilde{\alpha }}}_1,\widehat{{\widetilde{\alpha }}}_2, {\widehat{\varvec{\beta }}}_1^{(M_1)'},{\widehat{\varvec{\beta }}}_2^{(M_2)'})\) defined in (10) and (12). In the following, \({\widehat{\varvec{\gamma }}}(\lambda _1, \lambda _2)={\widehat{\varvec{\gamma }}}^{({\widehat{M}}_1,{\widehat{M}}_2)}(\lambda _1, \lambda _2)\).
Estimation of \(\varvec{\Sigma }_1\) and \(\varvec{\Sigma }_2\)
As the empirical correlation matrix is known to be an inaccurate estimator of \(\varvec{\Sigma }\) when p is larger than n, another estimator has to be used. Thus, for estimating \(\varvec{\Sigma }\), we adopted a cross-validation based method proposed by [5] and implemented in the cvCovEst R package [6]. This method selects the estimator having the smallest estimation error among several candidates (for example, the sample correlation matrix, POET [11] and tapering [7]). Since the samples in treatments \(t_1\) and \(t_2\) are assumed to be collected from the same population, \(\varvec{\Sigma }_1\) and \(\varvec{\Sigma }_2\) are assumed to be equal.
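A minimal sketch of this step, pooling the two arms since \(\varvec{\Sigma }_1=\varvec{\Sigma }_2\); the call follows the cvCovEst documentation, but the available estimators and argument names should be checked against the installed version of the package:

```r
library(cvCovEst)

X_pooled <- rbind(X1, X2)  # both treatment arms, since Sigma_1 = Sigma_2
cv_fit <- cvCovEst(
  dat = X_pooled,
  estimators = c(sampleCovEst, linearShrinkLWEst, thresholdingEst),
  estimator_params = list(thresholdingEst = list(gamma = c(0.2, 0.4))),
  cv_scheme = "v_fold", v = 5
)
Sigma_hat <- cv_fit$estimate  # estimate selected by cross-validation
```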
Choice of the parameters \(\lambda _1\) and \(\lambda _2\)
For the sake of simplicity, we limit ourselves to the case where \(\lambda _1=\lambda _2=\lambda\). To choose \(\lambda\), we used the BIC (Bayesian Information Criterion), which is widely used in the variable selection field and consists in minimizing the following criterion with respect to \(\lambda\):
$$\begin{aligned} \text {BIC}(\lambda )= n\log (\text {MSE}(\lambda )/n)+ k(\lambda )\log (n), \end{aligned}$$
(14)
where n is the total number of samples, \(\text {MSE}(\lambda )=\Vert {\textbf{y}}-{\textbf{X}}{\widehat{\varvec{\gamma }}}(\lambda )\Vert _2^2\), and \(k(\lambda )\) is the number of non-null coefficients in the OLS estimator \({\widehat{\varvec{\gamma }}}\) obtained by re-estimating only the non-null components of \({\widehat{\varvec{\beta }}}_1\) and \({\widehat{\varvec{\beta }}}_2-{\widehat{\varvec{\beta }}}_1\). The values of the BIC criterion, as well as those of the MSE obtained from the dataset described in “Estimation of \({\widetilde{\varvec{\gamma }}}\)” section, are displayed in Fig. 3.
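Criterion (14) is straightforward to compute once the OLS refit on the selected support is available; a sketch:

```r
# BIC of (14): n * log(MSE / n) + k * log(n), where k is the number of
# non-null coefficients kept in the OLS refit on the selected biomarkers
bic <- function(y, X, gamma_hat, k) {
  n <- length(y)
  mse <- sum((y - X %*% gamma_hat)^2)
  n * log(mse / n) + k * log(n)
}
```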
Additional file 1: Table S1 provides the True Positive Rate (TPR) and the False Positive Rate (FPR) when \(\lambda\) is chosen by minimizing either the MSE or the BIC criterion for this dataset. We can see from this table that both choices achieve TPR = 1 (all true positives are identified). However, the FPR obtained with the BIC criterion is smaller than the one obtained with the MSE.
Note that additional results using two different parameters \(\lambda _1\) and \(\lambda _2\) in the BIC criterion are provided in “Two parameters \(\lambda _1\) and \(\lambda _2\) v.s. \(\lambda\) in the BIC Criterion” section.