 Research
 Open Access
 Published:
Predicting chemotherapy response using a variational autoencoder approach
BMC Bioinformatics volume 22, Article number: 453 (2021)
Abstract
Background
Multiple studies have shown the utility of transcriptomewide RNAseq profiles as features for machine learningbased prediction of response to chemotherapy in cancer. While tumor transcriptome profiles are publicly available for thousands of tumors for many cancer types, a relatively modest number of tumor profiles are clinically annotated for response to chemotherapy. The paucity of labeled examples and the high dimension of the feature data limit performance for predicting therapeutic response using fullysupervised classification methods. Recently, multiple studies have established the utility of a deep neural network approach, the variational autoencoder (VAE), for generating meaningful latent features from original data. Here, we report the first study of a semisupervised approach using VAEencoded tumor transcriptome features and regularized gradient boosted decision trees (XGBoost) to predict chemotherapy drug response for five cancer types: colon, pancreatic, bladder, breast, and sarcoma.
Results
We found: (1) VAEencoding of the tumor transcriptome preserves the cancer type identity of the tumor, suggesting preservation of biologically relevant information; and (2) as a featureset for supervised classification to predict responsetochemotherapy, the unsupervised VAE encoding of the tumor’s gene expression profile leads to better area under the receiver operating characteristic curve and area under the precisionrecall curve classification performance than the original gene expression profile or the PCA principal components or the ICA components of the gene expression profile, in four out of five cancer types that we tested.
Conclusions
Given highdimensional “omics” data, the VAE is a powerful tool for obtaining a nonlinear lowdimensional embedding; it yields features that retain biological patterns that distinguish between different types of cancer and that enable more accurate tumor transcriptomebased prediction of response to chemotherapy than would be possible using the original data or their principal components.
Introduction
Background
Although chemotherapy is a mainstay of treatment for aggressive cancers, many agents have serious side effects [1]. Whether or not chemotherapy will provide a net benefit to a patient depends in large part on whether the malignancy responds to the treatment. Chemotherapy is often administered in cycles [2], leading to multiple opportunities where treatment appropriateness may be (re)assessed [3]. Currently, the medical costbenefit of chemotherapy (versus a nonpharmaceutical approach) is assessed in light of patient health status, expected therapeutic tolerance, and tumor pathological classification [4, 5]. For many cancer types, there is a broad spectrum of cases where the decision of whether or not to undergo chemotherapy is difficult [6,7,8]. The development of a quantitative model that could predict—based on a specific tumor’s molecular profile—whether or not the tumor will respond to chemotherapy would have significant clinical utility. Moreover, an advance in machinelearning methods for the responsetochemotherapy prediction problem [9, 10] would have potential benefits for other prediction problems in medicine.
Tumorigenesis is driven by alterations in the somatic genome and epigenome in cancer cells [11]; however, the somatic genetic or epigenetic determinants of response to chemotherapy also affect gene expression. Studies of various cancer types have demonstrated that tumor gene expression biomarkers correlate with the probability that a tumor will respond to chemotherapy, for example, a fiveprotein signature in breast cancer [12], a 13gene signature in rectal cancer [13, 14], a 63gene signature in liver cancer [15], and a support vector machine (SVM)based model to predict survival time in breast cancer based on a 19gene signature [16]. The findings from such “omics” studies suggest that RNA sequencing (RNAseq)based transcriptome measurements of tumor samples labeled for clinical response can be used to train machinelearning classifiers for predicting response to chemotherapy. However, the accuracy of models that can be learned by fully supervised approaches is limited by the small number of available clinically labeled training cases, given that tumor transcriptome data are highvariance and highdimensional.
For typical cancers, most available tumor transcriptomes are not labeled for chemotherapeutic response; the ratio of such unlabeled to labeled tumor datasets in the Cancer Genome Atlas (TCGA; [17]) is in the range of 10–20, depending on the cancer type. Unlabeled data are a substantial resource that could—in the context of a semisupervised approach—reveal multivariate patterns that could ultimately improve predictive accuracy. Semisupervised approaches that fuse unsupervised data reduction methods for lowdimensional embedding with supervised methods (such as decision trees) for prediction have proved beneficial in problems where large unlabeled datasets are available; for example, a principal components analysis (PCA)XGBoost method has been previously used in finance [18], and an independent component analysis (ICA)based method has been used to classify electroencephalographic signals [19].
Previous applications of VAE in cancer
Multiple studies [20,21,22,23] have demonstrated the power of the variational autoencoder (VAE; [24, 25])—an unsupervised nonlinear data embedding model in which two deep neural networks are oppositely connected through a lowdimensional, probabilistic latent space—for finding useful features in highdimensional data. In the context of cancer, VAEs have been variously used to (1) model gene expression and capture biological features using the TCGA Pancancer Project RNAseq dataset [26, 27]; (2) find encodings that can be used to predict gene inactivation [28]; and (3) obtain an encoding for predicting chemotherapy resistance [29]. Way and Greene [28] explored VAE architectures for predicting gene inactivation in a pancancer dataset and reported biological insights obtained from the latentspace embeddings. George and Lio [29] used a VAEbased, unsupervised approach to encode tumor transcriptomes to obtain latentspace features associated with chemotherapy response. Dincer et al. [30] used a semisupervised, VAElasso approach to predict drug sensitivity of cancer cells in vitro. In contrast to previous efforts to model cancer cell line drug sensitivity in vitro [30,31,32,33], in this work we focused on predicting therapeutic response in vivo, across five different cancer types (colon adenocarcinoma, pancreatic adenocarcinoma, bladder carcinoma, sarcoma, and breast invasive carcinoma). Specifically, we tested the hypothesis that a tumor transcriptome VAE would be useful for predicting responsetochemotherapy in vivo, across multiple cancer types.
Research objectives
We first asked to what extent VAEencoding tumor transcriptomes would preserve characteristics that are associated with distinct cancer types. To that end, we trained a pancancer transcriptome VAE and used it to encode over 11k tumor transcriptomes from 33 cancer types. By comparing twodimensional embeddings of the original tumor transcriptomes with embeddings of the VAEencoded transcriptomes, we found (“VAE encoding preserves cancer type features” section) that the VAE preserves the clustering of tumors of the same cancer type. Next, we selected five cancer types based on sufficiency of clinical data and trained six VAE models (three architectures and two different loss functions) to encode clinicallyunlabeled transcriptomes of the five cancer types. Using TCGA clinical data, we assigned a label “responded” or “progressive” to tumors where the response to chemotherapy information was available (“Obtaining a labeled tumor transcriptome dataset” section). We then used the VAEencoded transcriptomes for the clinicallylabeled tumors as feature data for predicting response to chemotherapy using gradient boosted decision trees (XGBoost; [34]), which we found to be superior to kernel SVM. Using this “semisupervised VAEXGBoost” approach, we investigated (“L1 loss is better than L2 loss and crossentropy loss for this application” section) which loss function type is best for this VAE application.
In the main part of this work, we focused (“Chemotherapy response classification results” section) on the question of whether and to what extent the semisupervised VAEXGBoost (our new method, Fig. 1) approach would improve performance for transcriptomebased prediction of response to chemotherapy, versus a fullysupervised approach or versus alternative semisupervised approaches using PCA or ICA transcriptome encodings. We further investigated the relative importance of these approaches through the lens of XGBoost feature importance (“PCA & VAE feature importance scores, for COAD” section). We carried out these analyses using a comprehensive, fivecancer set of labeled tumor transcriptomes and obtained unbiased classification performance measurements using crossvalidation.
Results
VAE encoding preserves cancer type features
Given reports [35, 36] that unsupervised embeddings can be used to visualize the grouping of cancer types based on highdimensional molecular tumor data, using unsupervised methods, we investigated the extent to which VAE encoding of tumor transcriptomes preserves dataspace features that determine cancer typespecific groupings. In order to do so, we obtained RNAseq transcriptome data from the TCGA data portal for 11,057 tumors labeled for 33 different cancer types (Figs. 2, Additional file 1: S2, S3). As a baseline visualization, we generated a twodimensional embedding of the 11,057 tumor samples by applying tdistributed stochastic neighbor embedding (tSNE) to the expression levels of the the top20% highestvariance genes (threshold selected as described in “Gene expression data” section), yielding 33 clusters (Fig. 2A). Next, we trained a VAE (“Variational autoencoder (VAE), VAE model architectures” sections) with a deep architecture (VAE1) to encode the expression levels of the highestvariance genes in each of 11,057 tumors into an equivalent number of points in a 50dimensional latent space. An unsupervised tSNE visualization (Fig. 2B) of the VAEencoded tumor transcriptome data was remarkably similar in structure to the tSNE visualization of the 13,584dimensional original dataset (Additional file 1: Fig. S1). Additionally, we compared the clustering of the original transcriptome data with VAEreconstructed transcriptome data by Uniform manifold approximation and projection (UMAP), and found similar results (Additional file 1: Figs. S2, S3). These analyses indicated that the VAE encoding preserves dataspace features that distinguish individual cancer types.
Obtaining a labeled tumor transcriptome dataset
Having demonstrated that the VAE can efficiently encode tumor transcriptomes while preserving features that distinguish different cancer types, and to set the stage for implementing a semisupervised approach for predicting response to chemotherapy, we obtained a fivecancertype tumor transcriptome dataset with a significant subset of the tumors labeled as to whether or not the patient responded to chemotherapy, as described below. We obtained transcriptomes of 2,606 tumors across five cancer types [colon adenocarcinoma (COAD), pancreatic adenocarcinoma (PAAD), bladder carcinoma (BLCA), sarcoma (SARC), and breast invasive carcinoma (BRCA); Table 1]. We selected the five cancer types based on availability of a sufficient amount of labeled data in TCGA and for 806 of the tumor transcriptomes, we generated binary labels corresponding to “responded” or “progressive”.
The ratio of responding tumors to progressive disease tumors (i.e., the class balance ratio) ranged from a low of 0.77 for pancreatic cancer to a high of 8.61 for breast cancer.
L1 loss is better than L2 loss and crossentropy loss for this application
Having obtained 2,606 transcriptomes of tumors of five cancer types (with 806 of the tumors labeled by response), we next sought to determine which type of VAE reconstruction loss function—L1, L2, or binary cross entropy—would yield transcriptome encodings that are most amenable to accurate XGBoostbased prediction chemotherapy response. On the 2,606 tumor transcriptomes, we trained three sets of cancer typespecific VAEs (“VAE model architectures” section) using L1 loss, L2 loss, and binary crossentropy loss respectively. We used the L1, L2, and binary crossentropy VAEs to encode the 806 labeled tumor transcriptomes (the top 20% most variable genes in each cancer type, merged across the five cancers, for a total of 13,584 genes) spanning the five cancer types, yielding (for each cancer type) three feature matrices: one based on L1 loss, one based on L2 loss, and a third one based on binary crossentropy loss. We separately evaluated the three feature matrices for XGBoost prediction of the binary responsetochemotherapy class label. By testset area under the receiver operating characteristic (AUROC), averaged across the five cancers, we found (Fig. 3) that the features that were generated by the L1 VAEs led to \(6.2\%\) better (\(p < 10^{9}\), Welch’s ttest) classification performance than the features generated by the L2 VAEs, \(11.7\%\) better (\(p < 10^{9}\), Welch’s ttest) classification performance than the features generated by the binary crossentropy VAEs and thus, for all subsequent analyses, we used VAEs trained with L1 loss.
Chemotherapy response classification results
Having selected L1 reconstruction loss for training VAEs to encode tumor transcriptomes for predicting responsetochemotherapy, we developed a semisupervised approach based on VAE encoding of the tumor transcriptome, for predicting chemotherapy response. In brief, our approach consisted of three steps:

1.
Training a VAE to encode clinically unlabeled tumor transcriptomes (for the top 20% most variable genes) for a single cancer type, into a lowdimensional space (“VAE model architectures” section).

2.
Using that VAE to obtain latentspace encodings for the tumor transcriptomes that are labeled for a relevant clinical endpoint (in this work, response to chemotherapy).

3.
Training and testing a supervised classifier for predicting chemotherapy response.
Because some cancer types benefited from a deeper VAE network architecture than others for effective encoding, we used three different VAE architectures for learning features for predicting chemotherapy response in the context of three subsets of cancer types (VAE1 for breast and pancreatic; VAE2 for colon; and VAE3 for bladder and sarcoma; Table 5). For each VAE architecture, our approach was to use all of the data from the fivecancer set of 2,606 unlabeled tumors for VAE training, but for predicting chemotherapy response for a given cancer type, we used encodings from the VAE architecture that corresponds to the cancer type (Table 5).
To select the supervised classification algorithm for step (3) above, we used an empirical approach, comparing the AUROC performance of XGBoost, kernel SVM, and knearest neighbors for predicting sarcoma response to chemotherapy with features based on VAE3 encodings (semisupervised) or expression levels of individual genes (fullysupervised). We found (Additional file 1: Fig. S4) XGBoost to be superior to kernel SVM and knearest neighbors (KNN), in both semisupervised and fully supervised analyses, and thus we chose XGBoost as the classification algorithm for subsequent analyses.
To address the primary question of to what extent a VAEbased, semisupervised (VAEXGBoost) approach could advance the stateoftheart for transcriptomebased prediction of chemotherapy response, we sought to compare VAEXGBoost’s performance to that of three alternative approaches: (1) a semisupervised approach using a regular autoencoder (AE) with the same architecture as the VAE; (2) a fully supervised approach directly using the transcriptome data; and (3) a semisupervised approach based on a traditional dimensional reduction technique (either principal component analysis, PCA; or independent component analysis, ICA).
VAEXGBoost versus AEXGBoost
To address alternative approach (1) (“traditional autoencoder”), we compared the performance of VAEXGBoost to that of a model consisting of a regular autoencoder combined with XGBoost (“AEXGBoost”; Table 2 and Additional file 1: Figs. S5–S6). For this fourcancer analysis, we used the VAE1 architecture for BRCA and PAAD, which was the same network that we used in the tSNE analysis and the VAE3 architecture for BLCA and SARC (Table 5). We measured performance using testset AUROC and AUPRC using fivefold crossvalidation. We found (Table 2) that VAEXGBoost outperformed AEXGBoost by an average AUROC increase of 14.5% and an average AUPRC increase of 16.3% over the fourcancer average (breast, pancreatic, bladder, and sarcoma), with (\(p < 10^{9}\), Welch’s ttest) classification performance. Thus, we used the VAE for neural networkbased unsupervised embeddings, for subsequent analyses.
VAEXGBoost versus fullysupervised XGBoost
To address alternative approach (2) (“fully supervised”), we empirically compared the performance of the VAEXGBoost method to a fully supervised model in which we applied XGBoost directly to the tumor expression levels of the top 20% most variable genes (13,584 genes) as feature data. For five out of five cancer types (breast, colon, pancreatic, bladder, and sarcoma), in terms of testset AUROC, the VAEXGBoost approach outperformed the fullysupervised XGBoost approach (Additional file 1: Fig. S7), by Welch’s ttest (Table 3). In terms of testset AUPRC, for four out of five cancer types (breast, colon, pancreatic, and bladder), the VAEXGBoost approach outperformed the fullysupervised approach of applying XGBoost directly to the expression levels of the tumors’ top 20% most variable genes (Additional file 1: Fig. S8), by Welch’s ttest (Table 4); for SARC, the semisupervised VAEXGBoost and fullysupervised models’ performances were statistically indistinguishable.
VAEXGBoost versus PCAXGBoost and ICAXGBoost
To address alternative approach (3), we empirically compared VAEXGBoost to models in which PCA or ICA components were used as XGBoost features (i.e., “PCAXGBoost” and “ICAXGBoost”). We aimed to empirically study prediction performance of these models for each of the five cancer types separately, using the set of cancer typespecific labeled tumors (806 labeled tumors in all). For four out of five cancer types (bladder, breast, pancreatic, and sarcoma), in terms of AUROC the semisupervised VAEXGBoost method significantly outperformed the semisupervised PCAXGBoost method (Table 3 and Additional file 1: Fig. S7). Additionally, for three out of five cancer types (breast, colon, and pancreatic), the semisupervised VAEXGBoost method significantly outperformed the semisupervised ICAXGBoost method (Table 3 and Additional file 1: Fig. S7). The fivecancer average AUROC for VAEXGBoost was 0.688, a performance gain of 6.3% over the fivecancer average AUROC for PCAXGBoost (0.646), a gain of 6.5% over the ICAXGBoost (0.645) and a gain of 4.5% over the fullysupervised model’s average (0.658). Notably, a single deep VAE architecture (VAE1, which had a 50dimensional latent space and six layers in the encoder) yielded latentspace encodings that outperformed semisupervised PCAXGBoost for two cancer types (breast and pancreatic); a single shallow VAE architecture (VAE3, which had a 500dimensional latent space and two layers in the encoder) yielded latentspace encodings that outperformed semisupervised PCAXGBoost for two cancer types (bladder and sarcoma).
For three out of five cancer types (breast, bladder, and pancreatic), in terms of AUPRC the semisupervised VAEXGBoost method significantly outperformed the semisupervised PCAXGBoost method (Additional file 1: Fig. S8 and Table 4). Additionally, for three out of five cancer types (breast, colon, and pancreatic), the semisupervised VAEXGBoost method significantly outperformed the semisupervised ICAXGBoost method (Additional file 1: Fig. S8 and Table 4). The fivecancer average AUPRC for VAEXGBoost was 0.441, a performance gain of 9.1% over the fivecancer average AUPRC for PCAXGBoost (0.403), a gain of 8.2% over the ICAXGBoost (0.406), and a gain of 8.5% over the fullysupervised model’s average (0.405).
PCA & VAE feature importance scores, for COAD
Having established that the semisupervised VAEXGBoost outperforms the semisupervised PCAXGBoost approach for tumor transcriptomebased prediction of chemotherapy response for four out of five cancer types, we sought to understand the basis for the higher performance of PCAXGBoost over VAEXGBoost on the fifth cancer type, colon adenocarcinoma (COAD). Specifically, we investigated whether the strong performance of PCAXGBoost on COAD is attributable to differences in the distributions of XGBoost feature importance scores of the PCA features versus VAE latentspace features. We found that the distribution of feature importance scores (as a function of rank) was more sharply peaked at lowestranked features in the VAE than in the PCA (Fig. 4), suggesting that the performance gain with PCA reflects a broader spectrum of informative features for that particular cancer type.
Discussion
As far as we are aware, this work is the first report of a multicancer investigation of the potential for a VAEbased, semisupervised approach for predicting in vivo chemotherapy response from the tumor transcriptome. Across the five cancer types that we studied, the ratio of responding tumors to progressive disease tumors ranged from a low of 0.77 for pancreatic cancer to a high of 8.61 for breast cancer, reflecting a broad range of resistances to standardofcare chemotherapy. Our results clearly demonstrate the utility of the VAE for compressing highdimensional data to a continuous, lowdimensional latent space while retaining features that are essential for distinguishing different cancer types and for predicting response to chemotherapy. Nevertheless, three limitations of this work bear noting.
The first limitation concerns the type(s) of tumor “omics” data from which features are derived for the predictive model. While in this work we focused on tumor transcriptome data which can be measured with high precision over a wide dynamic range of transcript abundances by RNAseq, we note that TCGA datasets of tumor somatic mutations and copy number alteration events are also available [17]. Given the voluminous literature on the use of tumor somatic genomic data for precision cancer diagnosis [37,38,39], tumor DNA datasets are fertile ground for developing a semisupervised, multiomics model for predicting response to chemotherapy.
Second, for decision treebased responsetochemotherapy prediction, the performance of VAEencoded transcriptome features is somewhat sensitive to the type of normalization used for the gene expression levels (data not shown). We explored various published normalization methods for the RNAseq data including standardization of log counts and using FPKM; we ultimately chose minmaxnormalized \(\hbox {log}_2\) totalcountnormalized counts for the gene expression levels to be used to derive features. However, there are additional transcript quantification methods [40] that could be explored in the context of finding optimal tumor transcriptome VAE encodings for precision oncology. A similar comment applies to the specific form of the reconstruction loss function: in our analysis, features from the VAE trained with L1 loss clearly (across five cancers) outperformed those from the VAE trained with L2 or crossentropy loss, and thus, consistent with Way and Greene [28], we used L1 loss for the VAE that we used to address the main question of this work as well as the pancancer tSNE analysis.
The third limitation relates to the VAE architecture. While it is promising that a single deep VAE architecture (VAE1, with a 50dimensional latent space and six fullyconnected layers) yielded features that outperformed PCA and the original RNAseq feature data for two different cancer types (breast and pancreatic), for the other three cancer types, it was necessary to use shallower (twolayer) VAE architectures with bigger latent space dimensions (400 and 500, respectively). Of the five cancer types that we studied, colon cancer and sarcoma had the lowest proportions of labeled samples (0.229 and 0.245 respectively; see Table 1). Our findings suggest that for some cancers, a deep, lowlatentdimension VAE architecture yields optimal features for predicting response, while for other cancers, a shallow, mediumsizedlatentdimension VAE architecture is more effective. Hu and Greene [41], based on a study employing singlecell transcriptome profiling, noted substantial performance differences with hyperparameter tuning on VAE architectures; they further noted that in terms of the robustness of performance with respect to hyperparameter variation, a base VAE with two layers was better than a deeper VAE architecture. Lakhmiri et al. [42] reported VAE architecture hyperparameter tuning as well as the training phase have a great impact on the overall precision of the network and its ability to generalize, and proposed \(\Delta\)MADS, a hybrid derivativefree optimization algorithm for VAE fitting. More study with larger datasets will be required in order to determine whether a single VAE architecture could be successfully used for generalpurpose tumor transcriptome feature extraction for precision oncology.
While our results show promise for the VAE in the context of a semisupervised approach for responsetochemotherapy prediction, for colon cancer, the VAEXGBoost method did not outperform PCAXGBoost (though it did outperform the fully supervised approach of XGBoost trained on the unencoded gene expression data). One possible explanation for the colon cancerspecific superior performance of PCA features over VAE features for predicting response to chemotherapy may be due to the fact that while (for COAD) feature importance for the VAE features is sharply peaked for the first few features and falls off fairly rapidly with feature rank, the PCA features have a significantly flatter distribution of relative feature importance (Fig. 4). Followon studies with larger datasets will be required to delineate under what circumstances transcriptome VAE encodings will prove superior to linear principal components. Multiple groups have argued [43,44,45] that to improve current precision oncology models, significantly expanded training datasets are needed to overcome the challenges posed by tumor heterogeneity, and that models must more broadly leverage somatic genetic and epigenetic information. We anticipate that the performance of VAEXGBoost could improve significantly with more unlabeled and labeled tumor transcriptome data. Finally, we note a possible future extension of this work that will become feasible when larger training datasets are available: because response to chemotherapy is drugdependent, the XGBoost classifier can easily include and use the chemotherapy drug type used for the patient (Additional file 1: Table S1) as a categorical feature.
Conclusions
For four of the five cancer types that we studied, the semisupervised VAEXGBoost approach significantly outperformed a semisupervised PCAXGBoost approach for tumor transcriptomebased prediction of response to chemotherapy, reaching a top AUROC of 0.738 for pancreatic adenocarcinoma. For three of the five cancer types that we studied, the semisupervised VAEXGBoost approach significantly outperformed a semisupervised ICAXGBoost approach for tumor transcriptomebased prediction of response to chemotherapy. For BLCA and SARC, the semisupervised VAEXGBoost and ICAXGBoost models’ performances were statistically indistinguishable. For five out of five cancer types, the semisupervised VAEXGBoost approach significantly outperformed a fullysupervised approach consisting of XGBoost applied to the expression levels of the top 20% most variably expressed genes. Given highdimensional “omics” data, the VAE is a powerful tool for obtaining a nonlinear lowdimensional embedding; it yields features that retain biological patterns that distinguish between different types of cancer and that enable more accurate tumor transcriptomebased prediction of response to chemotherapy than would be possible using the original data or their principal components.
Methods
We carried out all data processing and machinelearning tasks on a Dell XPS 8700 workstation equipped with Nvidia Titan RTX GPU and running the Ubuntu GNU/Linux operating system version 16.04. All of the analysis code that we implemented was executed in Python version 3.5.5 except that we used R version 3.3.3 for statistical analysis of AUROC and AUPRC values (“Area Under ROC Curve (AUROC), Area Under the precisionrecall Curve (AUPRC)” sections), genelevel MAD calculations (“Gene expression data” section) and plotting (“Lowerdimensional embedding” section). We carried out all statistical tests using the R computing environment (version 3.3.3) and using the R software package stats version 3.4.4.
Gene expression data
From the Xena data portal [46], we obtained TCGA Level 3 tumor RNAseq transcriptome data of 33 cancer types (totaling 11, 057 tumors) and, for the responsetochemotherapy prediction problem, five cancer types [colon adenocarcinomas (COAD), pancreatic adenocarcinoma (PAAD), bladder carcinoma (BLCA), sarcoma (SARC), and breast invasive carcinoma (BRCA)] totaling 2,606 tumors. We selected the five cancer types based on two criteria: (1) a sufficient number (at least 65) of paired tumortranscriptome and clinical data samples available for the cancer type; and (2) a sufficient number (at least 180) of tumor transcriptome samples available (regardless of the clinical data availability) for the cancer type. We obtained both the RNAseq (genelevel) totalreadcountnormalized \(\hbox {log}_2 (1+C)\) read counts and normalized (fragments per kilobase of transcript per million mapped reads, FPKM [47]) expression data for for 60,483 human genes. To focus the machinelearning on the portion of the tumor transcriptome that had the most variation from tumor to tumor, we identified the top 20% most variable genes as measured by the median absolute deviation (MAD) across tumors, of gene expression in terms of FPKM (we used FPKM for this purpose in order to mitigate bias due to read length and tumorspecific depth of sequencing) based on our preliminary results for prediction of responsetochemotherapy for SARC, for different quantile thresholds of genes by variability of expression (Additional file 1: Fig. S4). For deriving featuresets for XGBoost prediction directly based on transcript abundances or based on VAE or PCA encoding, the 20% criterion applied to each of the five cancer types yielded a set of 13,584 genes. We computed MAD using the R package stats version 3.4.4 [48] with default parameters. After the variancefiltering step, we used the \(\hbox {log}_2 (1+C)\) of totalcountnormalized count values for the top20% highestvariance genes (that were selected as described above) to obtain (or encode) feature values. We compared the performance—in terms of minimizing the VAE reconstruction loss (see “Variational autoencoder (VAE)” section)—of different feature scaling methods (no scaling, minmax normalization, and standardization [49]) and selected minmax normalization as the method that we used to rescale genelevel count data for input into the VAE.
Lowerdimensional embedding
We computed tSNE embedding components of the tumors using the function sklearn.manifold.TSNE from the python software package scikitlearn version 0.19.1 with parameters \(\mathtt{init = ``pca''}\), \(\mathtt{perplexity = 20}\), \(\mathtt{learning\_rate=300}\), and \(\mathtt{n\_iter=400}\). We computed UMAP embedding components using the function sklearn.manifold.umap.UMAP from the python software package scikitlearn version 0.19.1 with parameters \(\mathtt{n\_neighbors = 50}\), \(\mathtt{min\_dist = 0.3}\), and \(\mathtt{metric=``euclidean''}\). For plotting the embeddings, we used the R software package ggplot2 version 3.1.1.
Variational autoencoder (VAE)
An autoencoder is a type of model that combines “encoder” and “decoder” neural networks to learn a lowdimensional continuous data encoding from which the input signal can be approximately reconstructed [50]. A key advantage of an autoencoder is that it is unsupervised, i.e., it can be trained without labeled examples. Unlike classical autoencoders (e.g., sparse or denoising autoencoders), the variational autoencoder (VAE) is a generative probabilistic model which maps an input vector to a latentspace random variable (r.v.). Below, we mathematically define the VAE.
Let \({{\mathbb {T}}}\) denote the set of tumors for which the VAE is to be fit to the tumor transcriptomes (with \(n \equiv {{\mathbb {T}}}\)) and let m denote the number of genes for which transcript abundances are used to represent the tumor transcriptome. After minmax transformation of the tumor transcriptome measurements (“Gene expression data” section), each tumor’s transcriptome is represented as a vector \({{\varvec{x}}} \in [0,1]^m\). Let \({{\varvec{X}}}\) denote the random variable representing the population distribution from which tumor transcriptomes are sampled, and let \(\varvec{\mathrm {X}} \in [0,1]^{m \times n}\) represent the composite matrix of all sampled tumor transcriptomes). We aim to learn a VAE that will comprise an encoder and decoder, with the encoder consisting of mean and variance functions \({\varvec{\mu }}: [0,1]^{m} \rightarrow {{\mathbb {R}}}^h\) and \({\varvec{\sigma }}: [0,1]^m \rightarrow {{\mathbb {R}}}_{+}^h\), respectively. Together, \({\varvec{\mu }}\) and \({\varvec{\sigma }}\) map the tumor transcriptome vector \({{\varvec{x}}}_t\) to a hdimensional r.v. \({{\varvec{Z}}}{{\varvec{x}}}_t\),
where \(\text {diag}({\varvec{m}})\) is a matrix whose diagonal elements are the elements of the vector \({\varvec{m}}\). This equation is the same as Eq. 1 in Zemouri et al. [51]. The decoder is a function \({\varvec{g}}: {{\mathbb {R}}}^h \rightarrow [0,1]^m\) that, for an outcome \({{\varvec{Z}}}{{\varvec{x}}}_t = {\varvec{\boldsymbol z}}_t \in {{\mathbb {R}}}^h\), maps
the tilde on \(\widetilde{{\varvec{x}}}_t\) denotes that it is the decoded data for the tumor transcriptome \({{\varvec{x}}}_t\). A good autoencoder should have low reconstruction error L, which is convenient to define in terms of the pnorm of the difference between the tumor transcriptome data \({{\varvec{x}}}_t\) and the reconstructed data \(\widetilde{{\varvec{x}}}_t\), i.e., \({{\varvec{x}}}_t  \widetilde{{\varvec{x}}}_t_p^{\,p}\), where \(\;\;_p\) denotes the pnorm. However, this definition of the reconstruction error is only deterministic in the context of a specific outcome \({\varvec{Z}}{\varvec{x}}_t = {\varvec{\boldsymbol z}}_t\). Thus, it is conventional to define the reconstruction error as an expectation value over outcomes of \({\varvec{Z}}{\varvec{x}}_t\),
where \({\mathbb {E}}_{\Omega }\) represents an expectation value over a space of outcomes \(\Omega\). It should be noted the above representation of the reconstruction error is in terms of the outcome, \({{\varvec{\boldsymbol z}}}_t\), of a r.v. (\({\varvec{Z}}{{\varvec{x}}}_t\)) whose distributional parameter functions \({\varvec{\mu }}\) and \({\varvec{\sigma }}\) have hyperparameters (neural network coefficients) that will be fitted. This equation is similar to Eq. 3 in Zemouri et al. [51]. Compared to the binary crossentropy loss used in Eq. 3 in Zemouri et al. [51], our Eq. 3 used L1 loss instead (see findings from an empirical study in “L1 loss is better than L2 loss and crossentropy loss for this application” section demonstrating the superiority of L1 over L2 or binary crossentropy for the VAE reconstruction loss function). Because Eq. 3 is illsuited to backpropagation, it is helpful to recast it in terms of a new random variable \({\varvec{{\mathcal {E}}}}_t\) that depends on \({{\varvec{Z}}}{{\varvec{x}}}_t\) by
It follows from Eqs. 4 and 1 that \({\varvec{\mathcal E}}_t\) is standard multivariate normal,
where \(\varvec{\mathrm {I}}\) is the \(h \times h\) identity matrix, and thus, \({\varvec{{\mathcal {E}}}}_t\) does not depend on \({\varvec{\mu }}\), \({\varvec{\sigma }}\), or t. We therefore drop the subscript t and simply denote the rescaled latentspace random variable as \(\varvec{{\mathcal {E}}}\). Solving Eq. 4 for \({\varvec{Z}}{\varvec{x}}_t\) and applying it to Eq. 3, the reconstruction error \(L({{\varvec{X}}}\!=\!{{\varvec{x}}}_t)\) can be represented by
which is amenable to backpropagation because the only r.v. in it is \(\varvec{{\mathcal {E}}}\), whose distributional parameters do not depend on the neural network coefficients that we will be varying. In practice, rather than computing the multivariate integral over outcomes of \(\varvec{{\mathcal {E}}}\), \(L({\varvec{X}}\!=\!{{\varvec{x}}}_t)\) is typically approximated by averaging over a limited number J of samples from \(\varvec{{\mathcal {E}}}\),
where \(\langle \rangle _j\) denotes average over \(j \in \{1,\ldots , J\}\) and \(\varvec{\epsilon }_j\) is sample j from \(\varvec{{\mathcal {E}}}\). Following Way and Greene [28], we used a number of samples that is equivalent to the dimension of the transcriptome, i.e., \(J = m\). For the case of \(p=2\) (i.e., L2 norm), minimizing \(L({\varvec{X}}={{\varvec{x}}}_t)\) as defined above is equivalent to maximizing the expectation value of the loglikelihood \(\log (P({\varvec{g}}({{\varvec{Z}}})={\varvec{x}}_t \mid {{\varvec{X}}}={\varvec{x}}_t))\). However, following Way and Greene [28] and consistent with empirical evidence (“L1 loss is better than L2 loss and crossentropy loss for this application” section), for our fivecancer study of the utility of a VAEbased approach for responsetochemotherapy prediction, as well as for the pancancer tSNE analysis (“VAE encoding preserves cancer type features” section), we chose to use L1 reconstruction loss, i.e., \(p=1\) in Eq. 3.
The reconstruction loss measures bias error, whose minimization must be balanced against the simultaneous goal of controlling variance error through regularization. In the VAE, regularization requires incentivizing (in the learning of \({\varvec{\mu }}\), \(\varvec{\sigma }\), and \({\varvec{g}}\)) the latent space distributions of \({\varvec{Z}}{{\varvec{x}}}\) to be close to standard multivariate normal. This is accomplished by assigning a penalty based on the KullbackLeibler divergence between the distribution of \({\varvec{Z}}{{\varvec{x}}}_t\) and the target distribution \(\varvec{{\mathcal {E}}}\), represented by \(D_{\text {KL}}(P({{\varvec{Z}}}{{\varvec{x}}}_t) \,  \, P({\varvec{{\mathcal {E}}}}))\). This regularization is analytically tractable [52], and for a given tumor t yields (Supplementary Equation, Eq. S2) the following regularization function:
where \(\log ({\varvec{\sigma }}_t)\) denotes an elementwise log and \(\;\;_1\) is the L1 norm.
Fitting the VAE to \(\varvec{\mathrm X}\) requires selecting \(\varvec{\mu }\), \(\varvec{\sigma }\), and \({{\varvec{g}}}\) from their respective function spaces; in practice, we search over functions that can be represented using a neural network for \(\varvec{\mu }\) and \(\varvec{\sigma }\) (parameterized by the vector \({\varvec{\theta }}\))^{Footnote 1} and a neural network for the function \({{\varvec{g}}}\) (parameterized by the vector \(\varvec{\phi }\)). Exploring the space of functions \(\varvec{\mu }_{\varvec{\theta }}\), \(\varvec{\sigma }_{\varvec{\theta }}\), and \({\varvec{g}}_{\varvec{\phi }}\) corresponds to computationally searching for the vector pair \((\hat{\varvec{\theta }}, \hat{\varvec{\phi }})\) that together minimize the joint (over all tumors) sum of the tumorspecific reconstruction loss and the regularization penalty,
Applying Eqs. 6, 7, and 8, and setting \(p=1\) as discussed above, we obtain the explicit formula for fitting a VAE to \(\varvec{\mathrm {X}}\),
We implemented Eq. 10 in Tensorflow version 1.4.1 with Keras version 2.1.3 as the modellevel library. We solved Eq. 10 using the Adam optimization algorithm [53] (with batch normalization) from the python package kerasgpu version 2.1.3 with parameters \(\mathtt{learning\_rate}=2\times 10^{3}\), \(\mathtt{beta\_1 = 0.9}\), and \(\mathtt{beta\_2 = 0.999}\), to obtain \((\widehat{\varvec{\theta }}, \widehat{\varvec{\phi }})\). Then, for each tumor t, we used a single sample \({\varvec{Z}}{\varvec{x}}_t = {\varvec{z}}_t\) from the distribution \({\mathcal N}(\varvec{\mu }_{\widehat{\varvec{\theta }}}({\varvec{x}}_t), \text {diag}(\varvec{\sigma }_{\widehat{\varvec{\theta }}}( {\varvec{x}}_t)))\) as the final latentspace encoding of the tumor to be used for supervised learning (“Regularized gradient boosted decision trees (XGBoost)” section).
VAE model architectures
We trained six transcriptomeencoding VAEs based on three VAE architectures, the pancancer VAE architecture (for the 33cancer unsupervised analysis, “VAE encoding preserves cancer type features” section) and three cancer typespecific VAE architectures for responsetochemotherapy prediction (“Chemotherapy response classification results” section) (VAE1 was used for two different cancer types, BRCA and PAAD, VAE2 was used for COAD, and VAE3 was used for two different cancer types, BLCA and SARC). For the pancancer, we used the VAE1 model with a latent space dimension \(h=50\). For the cancer typespecific VAE architectures, we again used the same number of fullyconnected layers in the encoder as in the decoder (Table 5).
Labeling tumors based on response to chemotherapy
From Xena and cBioPortal [54, 55], we obtained and combined TCGA clinical data (where available) for the patients whose tumor transcriptomes we acquired (“Gene expression data” section). From Xena, we extracted the variables \(\mathtt{submitter\_id.samples}\), \(\mathtt{therapy\_type}\), and \(\mathtt{measure\_of\_response}\); from cBioPortal, we extracted the variables \(\mathtt{Sample\_ID}\), \(\mathtt{Disease.Free.Status}\), and \(\mathtt{Pharmaceutical.Therapy.Indicator}\). We coanalyzed the Xena and cBioPortalobtained clinical data to label tumors “responded” (\(y=0\)) or ”progressive” (\(y=1\)), by assigning \(y = 0\) when the clinical record had \(\mathtt{Complete\ response}\) or \(\mathtt{partial\ response}\) in the \(\mathtt{measure\_of\_response}\) column of the clinical data from Xena, or with value \(\mathtt{Disease Free}\) in the \(\mathtt{Disease.Free.Status}\) column of the clinical data from cBioPortal while therapy type is recorded as \(\mathtt{Chemotherapy}\) in both. We assigned \(y = 1\) to tumors whose clinical records had values \(\mathtt{Radiographic\ progressive\ disease}\), \(\mathtt{Clinical\ progressive\ disease}\), or \(\mathtt{stable\ disease}\) in the Xena clinical data column \(\mathtt{measure\_of\_response}\), or had value \(\mathtt{Recurred}\)/ \(\mathtt{progressed}\) in the cBioPortal data column \(\mathtt{Disease.Free.Status}\) while the \(\mathtt{therapy\_type}\) is recorded as \(\mathtt{Chemotherapy}\) in both files. This yielded 806 labeled tumors out of 2,606 total. A total of 39 different drugs were used to treat the 794 patients (Additional file 1: Table S1).
Regularized gradient boosted decision trees (XGBoost)
For predicting whether or not (based on its transcriptomederived featureset: raw, PCA, ICA, or VAE) a tumor would respond to chemotherapy, we used XGBoost [34], an efficient implementation of regularized gradient boosted decision trees. We used the classifier function XGBClassifier from the python software package xgboost version 0.80, with gamma=0. We tuned eight hyperparameters (Table 6) by exhaustive gridsearch with fivefold crossvalidation, using model_selection.GridSearchCV from scikitlearn version 0.19.1. To obtain feature importance scores, we used get_score with importance_type = cover.
Principal component analysis (PCA) and independent component analysis (ICA)
For PCA, we used the function decomposition.PCA (with parameters \(\mathtt{svd\_solver}\!=\!\mathtt{``full''})\) and \(\mathtt{n\_components}\!=\!0.9\) (90% variance, yielding 387 components) from the python package scikitlearn version 0.19.1. For plotting, we used matplotlib version 2.1.2. For ICA, we used the function decomposition.FastICA (with parameters \(\mathtt{n\_components}\!=\!387\) (i.e., the same number of components as used in the PCA method) from the python package scikitlearn version 0.19.1. For plotting, we used matplotlib version 2.1.2.
Support vector machine (SVM)
For predicting whether or not (based on its transcriptomedrived featureset: raw or VAE) a tumor would respond to chemotherapy, we used SVM [56]. We used the classifier function SVC from the python software package sklearn.svm, with gamma = “auto”. We tuned three hyperparameters (Table 7) by exhaustive gridsearch with fivefold crossvalidation, using model_selection.GridSearchCV from scikitlearn version 0.19.1.
Knearest neighbors vote (KNN)
For predicting whether or not (based on its transcriptomedrived featureset: raw or VAE) a tumor would respond to chemotherapy, we used KNN [57], an implementation based on the \({\varvec{k}}\) nearest neighbors of each query point. We used the classifier function neighbors.KNeighborsClassifier from the python software package scikitlearn. We tuned five hyperparameters (Table 8) by exhaustive gridsearch with fivefold crossvalidation, using model_selection.GridSearchCV from scikitlearn version 0.19.1.
Area under ROC curve (AUROC)
For computing the AUROC (i.e., sensitivity versus false positive error rate curve), we used the function metrics.roc_auc_score from the python software package scikitlearn version 0.19.1 with parameter average=“weighted”. We logittransformed AUROC values before testing (using twotailed Welch’s ttest). For the L1 versus L2 analysis (“L1 loss is better than L2 loss and crossentropy loss for this application” section), we carried out 30 replications of fivefold crossvalidation; within each replication, across the five folds, we obtained prediction scores for each tumor from the fold in which the tumor was in the test set, enabling us to compute an overall AUROC within each replication. For each training data set, we carried out 30 replications of fivefold crossvalidation by altering the random seed used for assigning data to folds, during the crossvalidation. We used the same procedure for five different cancer types (BLCA, BRCA, COAD, PAAD, SARC) as shown in the panel names of Additional file 1: Figure S7.
Area under the precisionrecall curve (AUPRC)
For computing the AUPRC, we used the function metrics.precision_recall_curve and metrics.auc from the python software package scikitlearn version 0.19.1. We logittransformed AUPRC values before testing (using twotailed Welch’s ttest). We carried out 30 replications of fivefold crossvalidation; within each replication, across the five folds, we obtained prediction scores for each tumor from the fold in which the tumor was in the test set, enabling us to compute an overall AUPRC within each replication. For each training data set, we have done 30 replications of fivefold crossvalidation by altering the random seed used for assign split of data during crossvalidation. We have conducted the same procedure for five different cancer types (BLCA, BRCA, COAD, PAAD, SARC) as shown in the panel names of Additional file 1: Figure S8.
Availability of data and materials
Software code written for this project “VAE for chemotherapy drug response prediction” are freely available under an opensource license, platform independent, written in Python and R with CUDA and tensorflow installed, at the URL: https://github.com/ATHED/VAE_for_chemotherapy_drug_response_prediction. Supplementary data are available at BMC Bioinformatics online.
Notes
 1.
Note, functions \(\varvec{\mu }\) and \(\varvec{\sigma }\) are just two different outputs of the encoding neural network, differing only at the final layer, and thus for simplicity of notation we represent them as having a common parameter vector \(\varvec{\theta }\).
References
 1.
Airley R. Cancer chemotherapy. New York: WileyBlackwell; 2009.
 2.
Skeel RT. Handbook of cancer chemotherapy. 6th ed. Philadelphia: Lippincott Williams & Wilkins; 2003.
 3.
Chabner BA, Longo DL. Cancer chemotherapy and biotherapy: principles and practice. 4th ed. Philadelphia: Lippincott Willians & Wilkins; 2005.
 4.
Kaestner SA, Sewell GJ. Chemotherapy dosing part I: scientific basis for current practice and use of body surface area. Clin Oncol. 2007;19:23–37. https://0doiorg.brum.beds.ac.uk/10.1016/j.clon.2006.10.010.
 5.
Gurney H. How to calculate the dose of chemotherapy. Br J Cancer. 2002;86:1297–302. https://0doiorg.brum.beds.ac.uk/10.1038/sj.bjc.6600139.
 6.
Corrie PG. Cytotoxic chemotherapy: clinical aspects. Medicine. 2008;36(1):24–8. https://0doiorg.brum.beds.ac.uk/10.1016/j.mpmed.2007.10.012.
 7.
Whelan T, Sawka C, Levine M, Gafni A, Reyno L, Willan A, Julian J, Dent S, AbuZahra H, Chouinard E, Tozer R, Pritchard K, Bodendorfer I. Helping patients make informed choices: a randomized trial of a decision aid for adjuvant chemotherapy in lymph nodenegative breast cancer. JNCI: J Natl Cancer Inst. 2003;95(8):581–7. https://0doiorg.brum.beds.ac.uk/10.1093/jnci/95.8.581.
 8.
Malfuson JV, Etienne A, Turlure P, de Revel T, Thomas X, Contentin N, Terré C, Rigaudeau S, Bordessoule D, Vey N, Gardin C, Dombret H. for the Acute Leukemia French Association (ALFA): risk factors and decision criteria for intensive chemotherapy in older patients with acute myeloid leukemia. Haematologica. 2008;93(12):1806–13. https://0doiorg.brum.beds.ac.uk/10.3324/haematol.13309.
 9.
Chiu YC, Chen HIH, Zhang T, Zhang S, Gorthi A, Wang LJ, Huang Y, Chen Y. Predicting drug response of tumors from integrated genomic profiles by deep neural networks. BMC Med Genom. 2019;12(1):18. https://0doiorg.brum.beds.ac.uk/10.1186/s1292001804609.
 10.
Geeleher P, Cox NJ, Huang RS. Clinical drug response can be predicted using baseline gene expression levels and in vitrodrug sensitivity in cell lines. Genome Biol. 2014;15(3):47. https://0doiorg.brum.beds.ac.uk/10.1186/gb2014153r47.
 11.
Weir B, Zhao X, Meyerson M. Somatic alterations in the human cancer genome. Cancer Cell. 2004;6(5):433–8. https://0doiorg.brum.beds.ac.uk/10.1016/j.ccr.2004.11.004.
 12.
GámezPozo A, TrillaFuertes L, PradoVázquez G, Chiva C, LópezVacas R, Nanni P, BergesSoria J, Grossmann J, DíazAlmirón M, Ciruelos E, Sabidó E, Espinosa E, Fresno VJ. Prediction of adjuvant chemotherapy response in triple negative breast cancer with discovery and targeted proteomics. PLoS ONE. 2017;12:6. https://0doiorg.brum.beds.ac.uk/10.1371/journal.pone.0178296.
 13.
Casado E, García VM, Sánchez JJ, Blanco M, Maurel J, Feliu J, FernándezMartos C, de Castro J, Castelo B, BeldaIniesta C, Sereno M, SánchezLlamas B, Burgos E, Ángel GarcíaCabezas M, Manceñido N, Miquel R, GarcíaOlmo D, GonzálezBarón M, Cejas P. A combined strategy of SAGE and quantitative PCR provides a 13gene signature that predicts preoperative chemoradiotherapy response and outcome in rectal cancer. PLoS ONE. 2011;17:4145–54. https://0doiorg.brum.beds.ac.uk/10.1158/10780432.CCR102257.
 14.
Del Rio M, Molina F, BascoulMollevi C, et al. Gene expression signature in advanced colorectal cancer patients select drugs and response for the use of leucovorin, fluorouracil, and irinotecan. J Clin Oncol. 2007;25(7):773–8. https://0doiorg.brum.beds.ac.uk/10.1200/JCO.2006.07.4187.
 15.
Kurokawa Y, Matoba R, Nagano H, Sakon M, Takemasa I, Nakamori S, Dono K, Umeshita K, Ueno N, Ishii S, Kato K, Monden M. Molecular prediction of response to 5fluorouracil and interferonα combination chemotherapy in advanced hepatocellular carcinoma. AACR. 2004;10(18):6029–38. https://0doiorg.brum.beds.ac.uk/10.1158/10780432.CCR040243.
 16.
Rezaeian I, Eliseos JM, Katherina B, Huy QP, Iman R, Dimo A, Alioune N, Luis R, Peter KR. Predicting outcomes of hormone and chemotherapy in the molecular taxonomy of breast cancer international consortium (METABRIC) study by biochemicallyinspired machine learning. F1000Research. 2017;5:2124. https://0doiorg.brum.beds.ac.uk/10.12688/f1000research.9417.3.
 17.
Hutter C, Zenklusen JC. The cancer genome atlas: creating lasting value beyond its data. Cell. 2018;173(2):283–5.
 18.
Wen H, Huang F. Personal loan fraud detection based on hybrid supervised and unsupervised learning. In: 2020 5th IEEE international conference on big data analytics (ICBDA); 2020. p. 339–343 https://0doiorg.brum.beds.ac.uk/10.1109/ICBDA49040.2020.9101277
 19.
Qin J, Li Y, Liu Q. ICA based semisupervised learning algorithm for BCI systems. In: Rosca J, Erdogmus D, Príncipe JC, Haykin S, editors. Independent component analysis and blind signal separation. Berlin: Springer; 2006. p. 214–21.
 20.
An J, Cho S. Variational autoencoder based anomaly detection using reconstruction probability. Technical Report SNUDMTR201503, Seoul National University. 2015. http://dm.snu.ac.kr/static/docs/TR/SNUDMTR201503.pdf.
 21.
Li X, She J. Collaborative variational autoencoder for recommender systems. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY; 2017. p. 305–314. https://0doiorg.brum.beds.ac.uk/10.1145/3097983.3098077.
 22.
Bouchacourt D, Tomioka R, Nowozin S. Multilevel variational autoencoder: learning disentangled representations from grouped observations. arXiv:1705.08841 2017.
 23.
Kipf TN, Welling M. Variational graph autoencoders. arXiv:1611.07308 2016.
 24.
Kingma DP, Welling M. Autoencoding variational bayes. arxiv:1312.6114 2013.
 25.
Jimenez Rezende D, Mohamed S, Wierstra D. Stochastic backpropagation and approximate inference in deep generative models. arXiv:1401.4082 2014.
 26.
Way GP, Greene CS. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac Symp Biocomput. 2018;23:80–91. https://0doiorg.brum.beds.ac.uk/10.1142/9789813235533_0008.
 27.
Titus AJ, Wilkins OM, Bobak CA, Christensen BC. Unsupervised deep learning with variational autoencoders applied to breast tumor genomewide DNA methylation data with biologic feature extraction. Cold Spring Harbor Laboratory. bioRxiv. 2018. https://0doiorg.brum.beds.ac.uk/10.1101/433763.
 28.
Way GP, Greene CS. Evaluating deep variational autoencoders trained on pancancer gene expression. arXiv:1711.04828 2017.
 29.
George TM, Lio P. Unsupervised machine learning for data encoding applied to ovarian cancer transcriptomes. Cold Spring Harbor Laboratory. bioRxiv. 2019. https://0doiorg.brum.beds.ac.uk/10.1101/855593.
 30.
Dincer AB, Celik S, Hiranuma N, Lee SI. Deepprofile: Deep learning of cancer molecular profiles for precision medicine. bioRxiv. 2018. https://0doiorg.brum.beds.ac.uk/10.1101/278739.
 31.
Theodore S, Konstantinos V, Sonali N, Filippos K, Athanassios K, Alexander P, Tyler JM, et al. A deep learning framework for predicting response to therapy in cancer. Cell Rep. 2019;29(11):3367–33734. https://0doiorg.brum.beds.ac.uk/10.1016/j.celrep.2019.11.017.
 32.
Liu P, Li H, Li S, Leung KS. Improving prediction of phenotypic drug response on cancer cell lines using deep convolutional network. BMC Bioinform. 2019;20(1):408. https://0doiorg.brum.beds.ac.uk/10.1186/s1285901929106.
 33.
Ladislav R, Daniel H, Petr S, Benjamin HK, Anna G. Dr.VAE: improving drug response prediction via modeling of drug perturbation effects. Bioinformatics. 2019;35(19):3743–51. https://0doiorg.brum.beds.ac.uk/10.1093/bioinformatics/btz158.
 34.
Chen T, Guestrin C. XGBoost: A scalable tree boosting system. arXiv:1603.02754 2016.
 35.
Dolezal JM, Dash AP, Prochownik EV. Diagnostic and prognostic implications of ribosomal protein transcript expression patterns in human cancers. BMC Cancer. 2018;18(1):275. https://0doiorg.brum.beds.ac.uk/10.1186/s128850184178z.
 36.
Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologistlevel classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8. https://0doiorg.brum.beds.ac.uk/10.1038/nature21056.
 37.
Mitchel J, Chatlin K, Tong L, Wang, MD. A translational pipeline for overall survival prediction of breast cancer patients by decisionlevel integration of multiomics data. In: 2019 IEEE international conference on bioinformatics and biomedicine (BIBM); 2019. p. 1573–1580. https://0doiorg.brum.beds.ac.uk/10.1109/BIBM47256.2019.8983243
 38.
Zhang Y, Feng T, Wang S, Dong R, Yang J, Su J, Wang B. A novel xgboost method to identify cancer tissueoforigin based on copy number variations. Front Genet. 2020;11:1319. https://0doiorg.brum.beds.ac.uk/10.3389/fgene.2020.585029.
 39.
Lee K, Jeong HO, Lee S, Jeong WK. CPEM: Accurate cancer type classification based on somatic alterations using an ensemble of a random forest and a deep neural network. Sci Rep. 2019;9(1):16927. https://0doiorg.brum.beds.ac.uk/10.1038/s41598019530343.
 40.
Evans C, Hardin J, Stoebel DM. Selecting betweensample RNASeq normalization methods from the perspective of their assumptions. Brief Bioinform. 2017;19(5):776–92. https://0doiorg.brum.beds.ac.uk/10.1093/bib/bbx008.
 41.
Hu Q, Greene CS. Parameter tuning is a key part of dimensionality reduction via deep variational autoencoders for single cell rna transcriptomics. bioRxiv. 2018. https://0doiorg.brum.beds.ac.uk/10.1101/385534.
 42.
Lakhmiri D, Alimo R, Le Digabel S. Tuning a variational autoencoder for data accountability problem in the Mars Science Laboratory ground data system. arxiv:2006.03962 2020.
 43.
Senft D, Leiserson MDM, Ruppin E, Ronai ZA. Precision oncology: the road ahead. Trends Mol Med. 2017;23(10):874–98. https://0doiorg.brum.beds.ac.uk/10.1016/j.molmed.2017.08.003.
 44.
Marchiano EJ, Birkeland AC, Swiecicki PL, SpectorBagdady K, Shuman AG. Revisiting expectations in an era of precision oncology. Oncologist. 2018;23(3):386–8. https://0doiorg.brum.beds.ac.uk/10.1634/theoncologist.20170269.
 45.
Massard C, Michiels S, Ferté C, Le Deley MC, Lacroix L, Hollebecque A, Verlingue L, Ileana E, Rosellini S, Ammari S, NgoCamus M, Bahleda R, Gazzah A, Varga A, PostelVinay S, Loriot Y, Even C, Breuskin I, Auger N, Job B, De Baere T, Deschamps F, Vielh P, Scoazec JY, Lazar V, Richon C, Ribrag V, Deutsch E, Angevin E, Vassal G, Eggermont A, André F, Soria JC. Highthroughput genomics and clinical outcome in hardtotreat advanced cancers: results of the moscato 01 trial. Cancer Discov. 2017;7(6):586–95. https://0doiorg.brum.beds.ac.uk/10.1158/21598290.CD161396.
 46.
Goldman M, Craft B, Hastie M, Repečka K, McDade F, Kamath A, Banerjee A, Luo Y, Rogers D, Brooks AN, Zhu J, Haussler D. The UCSC Xena platform for public and private cancer genomics data visualization and interpretation. Cold Spring Harbor Laboratory. bioRxiv. 2019. https://0doiorg.brum.beds.ac.uk/10.1101/326470.
 47.
Dillies MA, Rau A, Aubert J, HennequetAntier C, Jeanmougin M, Servant N, Keime C, Marot G, Castel D, Estelle J, et al. A comprehensive evaluation of normalization methods for Illumina highthroughput RNA sequencing data analysis. Brief Bioinform. 2013;14(6):671–83. https://0doiorg.brum.beds.ac.uk/10.1093/bib/bbs046.
 48.
R Core Team. R: A language and environment for statistical computing. Vienna: R Foundation; 2013. (ISBN 3900051070).
 49.
Kreyszig E, Kreyszig H, Norminton EJ. Advanced engineering mathematics. 10th ed. Hoboken: Wiley; 2011.
 50.
Kramer MA. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 1991;37(2):233–43. https://0doiorg.brum.beds.ac.uk/10.1002/aic.690370209.
 51.
Zemouri R, Lévesque M, Amyot N, Hudon C, Kokoko O, Tahan SA. Deep convolutional variational autoencoder as a 2dvisualization tool for partial discharge source classification in hydrogenerators. IEEE Access. 2020;8:5438–54. https://0doiorg.brum.beds.ac.uk/10.1109/ACCESS.2019.2962775.
 52.
Duchi J. Derivations for linear algebra and optimization. Technical report, Standford University. 2007. http://web.stanford.edu/~jduchi/projects/general_notes.pdf.
 53.
Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv:1412.6980 2014.
 54.
Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, Jacobsen A, Byrne CJ, Heuer ML, Larsson E, Antipin Y, Reva B, Goldberg AP, Sander C, Schultz N. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2012;2:401. https://0doiorg.brum.beds.ac.uk/10.1158/21598290.CD120095.
 55.
Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, Sun Y, Jacobsen A, Sinha R, Larsson E, Cerami E, Sander C, Schultz N. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6:11. https://0doiorg.brum.beds.ac.uk/10.1126/scisignal.2004088.
 56.
Cortes C, Vapnik V. Supportvector networks. Mach Learn. 1995;20(3):273–97. https://0doiorg.brum.beds.ac.uk/10.1023/A:1022627411411.
 57.
Goldberger J, Roweis S, Hinton G, Salakhutdinov R. Neighbourhood components analysis. In: Proceedings of the 17th international conference on neural information processing systems. NIPS’04. MIT Press, Cambridge, MA, USA; 2004. p. 513–520. https://0doiorg.brum.beds.ac.uk/10.5555/2976040.2976105
Acknowledgements
Not applicable.
Funding
SAR acknowledges support from the Animal Cancer Foundation.
Author information
Affiliations
Contributions
Designed the study: SAR and QW; wrote the software: QW; carried out the computational analyses: QW; processed and analyzed the data: QW and SAR; wrote the article: SAR and QW. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. This supplementary file contains Supplementary Figures S1, S2, S3, S4, S5, S6, S7, and S8, as well as Table S1 and Supplementary Note C.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wei, Q., Ramsey, S.A. Predicting chemotherapy response using a variational autoencoder approach. BMC Bioinformatics 22, 453 (2021). https://0doiorg.brum.beds.ac.uk/10.1186/s12859021043396
Received:
Accepted:
Published:
DOI: https://0doiorg.brum.beds.ac.uk/10.1186/s12859021043396
Keywords
 Variational autoencoder
 Transcriptome
 TCGA
 Chemotherapy drug response classification
 Cancer
 Colon adenocarcinomas
 Pancreatic adenocarcinoma
 Bladder carcinoma
 Sarcoma
 Breast invasive carcinoma