Skip to main content

Similarity-assisted variational autoencoder for nonlinear dimension reduction with application to single-cell RNA sequencing data



Deep generative models naturally become nonlinear dimension reduction tools to visualize large-scale datasets such as single-cell RNA sequencing datasets for revealing latent grouping patterns or identifying outliers. The variational autoencoder (VAE) is a popular deep generative method equipped with encoder/decoder structures. The encoder and decoder are useful when a new sample is mapped to the latent space and a data point is generated from a point in a latent space. However, the VAE tends not to show grouping pattern clearly without additional annotation information. On the other hand, similarity-based dimension reduction methods such as t-SNE or UMAP present clear grouping patterns even though these methods do not have encoder/decoder structures.


To bridge this gap, we propose a new approach that adopts similarity information in the VAE framework. In addition, for biological applications, we extend our approach to a conditional VAE to account for covariate effects in the dimension reduction step. In the simulation study and real single-cell RNA sequencing data analyses, our method shows great performance compared to existing state-of-the-art methods by producing clear grouping structures using an inferred encoder and decoder. Our method also successfully adjusts for covariate effects, resulting in more useful dimension reduction.


Our method is able to produce clearer grouping patterns than those of other regularized VAE methods by utilizing similarity information encoded in the data via the highly celebrated UMAP loss function.

Peer Review reports


Deep generative models [1,2,3] have been naturally utilized for large-scale biological datasets such as RNA-sequencing data [4,5,6,7,8]. The generative models assume lower-dimensional latent random variables and map them to high-dimensional data. Under these models, the inverse mapping of high-dimensional data to lower-dimensional embeddings can be used as a nonlinear dimension reduction tool.

Among the deep generative models, the variational autoencoder (VAE) [9] and the generative adversarial network (GAN) [10] have been widely investigated. In a VAE, probabilistic hierarchical models are used with the encoder/decoder structure. The encoder is an approximation of the posterior distribution, which maps high-dimensional data to the lower-dimensional embedding, and the decoder is the data distribution function that matches the lower-dimensional embedding to the high-dimensional data. In a GAN, generative mapping is inferred by optimizing a min-max problem. GAN in its original form is not equipped with an encoder because its best use is in generating realistic data. Although it can be trained with an auxiliary encoder, as in [11, 12], we do not pursue GAN in this work because our study showed that, compared to GAN, VAE tends to yield a more stable latent representation.

On the other hand, similarity-based dimension reduction tools show excellent performance and have been widely adopted, as evidenced by t-distributed stochastic neighbor embedding (t-SNE) [13] and uniform manifold approximation and projection (UMAP) [14]. These methods present a clear grouping structure, although it is not equipped with the encoder/decoder structure, which makes new sample extensions of the methods challenging.

To bridge this gap, we propose a new method, a similarity-assisted variational autoencoder (saVAE), which adopts similarity information in the framework of the VAE. We pursue this by adding pull-push regularization to the evident lower bound of the likelihood function. Our method produces a powerful latent representation superior to both VAE and similarity-based approaches. Besides this, for single-cell RNA sequencing data applications, a meaningful group pattern can be easily disguised by many other covariates such as donor or batch effects. We thus extend our approach to the conditional VAE (CVAE) to adjust for such covariate effects.

Our contributions are (1) formulating an objective function that combines similarity-based and model-based approaches to promote a type of dimension reduction that best reflects hidden structures in data without informative priors, (2) adjusting for covariates in the dimension reduction step so that the meaningful lower dimensional representation is not disguised by the covariate effects, and (3) comparing our methods to state-of-art dimension reduction methods.


Overview of saVAE

Our saVAE aims at combining the objective functions of VAE and UMAP (see Methods). The VAE training requires estimating two sets of parameters involved with the encoder and decoder. On the other hand, UMAP is a non-parametric method to directly infer individual embeddings. Thus it is not trivial to integrate UMAP into the VAE framework. To resolve this problem, we connect the two objective functions by utilizing the expected UMAP loss function, because the expectation of the UMAP loss is the function of the decoder parameters. Our model is summarized in Fig. 1. There are several important computational challenges in combining these two methods, such as implementing batching schemes and balancing the convergence of the two approaches. These issues are further discussed in the Methods section.

Fig. 1
figure 1

Overview of saVAE. Our saVAE connects the VAE and UMAP objective functions by utilizing the expected UMAP loss function, which yields a dimension reduction result with a more informative grouping structure

When there are external covariates available, they need to be included in the model to remove extra sources of variation. For example, in a single-cell RNA sequencing data analysis, differences in batches or donors may induce extra groupings. To address the covariates, we utilize CVAE, which models covariate effects with an additional multi-layered perceptron (MLP). We then regularize the CVAE with the expected UMAP loss function, called saCVAE.

In summary, our method has the following advantages: (1) saVAE bridges the parametric method with the nonparametric one by computing the posterior expectation of the nonparametric latent variables. (2) By using batches with carefully sampled pairs, saVAE becomes scalable to handle large-scale datasets. (3) Differences in algorithmic convergence are balanced using a regularization parameter and by adopting proportional updates, which removes extra groups in inferred latent structures.

Synthetic data analysis

We first present the synergies obtained by combining VAE and UMAP with a simulated dataset. The dataset is generated following the sklearn API [15], called the two moons dataset. As shown in Fig. 2a, the dataset consists of two interweaving half circles in a two-dimensional space. Each half-circle has 2000 points and becomes a class. The sklearn method sklearn.datasets.make_moons generates the exact same dataset, with \(\text{noise} = 0.05\) and \(\text{random}\_\text{state} = 42\).

The embeddings inferred from VAE, UMAP, and saVAE are presented in Fig. 2d–f, respectively. The VAE learns neither meaningful latent representation nor accurate reconstruction. On the other hand, the UMAP yields a latent representation revealing two half-circles but does not show the interweaving features between them. Our saVAE captures the interweaving half-circles accurately. Unlike UMAP, VAE and saVAE can produce the reconstructed data points via the inferred decoder. The reconstructed points are compared in Fig. 2b, c. The saVAE’s reconstruction is much closer to the original than VAE’s. This simulation study clearly shows the benefits of saVAE, which are that it inherits the structure of VAE while incorporating the power of the UMAP method.

Fig. 2
figure 2

Method comparison using two moons dataset. The saVAE performs better than either VAE or UMAP in terms of finding meaningful latent embeddings and producing accurate reconstruction

MNIST data analysis

The deep clustering via a Gaussian-mixture variational autoencoder with graph embedding (DGG) [16] is perhaps the most similar approach to our saVAE: it inferred the latent representation using the variational deep embedding (VaDE) [17] model with distance regularization. A detailed discussion of the approach is in the Related Works section. We compare the embeddings from DGG and saVAE as well as UMAP to illustrate the differences among the methods.

The MNIST dataset consists of 60k digit images from 0 to 9. Each data point is in the form of a 28 \(\times\) 28 matrix whose entries are values in the set [0,255]. Dividing the whole dataset by the maximum value 255, all values are in [0,1]. We binarize the resulting values to use a Bernoulli model (6).

In Fig. 3, the UMAP in general separates the groups accurately (Fig. 3a), except for two clusters that needed to be further separated. The DGG forms almost perfect groups in the latent space (Fig. 3b), yet the gap between the two clusters (digits 4 and 9) is still slim. The saVAE shows perfect groups in the latent space, including digits 4 and 9 (Fig. 3c).

Fig. 3
figure 3

Embeddings from MNIST dataset. a The UMAP in general separated the groups accurately except for two clusters that needed to be further separated. b The DGG formed almost perfect groups in the latent space, yet the gap between the two clusters (digits 4 and 9) was still slim. c The saVAE showed perfect groups in the latent space including digits 4 and 9

Single-cell RNA sequencing data analysis when there is no covariate

We compare our saVAE to UMAP, scvis [18], single-cell Variational Inference (scVI) [6, 19], and Ivis [20] as these approaches have available code with excellent performance. The following two datasets are used: (1) cortex [21] and (2) peripheral blood mononuclear cells (PBMC) [22]. The cortex dataset is from the cortex cells of mice. The number of cells is 3005 with seven cell types. The PBMC dataset is one of the most commonly used biological datasets. For example, the dataset is used for immunology [23], regenerative medicine [24], and drug discovery [25]. The number of cells is about 12k, and there are nine cell types. We remark that the PBMC dataset is comprised of two batches of PBMCs (4K PBMCs and 8K PBMCs), but we did not use the batch information because its effect was negligible (see Additional file 1: Figure S1). For both datasets, we used 1200 genes selected by using the highest varying genes (HVGs) method [26]. We then compared the inferred embeddings visually as well as quantitatively.

For quantitative evaluation, we compare clustering accuracy after performing clustering on inferred embeddings. The embedding with higher clustering accuracy is accepted as a better one. As clustering methods, we consider Kmeans [27] and DBSCAN [28] methods. The Kmeans method is a widely used clustering algorithm, but it can capture only spherically shaped clusters. To avoid this limitation, DBSCAN is utilized because it is another well-known clustering method capable of capturing non-spherical clusters. We applied both clustering algorithms to inferred latent embeddings and reported the better clustering result. Here, clustering performance is compared using the average of the Adjusted Rand Index (ARI) [29] and Normalized Mutual Information (NMI) [30]. In this way, the clustering comparison does not depend on the choice of clustering method. The DBSCAN heavily depends on its parameters (eps and min sample). Thus, we searched for parameters that yield the highest average of ARI and NMI.

The embeddings using the cortex and the PBMC datasets are presented in Figs. 4 and 5, respectively. In the cortex data, saVAE shows a clear separation of groups. In the PBMC dataset, the embeddings from our saVAE are superior to those from the other two methods in terms of establishing clusters: CD8 T cells are separated from CD4 T cells, and Dendritic cells and FCGR3A+ Monocytes move away from CD14+ Monocytes. The findings from visual inspection are confirmed in Fig. 6, where our saVAE showed higher ARI and NMI values than UMAP, scvis, Ivis, and scVI.

Fig. 4
figure 4

Embeddings from cortex dataset. Compared to other methods, saVAE and scvis show better performance in separating groups

Fig. 5
figure 5

Embeddings from PBMC dataset. The embeddings from our saVAE are superior to those from the other two methods in terms of establishing clusters: CD8 T cells are separated from CD4 T cells, and Dendritic cells and FCGR3A+ Monocytes move away from CD14+ Monocytes

Fig. 6
figure 6

Performance comparison. Our saVAE performs better than UMAP, scvis, Ivis, and scVI. The box plots were created from 10 trials

Single-cell RNA sequencing data analysis when there are covariates

We compare our method to others using single-cell RNA sequencing datasets when covariates are available. The following two datasets are used: (1) the retina [31] and (2) the heart cell atlas [32] datasets. The retina dataset consists of bipolar cells that exist between photoreceptors and ganglion cells. It contains 20k cells with fifteen cell types in two batches. The original heart cell atlas data consists of 486k cells with fourteen donors of age 40–75. We randomly sample 20k cells and use 4 covariates (cell source, donor, percent mito, and percent ribo). The first two covariates are categorical variables with 4 and 14 levels, respectively, and the others are continuous ones.

Fig. 7
figure 7

Embeddings from retina dataset The UMAP fails to separate the cell groups. The scVI shows better separation, but there remain several groups that need to be pushed away. The saCVAE completely separates cell types, e.g., BC3A and BC3B

Fig. 8
figure 8

Embeddings from heart cell atlas dataset. The UMAP fails to separate the cell groups. The scVI groups a part of neuronal cells as endothelial cells, whereas saCVAE correctly separates them

Fig. 9
figure 9

Performance comparison. Our saCVAE formulation effectively eliminates the covariate effects, resulting in a better cell-type clustering performance. The box plots were created from 10 trials

When fitting saCVAE, the CVAE framework is first applied to the training data. The resulting latent embeddings are then used for computing the similarity weights of the UMAP loss function. The details of saCVAE are in the Method section. Among other existing methods, scVI is the only one that adjusts for covariate effects. We thus compare our approach to scVI, not to scvis or Ivis.

Figures 7 and 8 show that embeddings from UMAP, scVI, and saCVAE, where saCVAE tends to group cell types more accurately. In the retina dataset, saCVAE completely separates BC3A and BC3B cells. In the heart cell atlas dataset, scVI groups a part of neuronal cells as endothelial cells, whereas saCVAE correctly separates them. Figure 9 demonstrates this finding with the accuracy measures. Our saCVAE yields much higher ARI and NMI than the other methods.


In this work, we proposed saVAE, a variant of VAE reflecting the similarity of data in the latent space. Moreover, we extended our approach to CVAE to adjust for biological covariate effects. Finally, we illustrated that saVAE is superior to other nonlinear dimension reduction methods through the various applications to synthetic and real-world datasets. However, saVAE heavily relies on the quality of a similarity table, so the distance metric needs to be selected carefully. Furthermore, our saVAE has higher computation complexity (\(O(N^2)\)) than a vanilla VAE (O(N)) as it unifies two different approaches. Here N represents the number of samples. To alleviate this problem, we used computational techniques of the negative sampling and the update ratio balancing. In the case of saCVAE, it is a two-stage method that may severely depend on the performance of the first-stage approach. When the first stage incurs bad errors, the second stage cannot correct that. Our future work will deal with the mentioned challenges.



VAE is an approach for approximating a generative model. A typical generative model uses a latent variable z and an observable variable x, and the probability density function of \(p_{\theta }(x)\) is obtained by \(\int p_{\theta }(x|z) p_{\theta }(z) dz\). Often, the marginal probability distribution involves intractable integration; the variational lower bound is also utilized.

$$\begin{aligned} \log p_{\theta }(x) \ge \mathbb{E}_{q_{\phi }(z|x)}\log \frac{p_{\theta }(z,x)}{q_{\phi }(z|x)} = \mathbb{E}_{q_{\phi }(z|x)} \log p_{\theta }(x|z) - \mathbb{E}_{q_{\phi }(z|x)} \log \frac{q_{\phi }(z|x)}{p_{\theta }(z)} \end{aligned}$$

Here \(q_{\phi }(z|x)\) is a variational distribution that approximates the posterior distribution \(p_{\theta }(z|x)\). This lower bound is called the evidence lower bound (ELBO).

The Monte Carlo approximation [33] is widely adopted to compute the expectation of the loss (1), as in [34]. The sampling process does not interfere with computing the gradient with respect to \(\theta\). However, the gradient computation with respect to \(\phi\) is not feasible when the parameters are integrated with the random samples.

To eliminate this problem, [9] introduced the following reparametrization trick. Assuming \(q_{\phi }(z|x)\) is a form of a Gaussian variable, the variable z can be written as \(z = \varvec{\mu }_{\phi }(x) + \varvec{\sigma }_{\phi }(x) \odot \varvec{\epsilon }\) where \(\varvec{\epsilon } \sim N(0,1)\) and \(\odot\) denote the element-wise product. Now, we obtain

$$\begin{aligned} \nabla _{\phi }\mathbb{E}_{q_{\phi }(z|x)}\log p_{\theta }(x|z)&= \nabla _{\phi }\mathbb{E}_{\varvec{\epsilon } \sim N(0,1)}\log p_{\theta }(x|\varvec{\mu }_{\phi }(x) + \varvec{\sigma }_{\phi }(x) \odot \varvec{\epsilon }) \nonumber \\&= \mathbb{E}_{\varvec{\epsilon } \sim N(0,1)} \nabla _{\phi } \log p_{\theta }(x|\varvec{\mu }_{\phi }(x) + \varvec{\sigma }_{\phi }(x) \odot \varvec{\epsilon }) \end{aligned}$$

The last expectation is still subject to the law of the auxiliary variable \(\varvec{\epsilon }\), so we can apply the Monte Carlo approximation to the Eq. (2). Therefore, in practice, the VAE is trained with the reparametrization trick. For more details, see [9]. Here, \(\varvec{\mu }_{\phi }\) and \(\varvec{\sigma }_{\phi }\) take the form of MLPs, called encoder networks. The likelihood \(p_{\theta }(x|z)\) plays the role of the decoder. The VAE algorithm is shown in Algorithm 1.

figure a

Algorithm 1 Algorithm of VAE


UMAP is one of the most popular dimension reduction algorithms in the data-science field [35,36,37,38]. The algorithm consists of two primary steps: computing the similarity weights among the data and finding lower-dimensional embeddings that best match the computed similarity weights. What follows is a specific description of the UMAP. We denote the k-nearest neighbors by \(\lbrace x_{i_{j}} \rbrace _{j=1}^{k}\) for each observation \(x_{i}\). The distance between observations is transformed to the following weight

$$\begin{aligned} w(x_{i},x_{j}) = {\left\{ \begin{array}{ll} \exp \left( \frac{-\max (0, d(x_{i},x_{j}) - \rho _{i})}{\sigma _{i}}\right) &{} \text{for } j \in \lbrace i_{1},...,i_{k} \rbrace \\ 0 &{} \text{otherwise}, \end{array}\right. } \end{aligned}$$

where \(\rho _{i}\) is the positive minimum distance from \(x_{i}\), and \(\sigma _{i}\) is the normalizing factor. The symmetrized weight \(\mu _{ij} = w(x_{i},x_{j}) + w(x_{j},x_{i}) - w(x_{i},x_{j})w(x_{j},x_{i})\) becomes the target similarity between \(x_{i}\) and \(x_{j}\). In the optimization step, the following objective function is solved

$$\begin{aligned} \min _{\{y_i\}_{i=1}^n} -\sum _{i} \sum _{j \ne i} \left( \mu _{ij}\ln \left( \frac{1}{1+a\Vert y_{i} - y_{j} \Vert ^{2b}}\right) + (1-\mu _{ij})\ln \left( 1-\frac{1}{1+a\Vert y_{i} - y_{j} \Vert ^{2b}}\right) \right) , \end{aligned}$$

where a and b are hyperparameters. The UMAP finds the embedding to best reproduce the target similarities \(\mu _{ij}\).

The optimization is not scalable for large data, because the computational complexity of (3) is proportional to the square of the size of the data due to the double summation. The following sampling scheme [39] is used to relieve the computational problem. For an embedding \(y_{i}\), one neighbor \(y_j\) with \(\mu _{ij}>0\) is sampled. The sample is called a positive sample. Additionally, \(M \ll N\) embeddings are randomly sampled; these are called negative samples, and are used when the double summation of (3) is computed. The use of negative samples reduces the computational cost of the algorithm to O(NM). Further details for the optimization step can be found in [14, 40].


Fig. 10
figure 10

Graphical model of our saVAE. All plain arrows are multi-layered perceptrons, except for the curved one. The curved arrow reflects the similarities among positive samples in the latent space. The dotted arrows represent the process of making the paired samples according to the precalculated similarity table of the observed data. The gray circles are observable; the white ones are not. The variable z is computed using the reparametrization trick [9], that is, \(z = \varvec{\mu } + \varvec{\epsilon } \odot \varvec{\sigma }\), where \(\varvec{\epsilon } \sim N(\varvec{0},I)\). Here, h is a hidden variable, and s is an optional input. See Methods section

Our saVAE approach bridges VAE and UMAP using a regularization framework. Directly combining the VAE objective with the UMAP loss is not feasible for the following three reasons. First, the optimization arguments of VAE and UMAP are not the same because VAE optimizes its MLP parameters, while UMAP finds lower dimensional embeddings. Second, in the stochastic gradient algorithm, VAE uses a single data point, whereas UMAP uses a pair of data points. Third, the algorithmic convergence rates of the two are quite different, requiring careful balancing.

To address the first problem, we make the UMAP loss a function of the MLP parameters shared with the VAE. Specifically, assuming the embeddings are from the variational distribution, the lower dimensional similarities are replaced with their expectations \(q_{\phi }(z|x)\). Our loss function is then as follows:

$$\begin{aligned}{} & {} -\mathbb{E}_{q_{\phi }(z|x)} \log p_{\theta }(x|z) + \mathbb{E}_{q_{\phi }(z|x)} \log \frac{q_{\phi }(z|x)}{p_{\theta }(z)} \nonumber \\{} & {} \quad -\lambda \sum _{i}\sum _{j \ne i} \mu _{ij}\ln (\tilde{\nu }_{ij}) + (1-\mu _{ij})\ln (1-\tilde{\nu }_{ij}), \end{aligned}$$

where \(\tilde{\nu }_{ij} = \left[ \mathbb{E}_{q_{\phi }(z|x_{i})}\mathbb{E}_{q_{\phi }(z'|x_{j})}\frac{1}{1+a\Vert z - z' \Vert ^{2b}} \right]\) and \(\lambda (>0)\) is a weight.

Here, the influence on the parameters of the conditional density function of z|x is informed by the precomputed high dimensional similarities \(\mu _{ij}\). The similarity information then affects the VAE inference when the parameters are shared. The regularizer not only makes a more well-grouped representation of the data but also assists the generator \(p_{\theta }(x|z)\) to reconstruct better with reliable similarity information. If the similarity is not reliable, the regularization hurts the reconstruction performance, and the inference network will correct the variational distribution to modify the embeddings. In either case, the variational distribution is adjusted to make the loss (4) decrease. Hence, the two loss functions will have positive influences on each other during the training. Our model is visualized in Fig. 10.

The second problem is resolved by the following implementation. In the application of the stochastic gradient descent algorithm, the VAE is usually trained using a mini-batch, whereas the UMAP is based on sampling (especially pairs). So there is no natural way to implement them simultaneously. To resolve this problem, for each data point \(x_{i}\) in a mini-batch, we sample \(P_{i}\) positive data \((x_{i}^{1},...,x_{i}^{P_{i}})\), where \(P_i\) is the number of positive samples, according to the precalculated similarities with \(x_{i}\); we then concatenate \(x_{i}\) with \(x_{i}^{j}\) so that we obtain a paired dataset \(\lbrace (x_{i},x_{i}^{j})\rbrace _{j=1}^{P_{i}}\), which can be used as input for the UMAP. In this case, the VAE inference is made using the unpacked data by treating \(\lbrace x_{i},x_{i}^{1},x_{i},x_{i}^{2},...,x_{i},x_{i}^{P_{i}}\rbrace\) as a mini-batch. In contrast to the UMAP, the negative samples of \(x_{i}\) are randomly taken from the augmented mini-batch \(\lbrace x_{i}^{j}\rbrace _{i,j}\).

The third problem is solved by the selection of an appropriate \(\lambda\) in (4) and the introduction of parameter update frequencies. Specifically, we select \(\lambda\) as the largest value that decreases the VAE loss. Motivated by [41], we further introduce the pre-defined number I, which has the role of controlling the update frequencies. That is, the UMAP loss is updated I times per update of the VAE loss. This value strikes a balance between VAE and UMAP, and reduces the training time. With these two hyperparameters \(\lambda\) and I, the convergence rates of the two losses become similar. Our saVAE algorithm is presented in Algorithm 2.

The specific probability density functions adopted in our method are as follows.

$$\begin{aligned} p_{\theta }(z)&= N(\varvec{0},I), \end{aligned}$$
$$\begin{aligned} p_{\theta }(x|z)&= {\left\{ \begin{array}{ll} \text{Bernoulli}(\varvec{\mu }_{\theta }(z)) &{} \text{or} \\ N(\varvec{\mu }_{\theta }(z),I) &{} \text{or} \\ \text{NegativeBinomial}(\varvec{\mu }_{\theta }(z), \varvec{r}) \end{array}\right. } \end{aligned}$$
$$\begin{aligned} q_{\phi }(z|x)&= N(z; \varvec{\mu }_{\phi }(x),\text{diag}[\varvec{\sigma }^{2}_{\phi }(x)]) \end{aligned}$$

where the bold symbols denote the vector or pre-defined parameters and the sub-indexed symbols are the neural network parameters. The distribution of \(p_{\theta }(x|z)\) is dependent on the datatypes, which include Bernoulli, Gaussian, and Negative Binomial distributions. When the prior and variational distributions are assumed to be the Gaussian distribution of (5) and (7), the negative KL divergence can be computed as follows:

$$\begin{aligned} -D_{\text{KL}}(p_{\theta }(z) \parallel q_{\phi }(z|x)) = \frac{1}{2}\sum _{j=1}^{J} \big [ 1 + \log (\varvec{\sigma }^{2}_{\phi }(x))_{j} - (\varvec{\mu }^{2}_{\phi }(x))_{j} - (\varvec{\sigma }^{2}_{\phi }(x))_{j} \big ] \end{aligned}$$

where J is the dimension of the latent space and \((\cdot )_{j}\) is the jth component.

figure b

Algorithm 2 Algorithm of saVAE


In biological applications including single-cell RNA sequencing data analyses, to find a meaningful latent representation, it is important to adjust for additional covariates. For example, hundreds of cells are collected from a donor, and accounting for the donor-to-donor variations is crucial in the dimension reduction step. Other sources of variability include laboratory conditions, individual characteristics, and instruments used in experiments. These batch-specific effects need to be adjusted to yield an informative latent representation.

Fig. 11
figure 11

Embeddings from retina dataset. VAE and CVAE are applied to the retina dataset. From a and c, we find that the retina dataset is subject to a covariate effect, which causes additional groupings beyond the cell types. As shown in b and d, after the correction with CVAE, the covariate effect is well resolved. In c and d, the labels 0 and 1 represent the batch ID

The CVAE [42] accounts for such covariate effects. Specifically, for the provided covariate s, the loss function of the CVAE is given by

$$\begin{aligned} -\mathbb{E}_{q_{\phi }(z|x,s)} \log p_{\theta }(x|z,s) + \mathbb{E}_{q_{\phi }(z|x,s)} \log \frac{q_{\phi }(z|x,s)}{p_{\theta }(z)}. \end{aligned}$$

The notation |s means that the given covariate s, the inputs of the neural network is concatenated into s. If it is categorical, we transformed it into a one-hot encoded vector before inputting the network. After training, a fixed value \(s_0\) is used for a covariate shift; in other words, a sample from the distribution \(q_{\phi }(z|x, s_0)\) is used as a corrected sample. For example, in Fig. 11, we present the latent representations without batch correction using single-cell RNA sequencing data obtained from the retina. We then use CVAE to estimate the conditional distribution for the covariate shift and visualize the latent representations after adjusting for the covariate effects. The additional separation disappears, suggesting that the grouping comes from the covariate differences. From our experiments across diverse datasets, we found that the CVAE successfully eliminates variabilities from the covariates. We also found that the batch correction is more effective when we use \(q_{\phi }(z|x)\) in (8) instead of \(q_{\phi }(z|x, s_0)\). It appears that covariate shift constantly occurs during the training of \(q_{\phi }(z|x)\). In our paper, we used \(q_{\phi }(z|x)\) to adjust for covariate effects.

Replacing the VAE loss function with CVAE loss alone does not solve the covariate problem in our framework, because the similarity weights in the UMAP loss function are still subject to covariate effects. To solve this problem, we propose a two-stage approach in which similarity weights are computed via the covariate-adjusted embeddings from the first round CVAE and, in the second step, the CVAE with the UMAP loss function is trained. Our final objective function of the saCVAE is given as follows:

$$\begin{aligned}{} & {} -\mathbb{E}_{q_{\phi }(z|x)} \log p_{\theta }(x|z,s) + \mathbb{E}_{q_{\phi }(z|x)} \log \frac{q_{\phi }(z|x)}{p_{\theta }(z)} \nonumber \\{} & {} -\lambda \sum _{i}\sum _{j \ne i} \tilde{\mu }_{ij}\ln (\tilde{\nu }_{ij}) + (1- \tilde{\mu }_{ij})\ln (1-\tilde{\nu }_{ij}), \end{aligned}$$

where \(\tilde{\mu }_{ij}\) is the weight computed from the covariate-adjusted embeddings and \(\tilde{\nu }_{ij}\) and \(\lambda\) are defined in the previous section.

Related works

To the best of our knowledge, the first attempt at bridging the VAE to similarity-based dimension reduction was variational manifold probabilistic linear discriminant analysis (vm-PLDA) [43]. However, the probability model for the data was limited to linear factor models that cannot be easily extended to other discrete datasets. Later, scvis [18] used the objective of non-linear dimensional reduction to regularize the latent space of VAEs. More precisely, it captured underlying low-dimensional structures in a given dataset by imposing the t-SNE objective on the latent space of the VAEs to achieve a more compressed representation of the data. Because this method employs parametric mapping from the high-dimensional space to a low-dimensional embedding, it is possible to naturally add new data points to a learned latent space. On the other hand, this method still inherits the limitations of t-SNE. In addition, [16] proposed deep clustering via a Gaussian-mixture variational autoencoder with graph embedding (DGG); this method inferred the latent representation using the VaDE model with distance regularization. The authors selected the Jenson-Shannon (JS) divergence as a distance between two density functions. The computation, however, was not feasible, so they relaxed it to its upper bound to promote computational efficiency. Additionally, they trained a Siamese network [44, 45] to better learn the similarities among the observed data. This led to robust and excellent performance. One problem in DGG is that it is not equipped with a controlling parameter because the upper bound is derived under only one particular set-up.

In scRNA-seq analysis, VAE-based models have successfully performed many tasks, such as imputation, batch correction, clustering, dimension reduction, and prediction [4,5,6, 8, 18, 46,47,48]. Dhaka [46] is one of the standard VAE-based methods for scRNA data, specializing in tumor cell identification. In scVAE [5], the variational approximation is improved by using a mixture of Gaussian distributions. Yet the number of mixture components needs to be optimally selected, which may be challenging in actual application. VASC [4] introduced a zero-inflated layer into VAE for modeling dropout events in scRNA data. Empirically, however, the estimates exhibited large variabilities. In netAE [47], a VAE-based cell-type prediction model was developed, which utilizes the modularity measures. The approach requires a well trained classifier to guarantee its efficiency. A specialized method, bmVAE [48], is proposed for genotype prediction by integrating VAE, GMM, and Gibbs sampling. Its actual application requires special attention because doublets or triplets may appear in tumor data. Finally, scVI [6] is a versatile tool for various tasks and commonly used as a baseline model in [49]. As a result, performance in each task requires data-specific adjustments.

Transcriptome data preprocessing

Because these transcriptomes have many features (or genes), the outlier removal and feature selection were conducted via the HVGs method [26]. We used 1200 screened features (see Additional file 1: section B for more discussions). We use the Negative Binomial distribution (6) because it is a common assumption of single-cell RNA-sequencing modeling. As a variational distribution, we used \(q_{\phi }(z|x) = N(z; \varvec{\mu }_{\phi }(\log (1+x)),\text{diag}[\varvec{\sigma }^{2}_{\phi }(\log (1+x))])\)). We found that this particular choice produced the best results throughout the experiments. For the retina and heart cell atlas, some covariate information is available. We used the CVAE to adjust for covariates, as described in the saCVAE section.

Implementation details

Table 1 Hypermarameters used in the model

In the implementation, the multi-layered perceptrons used in the VAE consist of three fully connected layers with ReLU [50] activation. The multi-layered perceptrons in the CVAE have only one fully-connected layer with ReLU, BatchNormalization [51], and Dropout [52]. The expectation in Equation (4) was computed via the Monte Carlo method with a single sample [33], and the network weights were trained using the Adam optimizer [53] with the learning rate of 0.001 and the batch size of 128. The parameters in UMAP are the number of nearest neighbors and the min-dist, which are specified in the lower-dimensional similarity functions. We set \(K=30\) and \(\text{ min-dist } = 0\), which showed the best performance in our experiments. Table 1 summarizes the hyperparameters used in our experiments.

For real-world datasets, our model was compared to UMAPFootnote 1 [14], DGGFootnote 2 [16], scvisFootnote 3 [18], IvisFootnote 4 [20], and scVIFootnote 5[6, 19]. These baseline models were utilized to find latent representations of the various datasets.

The MNIST dataset is different from other single-cell RNA sequencing datasets in its likelihood specification. The available DGG code works only on the MNIST dataset (because DGG exploits a pre-trained VaDE and the author provided only the trained parameters of the MNIST without the training code), while the scVI code cannot be used for the MNIST dataset. When not available, those methods are not included in the comparison. Because the retina and heart cell atlas datasets are subject to covariate effects, we compared our methods with scVI because it is the only other method that is equipped with covariate adjustment. Lastly, the implementation of the scvis was too outdated to be modified: to alleviate the computational burden, we used PCA to reduce the dimension of the features to 100. We also fixed the degree of freedom of the t-distribution at one; the t-distribution was used as the similarity criterion of the scvis method.

For a fair comparison, all methods were trained to learn ten-dimensional latent embeddings. We set the hyperparameters of the UMAP to be the same as those of saVAE by setting \(K = 30\) and \(\text{ min-dist } = 0\). scVI supports many implementation options. We set the likelihood and library size availability at NB and False, respectively. The rest of the baseline methods’ hyperparameters were set according to the suggested default numbers.

To visualize the latent embeddings in the 2d plots, we ran UMAP again with the default parameters (\(K = 15\) and \(\text{ min-dist } = 0.1\)). However, we found that the spectral embedding [54], suggested as a default initialization of UMAP, was not suitable for this type of dimension reduction. On the other hand, PCA [55] shows fast convergence of UMAP. We thus used PCA initialization for our implementation.

Availability of data and materials

The datasets and source code are available at


  1. The implementation is available at







Variational autoencoder


T-distributed stochastic neighbor embedding


Uniform manifold approximation and projection


Conditional VAE


Generative adversarial network


Similarity-assisted variational autoencoder


Similarity-assisted conditional variational autoencoder


Multi-layered perceptron


Peripheral blood mononuclear cell


Adjusted Rand Index


Normalized mutual information


  1. Salakhutdinov R. Learning deep generative models. Annu Rev Stat Appl. 2015;2:361–85.

    Article  Google Scholar 

  2. Bond-Taylor S, Leach A, Long Y, Willcocks CG. Deep generative modelling: a comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. IEEE Trans Pattern Anal Mach Intell. 2022;44:7327–47.

    Article  PubMed  Google Scholar 

  3. Gm H, Gourisaria MK, Pandey M, Rautaray SS. A comprehensive survey and analysis of generative models in machine learning. Comput Sci Rev. 2020;38: 100285.

    Article  Google Scholar 

  4. Wang D, Gu J. VASC: dimension reduction and visualization of single-cell RNA-seq data by deep variational autoencoder. Genom Proteom Bioinform. 2018;16:320–31.

    Article  Google Scholar 

  5. Grønbech CH, Vording MF, Timshel PN, Sønderby CK, Pers TH, Winther O. scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics. 2020;36:4415–22.

    Article  PubMed  Google Scholar 

  6. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nat Methods. 2018;15:1053–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Way GP, Greene CS. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. In: Biocomputing. World Scientific. 2018; pp. 80–91.

  8. Simidjievski N, Bodnar C, Tariq I, Scherer P, Andres Terre H, Shams Z, Jamnik M, Liò P. Variational autoencoders for cancer data integration: design principles and computational practice. Front Genet. 2019;10:1205.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Kingma DP, Welling M. Auto-encoding variational Bayes. In: Bengio Y, LeCun Y, editors. 2nd international conference on learning representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings. 2014.

  10. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ, editors. Advances in neural information processing systems. 2014; vol. 27.

  11. Donahue J, Krähenbühl P, Darrell T. Adversarial feature learning. In: 5th international conference on learning representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. 2017.

  12. Mukherjee S, Asnani H, Lin E, Kannan S. ClusterGAN: latent space clustering in generative adversarial networks. Proc AAAI Conf Artif Intell. 2019;33:4610–7.

    Google Scholar 

  13. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9:2579–605.

    Google Scholar 

  14. McInnes L, Healy J, Saul N, Grossberger L. UMAP: uniform manifold approximation and projection. J Open Source Softw. 2018;3:861.

    Article  Google Scholar 

  15. Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, Niculae V, Prettenhofer P, Gramfort A, Grobler J, Layton R, VanderPlas J, Joly A, Holt B, Varoquaux G. API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD workshop: languages for data mining and machine learning. 2013; pp. 108–122.

  16. Yang L, Cheung NM, Li J, Fang J. Deep Clustering by Gaussian Mixture Variational Autoencoders With Graph Embedding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2019.

  17. Jiang Z, Zheng Y, Tan H, Tang B, Zhou H. Variational deep embedding: an unsupervised and generative approach to clustering. In: Proceedings of the 26th international joint conference on artificial intelligence. 2017; pp. 1965–1972.

  18. Ding J, Condon A, Shah SP. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun. 2018;9:1–13.

    Article  Google Scholar 

  19. ...Gayoso A, Lopez R, Xing G, Boyeau P, Valiollah Pour Amiri V, Hong J, Wu K, Jayasuriya M, Mehlman E, Langevin M, Liu Y, Samaran J, Misrachi G, Nazaret A, Clivio O, Xu C, Ashuach T, Gabitto M, Lotfollahi M, Svensson V, da Veiga Beltrame E, Kleshchevnikov V, Talavera-López C, Pachter L, Theis FJ, Streets A, Jordan MI, Regier J, Yosef N. A python library for probabilistic analysis of single-cell omics data. Nat Biotechnol. 2022;40:163–6.

    Article  CAS  PubMed  Google Scholar 

  20. Szubert B, Cole JE, Monaco C, Drozdov I. Structure-preserving visualisation of high dimensional single-cell datasets. Sci Rep. 2019;9:1–10.

    Article  CAS  Google Scholar 

  21. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, Manno GL, Juréus A, Marques S, Munguba H, He L, Betsholtz C, Rolny C, Castelo-Branco G, Hjerling-Leffler J, Linnarsson S. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347:1138–42.

    Article  CAS  PubMed  Google Scholar 

  22. Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):1–12.

    Article  Google Scholar 

  23. Tuller T, Atar S, Ruppin E, Gurevich M, Achiron A. Common and specific signatures of gene expression and protein–protein interactions in autoimmune diseases. Genes Immunity. 2013;14(2):67–82.

    Article  CAS  PubMed  Google Scholar 

  24. Beer L, Mildner M, Gyöngyösi M, Ankersmit HJ. Peripheral blood mononuclear cell secretome for tissue repair. Apoptosis. 2016;21:1336–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Eriksson A, Österroos A, Hassan S, Gullbo J, Rickardson L, Jarvius M, Nygren P, Fryknäs M, Höglund M, Larsson R. Drug screen in patient cells suggests quinacrine to be repositioned for treatment of acute myeloid leukemia. Blood Cancer J. 2015;5:307–307.

    Article  Google Scholar 

  26. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive integration of single-cell data. Cell. 2019;177:1888–190221.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28:129–37.

    Article  Google Scholar 

  28. Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining. 1996; pp. 226–231

  29. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2:193–218.

    Article  Google Scholar 

  30. Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th annual international conference on machine learning. 2009; pp. 1073–1080.

  31. Shekhar K, Lapan SW, Whitney IE, Tran NM, Macosko EZ, Kowalczyk M, Adiconis X, Levin JZ, Nemesh J, Goldman M, McCarroll SA, Cepko CL, Regev A, Sanes JR. Comprehensive classification of retinal bipolar neurons by single-cell transcriptomics. Cell. 2016;166:1308–132330.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Litviňuková M, Talavera-López C, Maatz H, Reichart D, Worth CL, Lindberg EL, Kanda M, Polanski K, Heinig M, Lee M. Cells of the adult human heart. Nature. 2020;588:466–72.

    Article  PubMed  PubMed Central  Google Scholar 

  33. Metropolis N, Ulam S. The Monte Carlo method. J Am Stat Assoc. 1949;44:335–41.

    Article  CAS  PubMed  Google Scholar 

  34. Spall JC. Introduction to stochastic search and optimization: estimation, simulation, and control. New York: Wiley; 2005.

    Google Scholar 

  35. Dorrity MW, Saunders LM, Queitsch C, Fields S, Trapnell C. Dimensionality reduction by umap to visualize physical and genetic interactions. Nat Commun. 2020;11:1–6.

    Article  Google Scholar 

  36. Allaoui M, Kherfi ML, Cheriet A. Considerably improving clustering algorithms using UMAP dimensionality reduction technique: a comparative study. In: El Moataz A, Mammass D, Mansouri A, Nouboud F, editors. Image and signal processing. Springer. 2020; pp. 317–325.

  37. Vermeulen M, Smith K, Eremin K, Rayner G, Walton M. Application of uniform manifold approximation and projection (UMAP) in spectral imaging of artworks. Spectrochim Acta Part A Mol Biomol Spectrosc. 2021;252: 119547.

    Article  CAS  Google Scholar 

  38. Milošević D, Medeiros AS, Stojković Piperac M, Cvijanović D, Soininen J, Milosavljević A, Predić B. The application of uniform manifold approximation and projection (UMAP) for unconstrained ordination and classification of biological indicators in aquatic ecology. Sci Total Environ. 2022;815: 152365.

    Article  PubMed  Google Scholar 

  39. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Burges CJ, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in neural information processing systems, 2013; vol. 26.

  40. Ghojogh B, Ghodsi A, Karray F, Crowley M. Uniform manifold approximation and projection (UMAP) and its variants: tutorial and survey. arXiv preprint. 2021. arXiv:2109.02508

  41. Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: Precup D, Teh YW, editors. Proceedings of the 34th international conference on machine learning. 2017; vol. 70, pp. 214–223.

  42. Sohn K, Lee H, Yan X. Learning structured output representation using deep conditional generative models. In: Cortes C, Lawrence N, Lee D, Sugiyama M, Garnett R, editors. Advances in neural information processing systems. 2015; vol. 28.

  43. Chien JT, Hsu CW. Variational manifold learning for speaker recognition. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). 2017; pp. 4935–4939.

  44. Shaham U, Stanton KP, Li H, Basri R, Nadler B, Kluger Y. SpectralNet: spectral clustering using deep neural networks. In: ICLR (Poster). 2018.

  45. Hadsell R, Chopra S, LeCun Y. Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06). 2006; pp. 1735–1742.

  46. Rashid S, Shah S, Bar-Joseph Z, Pandya R. Dhaka: variational autoencoder for unmasking tumor heterogeneity from single cell genomic data. Bioinformatics. 2019;37(11):1535–43.

    Article  Google Scholar 

  47. Dong Z, Alterovitz G. netAE: semi-supervised dimensionality reduction of single-cell RNA sequencing to facilitate cell labeling. Bioinformatics. 2020;37(1):43–9.

    Article  Google Scholar 

  48. Yan J, Ma M, Yu Z. bmVAE: a variational autoencoder method for clustering single-cell mutation data. Bioinformatics. 2022;39(1):790.

    Article  Google Scholar 

  49. Flores M, Liu Z, Zhang T, Hasib MM, Chiu Y-C, Ye Z, Paniagua K, Jo S, Zhang J, Gao S-J, Jin Y-F, Chen Y, Huang Y. Deep learning tackles single-cell analysis: a survey of deep learning for scRNA-seq analysis. Brief Bioinform. 2021;23(1):531.

    Article  Google Scholar 

  50. Agarap AF. Deep learning using rectified linear units (relu). 2018. arXiv preprint arXiv:1803.08375

  51. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd international conference on machine learning. 2015; pp. 448–456.

  52. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(56):1929–58.

    Google Scholar 

  53. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y, editors. 3rd International conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. 2015. arXiv: 1412.6980

  54. Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 2003;15:1373–96.

    Article  Google Scholar 

  55. FRS, KP. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2. 1901; 559–572.

Download references


Not applicable.


This research was supported by Samsung Electronics (G01220537) and by a National Research Foundation of Korea (NRF) grant funded by the Korean Government (NRF-2022M3J6A1063021).

Author information

Authors and Affiliations



GK developed the model, performed the experiments, and wrote the manuscript. HC conceived the study and wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hyonho Chun.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

. The PBMC dataset has only negligible batch effects. When varying the HVG selection criteria to 900, 1200, 1500, and 3000, the performance of our method was not significantly affected on any dataset. Finally, a sensitivity analysis of the hyperparameters shows that although other hyperparameter values provide better results, the values we chose also yield strong performance.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kim, G., Chun, H. Similarity-assisted variational autoencoder for nonlinear dimension reduction with application to single-cell RNA sequencing data. BMC Bioinformatics 24, 432 (2023).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: