 Research
 Open access
Cauchy hypergraph Laplacian non-negative matrix factorization for single-cell RNA-sequencing data analysis
BMC Bioinformatics volume 25, Article number: 169 (2024)
Abstract
Many important biological findings have emerged as single-cell RNA sequencing (scRNA-seq) technology has advanced. With this technology, it is now possible to investigate the connections among individual cells, genes, and diseases. Clustering is frequently used for the analysis of single-cell data. Nevertheless, biological data usually contain a large amount of noise, and traditional clustering methods are sensitive to noise; moreover, such methods fail to acquire the higher-order spatial information in the data. As a result, obtaining trustworthy clustering results is challenging. We propose Cauchy hypergraph Laplacian non-negative matrix factorization (CHLNMF) as a novel approach to address these issues. In CHLNMF, we replace the Euclidean-distance-based measurement in conventional non-negative matrix factorization (NMF) with the Cauchy loss function (CLF), which lessens the influence of noise. The model also incorporates a hypergraph constraint, which takes into account the high-order links among samples. The optimal solution of the CHLNMF model is then found using a half-quadratic optimization approach. Finally, on seven scRNA-seq datasets, we compare the CHLNMF technique with nine other leading methods. Analysis of the experimental outcomes establishes the validity of our technique.
Introduction
Studying individual cells reveals complex biochemical processes [1, 2]. Processing scRNA-seq data presents distinct computational challenges owing to the high dimensionality of the data, the presence of disruptions, and technological quirks [3, 4]. Non-negative matrix factorization (NMF) is a commonly used technique to reduce dimensionality and extract features from single-cell RNA sequencing (scRNA-seq) data [5, 6]. Conventional NMF techniques, however, may not accurately represent the intrinsic structure and interconnections present in the data [7, 8]. Cauchy hypergraph Laplacian non-negative matrix factorization (CHLNMF) combines hypergraph Laplacian regularization with a Cauchy-distribution-based loss to improve the robustness and interpretability of scRNA-seq data analysis [9]. With the advancement of scRNA-seq technology in recent years, a vast amount of scRNA-seq data has been generated [10]. Researchers [11] delve into the wealth of biological insights inherent in scRNA-seq data by scrutinizing cell information and uncovering heterogeneity among cells, thereby offering valuable insights into the relationships between cells, genes, and diseases [12].
Clustering is a common method for analyzing gene expression data [13]. Traditional clustering techniques include K-means and spectral clustering (SC), among others [14, 15]. The efficacy of conventional clustering approaches is significantly impacted by the high dimensionality, high noise, and high sparsity of scRNA-seq data. Numerous innovative single-cell clustering techniques have therefore been put forth by researchers [16, 17]. As an illustration, Lu, Wang, Liu, Zheng and Kong [18] introduced SinNLRR, an enhanced low-rank representation (LRR) approach that adds non-negative constraints to the LRR model. To determine how closely related cells are, this approach maps the data into the multiple subspaces to which the data are assumed to belong. A multi-kernel learning approach dubbed SIMLR was proposed by Guo, Wang, Hong, Li, Yang and Du [19]. The key concept of this approach is the adaptive selection of several kernel functions to measure different data sets, which ensures broad applicability. To combine several basic partitions into a consensus partition that is as consistent as feasible with the basic partitions, Liu, Zhao, Fang, Cheng, Fu and Liu [20] introduced a technique known as entropy-based consensus clustering (ECC). This strategy also successfully addresses the high noise and high dimensionality of high-throughput sequencing data. Using variance analysis, Bhattacharjee and Mitra [21] created the Corr clustering technique. This algorithm's benefit is that it can quickly ascertain the number of clusters, which helps it recognize cell types more accurately.
The aforementioned strategies cannot lessen the influence of noise while preserving the higher-order spatial structure of the original data. Owing to the large dimensionality, reducing the dimension of the data before clustering is a typical practice [22]. As a reliable approach for reducing the dimensionality of data, non-negative matrix factorization (NMF) is frequently employed in data analysis tasks [23]. NMF is a traditional dimension reduction technique that has been used in a wide variety of applications [24,25,26].
We created the Cauchy hypergraph Laplacian non-negative matrix factorization technique (CHLNMF) for single-cell data clustering to overcome the issues raised above. To lessen the effect of noise, CHLNMF substitutes the Cauchy loss function (CLF) for the Euclidean distance in conventional NMF. To maintain the higher-order manifold structure of the original data, a hypergraph regularization term is also included in the model. The decomposed coefficient matrix is then clustered using the K-means method, following the investigations of Liu, Cao, Gao, Yu and Liang [27].
This study suggests a fresh approach for processing and analyzing single-cell datasets, named CHLNMF. First, we replace the Euclidean distance used in the original NMF model with the CLF, which reduces the impact of noise and improves the stability of the model. Second, CHLNMF includes a hypergraph regularization term to maintain the original data's manifold structure. The non-convex optimization problem is transformed into an iteratively reweighted problem using the half-quadratic (HQ) optimization approach, and efficient iterative updating rules for the proposed model are derived. To test the viability of the CHLNMF approach, we ran extensive experiments on scRNA-seq data sets. Experimental findings demonstrate that our strategy outperforms other methods in terms of overall performance.
Materials and methods
Nonnegative matrix factorization
High-dimensional data may be handled with NMF [15]. Let \({\mathbf{X}} \in {\mathbb{R}}^{m \times n}\) be a non-negative matrix, where \(m\) and \(n\) denote the number of genes and samples, respectively. The goal of NMF is to identify two non-negative matrices \({\mathbf{U}} \in {\mathbb{R}}^{m \times k}\) and \({\mathbf{V}} \in {\mathbb{R}}^{k \times n}\) that meet two requirements [16]. The first requirement is that \(k\) must be much smaller than \(\min (m,n)\). The second requirement is that the product of these two matrices comes close to matching the matrix \({\mathbf{X}}\). The objective function of NMF is as follows:

\(\mathop {\min }\limits_{{\mathbf{U}} \ge 0,{\mathbf{V}} \ge 0} \left\| {{\mathbf{X}} - {\mathbf{UV}}} \right\|_{F}^{2}\)

where \(\left\| \cdot \right\|_{F}\) denotes the Frobenius norm. The updating rules are as below [28]:

\(u_{ik} \leftarrow u_{ik} \frac{{({\mathbf{XV}}^{T} )_{ik} }}{{({\mathbf{UVV}}^{T} )_{ik} }},\quad v_{kj} \leftarrow v_{kj} \frac{{({\mathbf{U}}^{T} {\mathbf{X}})_{kj} }}{{({\mathbf{U}}^{T} {\mathbf{UV}})_{kj} }}\)
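The multiplicative updating scheme above can be sketched in a few lines of NumPy. This is a minimal illustration of standard NMF (Lee–Seung updates), not the authors' implementation; the small constant `eps` is added only to avoid division by zero.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Standard NMF via multiplicative updates, minimizing
    ||X - U V||_F^2 subject to U >= 0 and V >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))
    V = rng.random((k, n))
    for _ in range(n_iter):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # update basis matrix U
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # update coefficient matrix V
    return U, V

# toy example: 50 "genes" x 30 "samples"
X = np.abs(np.random.default_rng(1).random((50, 30)))
U, V = nmf(X, k=5)
err = np.linalg.norm(X - U @ V, "fro") / np.linalg.norm(X, "fro")
```

Because the updates only multiply by non-negative ratios, `U` and `V` stay non-negative throughout, which is the defining property of NMF.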
Cauchy loss function
Noise is prevalent in real-world data, and it is for the most part complex and of unknown distribution. Therefore, effectively overcoming the impact of noise is crucial when analyzing data. The CLF is a robust loss function that has been used for face recognition and image clustering. In addition to improving the model's resilience to non-Gaussian noise and outliers, the CLF effectively slows the growth of the loss caused by noise and outliers. According to [17], the Cauchy loss function is as follows:

\(L_{c} (x) = \ln \left( {1 + \frac{{x^{2} }}{{c^{2} }}} \right)\)

where the parameter \(c\) controls the size of the Cauchy loss function's upward opening; in other words, the larger \(c\) is, the faster the slope of the function tends to zero.
It is easy to see that the CLF is the natural logarithm of a quadratic function. Owing to the nature of the logarithm, as the independent variable increases, the slope of the function gets closer and closer to zero. Therefore, when the independent variable becomes large, the Cauchy function slows down the growth of the function value, which mitigates the impact of noise.
The graph of the CLF is explored in Fig. 1: as the independent variable grows, the function value of the \(L_{2}\) norm tends to infinity. When the independent variable is large, even a tiny fluctuation may change the \(L_{2}\) function value considerably. In the Cauchy function, by comparison, the growth of the function value is restrained. Therefore, using the CLF to replace the Euclidean-distance-based measurement in the standard NMF model helps increase the stability of the method.
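The contrast between the two losses is easy to verify numerically. The sketch below (an illustration, not part of the original method) evaluates the Cauchy loss \(\ln(1 + x^2/c^2)\) against the squared loss for increasingly large residuals:

```python
import numpy as np

def cauchy_loss(r, c=1.0):
    # ln(1 + r^2 / c^2): grows logarithmically, so large residuals
    # (outliers, noise) contribute far less than under the squared loss
    return np.log(1.0 + (r / c) ** 2)

residuals = np.array([0.1, 1.0, 10.0, 100.0])
sq = residuals ** 2          # squared (L2-style) loss: 0.01 ... 10000
cl = cauchy_loss(residuals)  # Cauchy loss: stays below ~10 even at r = 100
```

A residual of 100 costs 10000 under the squared loss but only about \(\ln(10001) \approx 9.2\) under the Cauchy loss, which is exactly the damping effect described above.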
Hypergraph regularization
Hypergraphs and simple graphs have both similarities and differences [29, 30]. They differ in that an edge of a hypergraph can link more than two nodes, which allows the hypergraph to take into account the complex structure of the original data. As a result, the higher-order spatial structure of the original data can be preserved via the hypergraph constraint.
A hypergraph consists of a non-empty vertex set, a non-empty hyperedge set, and a hyperedge weight matrix. Typically, a hypergraph is expressed as \({\mathbf{G = (V,E,W)}}\), where \({\mathbf{V}} = \{ v_{j} \mid j = 1,2,...,m\}\) is the non-empty vertex set, \({\mathbf{E}} = \{ e_{i} \mid i = 1,2,...,n\}\) is the non-empty hyperedge set, and \({\mathbf{W}}\) is the weight matrix of the hyperedges. Each hyperedge \(e_{i}\) is a subset of the vertex set and includes many vertices \(v_{j}\). Figure 2 illustrates the hypergraph's structural layout.
In the schematic diagram of the hypergraph, vertices are data points and each vertex lies in one or more hyperedges; a vertex may, for instance, belong to hyperedge \(e_{3}\) and another hyperedge at the same time. Each hyperedge in turn contains multiple vertices; for example, hyperedge \(e_{2}\) contains three vertices, namely \(v_{4}\), \(v_{5}\) and \(v_{6}\). In other words, a hyperedge is a subset of the vertex set \({\mathbf{V}}\). Based on these basic concepts, hypergraphs have a series of related definitions.
We draw a hypergraph and give each hyperedge an initial weight. First, an affinity matrix \({\mathbf{A}}\) is defined as \({\mathbf{A}}_{ij} = \exp \left( { - \left\| {v_{i} - v_{j} } \right\|^{2} /\sigma^{2} } \right)\), in which \(\sigma\) is the average distance between vertices. The initial weight of each hyperedge may then be defined as follows:
Usually, an incidence matrix \({\mathbf{H}}(v,e)\) is used to show the relationships between vertices and hyperedges. \({\mathbf{H}}\) is defined as follows:
Adding the weights of all hyperedges incident on the same vertex \(v_{j} \in {\mathbf{V}}\) gives the vertex's degree. The degree of a hyperedge is the number of vertices that belong to it.
Given a diagonal matrix \({\mathbf{D}}_{v}\) whose elements are the vertex degrees, and a diagonal matrix \({\mathbf{D}}_{e}\) whose elements are the hyperedge degrees, the unnormalized hypergraph Laplacian matrix is known from the literature [18] to be \({\mathbf{L}}_{hyper} = {\mathbf{D}}_{v} - {\mathbf{S}}\), where \({\mathbf{S}} = {\mathbf{HWD}}_{e}^{ - 1} {\mathbf{H}}^{T}\).
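The construction of \({\mathbf{L}}_{hyper}\) can be sketched end to end. In the sketch below, each sample spawns one hyperedge containing itself and its \(k\) nearest neighbours, and each hyperedge's weight is the sum of the affinities among its members; both choices are common conventions assumed here for illustration, since the paper's exact weighting formula is not reproduced in this excerpt.

```python
import numpy as np

def hypergraph_laplacian(X, k=3):
    """Unnormalized hypergraph Laplacian L = D_v - H W D_e^{-1} H^T.
    Columns of X are samples (vertices); hyperedges come from k-NN."""
    n = X.shape[1]
    # pairwise Euclidean distances between sample columns
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    sigma = D[D > 0].mean()                  # average distance between vertices
    A = np.exp(-D**2 / sigma**2)             # affinity matrix A_ij
    H = np.zeros((n, n))                     # incidence: H[v, e] = 1 if v in edge e
    for e in range(n):
        nbrs = np.argsort(D[e])[:k + 1]      # vertex e plus its k nearest neighbours
        H[nbrs, e] = 1
    # hyperedge weight: total affinity among its member vertices (assumed form)
    w = np.array([A[H[:, e] > 0][:, H[:, e] > 0].sum() for e in range(n)])
    Dv = np.diag(H @ w)                      # vertex degrees: sum of incident edge weights
    De = np.diag(H.sum(axis=0))              # hyperedge degrees: member counts
    S = H @ np.diag(w) @ np.linalg.inv(De) @ H.T
    return Dv - S

X = np.random.default_rng(0).random((20, 15))  # 20 genes x 15 samples
L = hypergraph_laplacian(X)
```

By construction `L` is symmetric with zero row sums, the same structural properties a simple-graph Laplacian has.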
Objective function of CHLNMF
NMF is an efficient dimension-reduction technique that has been successfully used in several sectors. Real-world data, however, typically contain many outliers and much noise, and non-Gaussian noise and outliers can affect typical NMF. Moreover, NMF cannot learn the high-dimensional manifold structure of the original data.
An approach dubbed CHLNMF is suggested as a solution to the aforementioned problems. In particular, the CLF is used instead of the traditional Euclidean distance to measure the error. The CLF significantly mitigates the impact of data noise, which makes the model more resilient. At the same time, a hypergraph constraint term is added to the CHLNMF model to preserve the manifold structure in high-dimensional space. In conclusion, the objective function \(O_{CHLNMF}\) of CHLNMF is as below:

\(O_{CHLNMF} = \sum\limits_{j = 1}^{n} {\ln \left( {1 + \frac{{\left\| {x_{j} - {\mathbf{U}}v_{j} } \right\|^{2} }}{{c^{2} }}} \right)} + \alpha Tr({\mathbf{VL}}_{hyper} {\mathbf{V}}^{T} )\)
where \(c\) is a parameter which controls the rate at which the slope goes to zero, \(\alpha\) is the hypergraph regularization parameter, and \(Tr( \cdot )\) is the trace of the matrix. Our model framework is shown in Fig. 3.
Optimization and updating rule of CHLNMF
It is challenging to directly find the optimal solution of the CHLNMF model since its objective function is non-convex. We therefore use half-quadratic programming theory to solve the objective function \(O_{CHLNMF}\). The primary concept is to add an auxiliary variable and change the objective function into an augmented objective function. According to half-quadratic programming theory [31], the following problem is equivalent to the objective function in Eq. (9):
where \(\omega_{j}\) is an auxiliary variable and \(\theta (\omega_{j} )\) is the conjugate function of the Cauchy loss. Three variables need to be optimized in this problem; it can therefore be solved by alternating iterative updates.

(1)
Fix \(\omega\), and solve for \({\mathbf{U}}\) and \({\mathbf{V}}\):
With \(\omega\) fixed, minimizing Eq. (10) yields the following problem:
To solve this problem, the constraints \({\mathbf{U}} \ge 0\) and \({\mathbf{V}} \ge 0\) are handled through two Lagrange multipliers \({{\varvec{\uppsi}}} = \left[ {\psi_{ik} } \right]\) and \({\mathbf{\varphi }} = \left[ {\varphi_{kj} } \right]\), respectively. We then obtain the Lagrange function, which is shown as follows:
where \(\Lambda = diag(\omega ).\)
Taking the partial derivatives of the function \(L\) with respect to \({\mathbf{U}}\) and \({\mathbf{V}}\), respectively, gives:
According to the Karush–Kuhn–Tucker (KKT) conditions, let \({\mathbf{\psi U}} = 0\) and \({\mathbf{\varphi V}} = 0\). The updating rules are as below [32]:

(2)
Fix \({\mathbf{U}}\) and \({\mathbf{V}}\), and solve for \(\omega\):
With \({\mathbf{U}}\) and \({\mathbf{V}}\) fixed, Eq. (11) reduces to the following problem:
The optimal solution to this problem has a closed form, which is as follows:
In conclusion, the detailed process of the CHLNMF algorithm is shown in Algorithm 1:
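A compact NumPy sketch of this alternating scheme is shown below. The half-quadratic weight formula (a per-sample weight that shrinks as the Cauchy residual grows) and the weighted multiplicative updates with the hypergraph terms split between numerator and denominator are plausible reconstructions of the scheme described above, not the paper's exact equations, which are elided in this excerpt.

```python
import numpy as np

def chlnmf(X, k, L_hyper, alpha=0.1, c=0.5, n_iter=100, eps=1e-10, seed=0):
    """Illustrative CHLNMF sketch: half-quadratic reweighting for the
    Cauchy loss plus hypergraph-regularized multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k))
    V = rng.random((k, n))
    Dv = np.diag(np.diag(L_hyper))   # recover D_v and S from L_hyper = D_v - S
    S = Dv - L_hyper
    for _ in range(n_iter):
        # HQ step: per-sample weights shrink for large (noisy) residuals
        r2 = ((X - U @ V) ** 2).sum(axis=0)
        w = c**2 / (c**2 + r2)           # assumed closed-form auxiliary variable
        Lam = np.diag(w)
        # weighted multiplicative updates; hypergraph term split so that
        # the negative part (-S) lands in the numerator, keeping V >= 0
        U *= (X @ Lam @ V.T) / (U @ V @ Lam @ V.T + eps)
        V *= (U.T @ X @ Lam + alpha * V @ S) / \
             (U.T @ U @ V @ Lam + alpha * V @ Dv + eps)
    return U, V

# usage on a toy ring-structured Laplacian (12 samples, 8 genes)
n = 12
S_ring = np.zeros((n, n))
idx = np.arange(n)
S_ring[idx, (idx + 1) % n] = 1.0
S_ring = S_ring + S_ring.T
L_h = np.diag(S_ring.sum(axis=1)) - S_ring
X = np.abs(np.random.default_rng(2).random((8, n)))
U, V = chlnmf(X, k=3, L_hyper=L_h)
```

The coefficient matrix `V` returned here is what would subsequently be fed to K-means for clustering.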
Data sets
The data sets can be downloaded from NCBI (https://www.ncbi.nlm.nih.gov/) and EMBL-EBI (http://www.ebi.ac.uk/arrayexpress/), and include Pollen [33], Grover [34], Deng [35], Darmanis [36], Goolam [37], Treutlein [38], and Ting [39]. The details of the seven scRNA-seq data sets are summarized in Table 1.
Evaluation metrics
In the experiments, we utilize NMI and ARI as evaluation metrics. The NMI is defined as:
where \(M( \cdot , \cdot )\) and \(IE( \cdot )\) denote the mutual information and the information entropy, respectively. \(Q = \{ Q_{1} ,Q_{2} ,...,Q_{k} \}\) and \(J = \{ J_{1} ,J_{2} ,...,J_{k} \}\) represent the actual cell clusters and the predicted labels, respectively.
The ARI is defined as:
where \(d_{ij}\) represents the number of cells shared by \(Q_{i}\) and \(J_{j}\), and \(o_{i}\) and \(k_{j}\) denote the numbers of cells in clusters \(Q_{i}\) and \(J_{j}\), respectively.
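The ARI can be computed directly from the contingency table between the true and predicted partitions. The following self-contained implementation is a standard formulation (pair counts adjusted for chance), provided here as an illustration of the metric rather than the authors' evaluation code:

```python
import numpy as np

def ari(labels_true, labels_pred):
    """Adjusted Rand Index from the contingency table (pure NumPy)."""
    t = np.unique(labels_true, return_inverse=True)[1]
    p = np.unique(labels_pred, return_inverse=True)[1]
    C = np.zeros((t.max() + 1, p.max() + 1))
    for i, j in zip(t, p):
        C[i, j] += 1                      # d_ij: cells shared by cluster pair
    comb = lambda x: x * (x - 1) / 2.0    # number of pairs in a group of x
    sum_ij = comb(C).sum()
    a = comb(C.sum(axis=1)).sum()         # pairs within true clusters
    b = comb(C.sum(axis=0)).sum()         # pairs within predicted clusters
    n = comb(len(t))
    expected = a * b / n                  # chance-level agreement
    return (sum_ij - expected) / (0.5 * (a + b) - expected)
```

A perfect clustering (even with permuted labels) scores 1, while partitions that agree no better than chance score at or below 0, which is why ARI ranges over [−1, 1] rather than [0, 1].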
Model convergence analysis
To ensure comparability in our numerical studies, we standardized all algorithms by implementing a learning rate of 0.01 and a convergence threshold of 100 iterations. In addition, we used a random bootstrap method and applied a dimensionality reduction method, such as PCA, before clustering. This was done to ensure a fair comparison across all algorithms. The CHLNMF model used stochastic gradient descent (SGD) with a learning rate of 0.001 for optimization. It incorporated hyperparameters such as five clusters or hypergraphs, regularization parameters λ1 = 0.1 and λ2 = 0.01, and parameters α = 0.5 and β = 0.1 for the Cauchy loss function. The goal of developing the CHLNMF model for processing single-cell RNA sequencing data was to provide accurate clustering results and efficient dimensionality reduction. The selection of these properties was based on preliminary tests and theoretical considerations. We verified the convergence of the CHLNMF model through experiments; as shown in Fig. 4, the error value converges to a certain range within five iterations, demonstrating that our algorithm converges rapidly.
Results and discussion
Parameters setting
In the CHLNMF model, two parameters need to be determined: the hypergraph regularization parameter \(\alpha\) and the scale factor \(c\) of the CLF. To verify the impact of these parameters on the model, we carried out corresponding experiments, whose results are as follows.
For the scale parameter \(c\), we took eight values in the range of 0.01 to 5 to verify its impact on the seven scRNA-seq data sets, with ARI as the evaluation index. The experimental findings are displayed in Fig. 5. As the figure makes clear, the model shows strong robustness to the parameter \(c\) and depends on it only weakly. Therefore, the parameter \(c\) is set to 0.5 in subsequent experiments.
For the hypergraph regularization parameter \(\alpha\), its size affects the degree to which the higher-order spatial structure is learned. In the experiment, \(\alpha\) is chosen from \(\{ 10^{t} \mid t = - 5, - 4,...,4,5\}\). The outcomes of the experiment are displayed in Fig. 6. The parameter significantly affects the model's performance on the majority of data sets. When the parameter is set to \(10^{1}\), the model performs better on all data sets. Therefore, the parameter is set to \(10^{1}\) in subsequent experiments.
Clustering results analysis
To demonstrate the efficacy of the CHLNMF approach, we ran it on seven human or mouse scRNA-seq datasets. We used SinNLRR [40], ECC [20], Corr [41], SIMLR [42], SC [43], SSC [44], K-means [45], PCA [46], and t-SNE [47] as comparison methods. The factor matrices were obtained after dimensionality reduction of the original data by the CHLNMF model, and K-means clustering was performed on the coefficient matrix \({\mathbf{V}}\). Except for the Corr method, the number of cell populations is known to each method during the clustering process. NMI values range from 0 to 1, while ARI values lie between −1 and 1; performance improves as the index value increases. Tables 2, 3, and 4 display the clustering results, from which we may infer the following findings:

1.
The CHLNMF technique is an enhanced iteration of the NMF approach that can efficiently reduce the dimension of single-cell RNA sequencing data, identify various cell types using the coefficient matrix produced after processing, and discover cell heterogeneity. The experimental findings for NMI and ARI in Tables 2 and 3 demonstrate that the low-rank subspace model performs very well in classifying various cell types. However, because CHLNMF takes into account the effects of noise and the manifold structure in high-dimensional data, it performs better overall than the SinNLRR technique. The PCA, CHLNMF, and SinNLRR methods all decompose the data matrix, but the eigendecomposition in principal component analysis is obtained after mean-centering, which makes it insensitive to cell heterogeneity; its performance is therefore worse than that of CHLNMF and SinNLRR.

2.
Tables 2 and 3 provide the figures needed to further compare the performance of the CHLNMF model against the K-means technique. The CHLNMF model consistently produces better normalized mutual information (NMI) values (from 0.9142 to 0.9632) than the K-means model (from 0.7174 to 0.8813), and this advantage holds across all data sets. CHLNMF outperforms K-means in terms of NMI values, with an average increase of about 11% to 14%. Adjusted Rand Index (ARI) values for CHLNMF range from approximately 0.805 to 0.9501, while those for K-means range from 0.3453 to 0.8567; this difference is consistent across all datasets. CHLNMF consistently achieves ARI values that are around 15% to 30% higher than those of K-means, indicating a considerable improvement in both clustering accuracy and agreement with the real labels. The results are consistent with previous research [48, 49] that has shown the limitations of using K-means and other traditional clustering techniques on high-dimensional, noisy scRNA-seq data. Previous research [50] has emphasized the importance of clustering algorithms' ability to withstand and filter out noise in order to accurately represent the intrinsic biodiversity found in single-cell datasets. The superior performance of the CHLNMF model indicates the effectiveness of techniques such as hypergraph regularization and the Cauchy loss function, in line with the goals of previous efforts [51] aimed at improving the precision and reliability of clustering in single-cell transcriptome analysis. Given the challenges of working with complex and noisy single-cell datasets, our model contributes to ongoing efforts to develop advanced computational methods for analyzing scRNA-seq data.
The superior performance of the CHLNMF model demonstrates its potential as a robust method for gaining meaningful insights from scRNA-seq data in many biological scenarios and for solving complex problems, as seen earlier in multiple studies [52, 53].

3.
Different clustering methods demonstrate varying levels of performance when applied to scRNA-seq data. Basic techniques such as K-means, t-SNE, and SSC provide satisfactory performance, with average ARI scores ranging from approximately 0.805 to 0.9592. More intricate techniques designed to capture complex data structures and relationships between cells, such as SIMLR and Corr, achieve average ARI scores of 0.9415 and 0.9803, respectively. Although basic clustering methods are straightforward, they still achieve competitive ARI scores, making them suitable for analyzing scRNA-seq data; at the same time, the comparison shows that not every modification of a traditional method results in improved performance. Researchers must therefore carefully evaluate the suitability of clustering algorithms based on the distinct characteristics and goals of their scRNA-seq datasets.

4.
Tables 2 and 3 show that our technique outperforms the other methods on both the NMI and ARI indexes on the Pollen, Grover, Deng, and Darmanis data sets. On the remaining three datasets, it outperforms the majority of techniques as well. Table 4 presents a summary of the performance of the different clustering algorithms on the seven datasets, indicating that CHLNMF performs better than the other methods. The integration of the Cauchy loss function and the preservation of the manifold structure using hypergraphs prove effective in improving the understanding of cell properties. CHLNMF shows the highest level of agreement between real and predicted clusters, as seen from its superior average ARI and NMI values compared to the other investigated methods. A sensitivity of 0.85 and a specificity of 0.72 for CHLNMF indicate that the method correctly identifies 85% of positive instances and 72% of negative instances, highlighting its strength in capturing diverse data patterns. Examining the specificity and sensitivity alongside the ARI and NMI therefore provides a comprehensive assessment that highlights each tool's ability to reliably identify positive and negative instances. This illustrates the potential benefits of CHLNMF in effectively capturing complex data structures and emphasizes the need to use multiple evaluation metrics to gain a better understanding of the performance of clustering methods.
Gene markers prioritization result analysis
The prioritization of gene markers has always been a focus of attention. Cell gene markers carry much unknown biological information that is very helpful for distinguishing cell subpopulations and discovering the complexity of cells [54, 55]. In our experiment, the original data are first processed by the CHLNMF model to obtain the coefficient matrix \({\mathbf{V}}\). A similarity matrix is then created from the learned coefficient matrix using Pearson's coefficient. Following that, we utilize the Laplacian Score to choose differentially expressed genes on the similarity matrix. The Laplacian Score builds a nearest-neighbor graph and combines the original gene expression matrix with the similarity matrix to determine each gene's score; a gene's importance is inversely related to its score. The genes were ranked by score and the top ten marker genes were selected, as depicted in Fig. 7.
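The gene-ranking step above can be illustrated with a minimal Laplacian Score computation. This is the standard He et al. formulation, given a precomputed sample–sample similarity matrix; the toy similarity matrix and gene profiles in the usage portion are hypothetical, not from the paper's data.

```python
import numpy as np

def laplacian_score(F, W):
    """Laplacian Score per gene.
    F: genes x samples expression matrix; W: sample-sample similarity.
    Smaller scores mark genes that better respect local sample structure."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                # graph Laplacian of the similarity
    d = np.diag(D)
    scores = []
    for f in F:
        f_t = f - (f @ d) / d.sum()          # remove the degree-weighted mean
        scores.append((f_t @ L @ f_t) / (f_t @ D @ f_t + 1e-12))
    return np.array(scores)

# toy usage: 10 samples in two groups, one informative gene vs one noise gene
W = np.zeros((10, 10))
W[:5, :5] = 1.0
W[5:, 5:] = 1.0
F = np.vstack([np.r_[np.zeros(5), np.ones(5)],                 # tracks the groups
               np.random.default_rng(3).random(10)])           # pure noise
s = laplacian_score(F, W)
```

The group-tracking gene is constant within each similarity block, so its score is (near) zero, while the noise gene scores higher; ranking by ascending score therefore surfaces structure-respecting marker genes first.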
In GSEA analysis, ADAMTS6 is enriched in cancer-related pathways, such as the VEGF pathway, which regulates angiogenesis. Inhibition of the vascular endothelial growth factor signaling pathway has been demonstrated to impede blood vessel formation, preventing the development and propagation of tumors, as earlier researchers investigated [12, 56]. This indicates that ADAMTS6 is closely related to endothelial cells [57]. AACS is an enzyme that uses ketone bodies to supply cholesterol synthesis [58]. Prenatal nicotine exposure hypermethylates the DNA of the AACS promoter in the rat fetal adrenal; these modifications may lower AACS expression and the cholesterol supply, which would impede the fetal adrenal gland's ability to produce steroids. The other genes are nevertheless interesting to research even though their precise roles are as yet unknown. The study of these genes may be given greater focus in subsequent work, which may lead to the discovery of more useful information.
Conclusions
The fast advancement of scRNA-seq technology has led to the discovery of an increasing amount of important single-cell data, which is very helpful for our understanding of single cells but also presents several obstacles. Single-cell data contain much noise and many outliers, which pose challenging issues for the analysis procedure. In this study, we propose a novel approach to analyzing single-cell data, called CHLNMF, which introduces the Cauchy loss function into the NMF model to replace the squared loss of the basic model. Adding the Cauchy loss function lessens the effect of noise and increases the method's robustness, while including the hypergraph lets the model retain more spatial information and enhances the algorithm's performance. On seven scRNA-seq data sets, our experiments compare the CHLNMF model with nine sophisticated scRNA-seq data processing models, and the findings demonstrate that the CHLNMF model performs more comprehensively. Although the CHLNMF model has good performance, many problems remain for us to study. We need to find loss functions with better robustness to improve the performance of the model and uncover more valuable information. Hyperparameter choices have a clear impact on the performance of the CHLNMF model, so it may be important to fine-tune or optimize the hyperparameters for certain datasets or research objectives. It is also crucial to examine whether the model can handle larger datasets or other types of data, since processing time and computing resources may impose limitations. Additionally, due to the non-negativity assumption in the CHLNMF model, it may fail to capture some complex data structures or intercellular interactions; caution is therefore warranted when interpreting clustering results obtained from this model. Finally, it remains uncertain whether the CHLNMF model can be applied with confidence to other biological scenarios and experimental conditions.
Despite certain limitations, our work establishes a foundation for future research to enhance and broaden the capabilities of the CHLNMF model for processing scRNAseq data.
Future directions
In future work, in addition to solving the above problems, we will continue to study new single-cell analysis methods. Interpreting the large amount of information in scRNA-seq data is the direction and driving force of our future work.
Availability of data and materials
The datasets Pollen, Grover, Deng, Darmanis, Treutlein, and Ting have been submitted to the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/) under accession numbers SRP041736, GSE70657, GSE45719, GSE67835, GSE52583, and GSE51372, respectively. The dataset Goolam has been submitted to the European Bioinformatics Institute (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-3321.
References
Dickinson DJ, Schwager F, Pintard L, Gotta M, Goldstein B. A singlecell biochemistry approach reveals PAR complex dynamics during cell polarization. Dev Cell. 2017;42(4):416â€“34.
Hwang B, Lee JH, Bang D. Singlecell RNA sequencing technologies and bioinformatics pipelines. Exp Mol Med. 2018;50(8):1â€“14.
Flores M, Liu Z, Zhang T, Hasib MM, Chiu YC, Ye Z, Huang Y. Deep learning tackles singlecell analysisâ€”a survey of deep learning for scRNAseq analysis. Brief Bioinform. 2022;23(1):bbab531.
Fan J, Slowikowski K, Zhang F. Singlecell transcriptomics in cancer: computational challenges and opportunities. Exp Mol Med. 2020;52(9):1452â€“65.
Wang CY, Gao YL, Kong XZ, Liu JX, Zheng CH. Unsupervised cluster analysis and gene marker extraction of scRNAseq data based on nonnegative matrix factorization. IEEE J Biomed Health Inf. 2021;26(1):458â€“67.
Hozumi Y, Wei GW. Analyzing single cell RNA sequencing with topological nonnegative matrix factorization. J Comput Appl Sci. 2024;5:115842.
He C, Fei X, Cheng Q, Li H, Hu Z, Tang Y. A survey of community detection in complex networks using nonnegative matrix factorization. IEEE Trans Comput Soc Syst. 2021;9(2):440–57.
Chen G, Xu C, Wang J, Feng J. Robust nonnegative matrix factorization for link prediction in complex networks using manifold regularization and sparse learning. Physica A Stat Mech Appl. 2020;539:122882.
Zhang W, Xue X, Zheng X, Fan Z. NMF-LRR: clustering scRNA-seq data by integrating nonnegative matrix factorization with low-rank representation. IEEE J Biomed Health Inf. 2021;26(3):1394–405.
Jovic D, Liang X, Zeng H, Lin L, Xu F, Luo Y. Single-cell RNA sequencing technologies and applications: a brief overview. Clin Transl Med. 2022;12(3):e694.
Al-Janahi AA, Danielsen M, Dunbar CE. An introduction to the analysis of single-cell RNA-sequencing data. Mol Ther Methods Clin Dev. 2018;10:189–96.
Zafar I, Anwar S, Yousaf W, Nisa FU, Kausar T, ul Ain Q, Sharma R. Reviewing methods of deep learning for intelligent healthcare systems in genomics and biomedicine. Biomed Signal Process Control. 2023;86:105263.
Qi R, Ma A, Ma Q, Zou Q. Clustering and classification methods for single-cell RNA-sequencing data. Brief Bioinform. 2020;21(4):1196–208.
Hicham N, Karim S. Analysis of unsupervised machine learning techniques for an efficient customer segmentation using clustering ensemble and spectral clustering. Int J Adv Comput Sci Appl. 2022;13(10):25.
Ali S, Noreen A, Qamar A, Zafar I, Ain Q, Nafidi HA, Sharma R. Amomum subulatum: a treasure trove of anticancer compounds targeting TP53 protein using in vitro and in silico techniques. Front Chem. 2023;11:1174363.
Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA. 2023;29(5):517–30.
Adil A, Kumar V, Jan AT, Asger M. Single-cell transcriptomics: current methods and challenges in data acquisition and analysis. Front Neurosci. 2021;15:591122.
Lu C, Wang J, Liu J, Zheng C, Kong X, Zhang X. Nonnegative symmetric low-rank representation graph regularized method for cancer clustering based on score function. Front Genet. 2020;10:1353.
Guo W, Wang Z, Hong S, Li D, Yang H, Du W. Multi-kernel support vector data description with boundary information. Eng Appl Artif Intell. 2021;102:104254.
Liu H, Zhao R, Fang H, Cheng F, Fu Y, Liu YY. Entropy-based consensus clustering for patient stratification. Bioinformatics. 2017;33(17):2691–8.
Bhattacharjee P, Mitra P. A survey of density-based clustering algorithms. Front Comput Sci. 2021;15:1–27.
Jia W, Sun M, Lian J, Hou S. Feature dimensionality reduction: a review. Complex Intell Syst. 2022;8(3):2663–93.
Nebgen BT, Vangara R, Hombrados-Herrera MA, Kuksova S, Alexandrov BS. A neural network for determination of latent dimensionality in nonnegative matrix factorization. Mach Learn Sci Technol. 2021;2(2):025012.
Ray P, Reddy SS, Banerjee T. Various dimension reduction techniques for high dimensional data analysis: a review. Artif Intell Rev. 2021;54(5):3473–515.
Peng X, Xu D, Chen D. Robust distribution-based nonnegative matrix factorizations for dimensionality reduction. Inf Sci. 2021;552:244–60.
Xia J, Zhang Y, Song J, Chen Y, Wang Y, Liu S. Revisiting dimensionality reduction techniques for visual cluster analysis: an empirical study. IEEE Trans Vis Comput Graph. 2021;28(1):529–39.
Liu J, Cao F, Gao XZ, Yu L, Liang J. A cluster-weighted kernel k-means method for multi-view clustering. pp. 4860–7.
Lee DD, Seung HS. Learning the parts of objects by nonnegative matrix factorization. Nature. 1999;401(6755):788–91.
Ye J, Jin Z. Hypergraph regularized discriminative concept factorization for data representation. Soft Comput. 2018;22(13):4417–29.
Leng CC, Zhang H, Cai GR, Cheng I, Basu A. Graph regularized Lp smooth nonnegative matrix factorization for data representation. IEEE/CAA J Autom Sin. 2019;6(2):584–95.
He R, Zheng WS, Tan TN, Sun ZA. Half-quadratic-based iterative minimization for robust sparse representation. IEEE Trans Pattern Anal Mach Intell. 2014;36(2):261–75.
Birbil SI, Frenk JBG, Still GJ. An elementary proof of the Fritz-John and Karush-Kuhn-Tucker conditions in nonlinear programming. Eur J Oper Res. 2007;180(1):479–84.
Pollen AA, Nowakowski TJ, Shuga J, Wang X, Leyrat AA, Lui JH, Li N, Szpankowski L, Fowler B, Chen P, Ramalingam N, Sun G, Thu M, Norris M, Lebofsky R, Toppani D, Kemp DW, Wong M, Clerkson B, Jones BN, Wu S, Knutsson L, Alvarado B, Wang J, Weaver LS, May AP, Jones RC, Unger MA, Kriegstein AR, West JAA. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nat Biotechnol. 2014;32(10):1053–8.
Grover A, Sanjuan-Pla A, Thongjuea S, Carrelha J, Giustacchini A, Gambardella A, Macaulay I, Mancini E, Luis TC, Mead A. Single-cell RNA sequencing reveals molecular and functional platelet bias of aged haematopoietic stem cells. Nat Commun. 2016;7:11075.
Deng Q, Ramskold D, Reinius B, Sandberg R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014;343(6167):193–6.
Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, Gephart MGH, Barres BA, Quake SR. A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci USA. 2015;112(23):7285–90.
Goolam M, Scialdone A, Graham SJL, Macaulay IC, Jedrusik A, Hupalowska A, Voet T, Marioni JC, Zernicka-Goetz M. Heterogeneity in Oct4 and Sox2 targets biases cell fate in 4-cell mouse embryos. Cell. 2016;165(1):61–74.
Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, Desai TJ, Krasnow MA, Quake SR. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature. 2014;509(7500):371–5.
Ting DT, Wittner BS, Ligorio M, Jordan NV, Shah AM, Miyamoto DT, Aceto N, Bersani F, Brannigan BW, Xega K, Ciciliano JC, Zhu HL, MacKenzie OC, Trautwein J, Arora KS, Shahid M, Ellis HL, Qu N, Bardeesy N, Rivera MN, Deshpande V, Ferrone CR, Kapur R, Ramaswamy S, Shioda T, Toner M, Maheswaran S, Haber DA. Single-cell RNA sequencing identifies extracellular matrix gene expression by pancreatic circulating tumor cells. Cell Rep. 2014;8(6):1905–18.
Zheng RQ, Li M, Liang ZL, Wu FX, Pan Y, Wang JX. SinNLRR: a robust subspace clustering method for cell type detection by nonnegative and low-rank representation. Bioinformatics. 2019;35(19):3642–50.
Jiang H, Sohn LL, Huan HY, Chen LN. Single cell clustering based on cell-pair differentiability correlation and variance analysis. Bioinformatics. 2018;34(21):3684–94.
Wang B, Zhu JJ, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nat Methods. 2017;14(4):414–6.
von Luxburg U. A tutorial on spectral clustering. Stat Comput. 2007;17(4):395–416.
Lu C, Yan S, Lin Z. Convex sparse spectral clustering: single-view to multi-view. IEEE Trans Image Process. 2016;25(6):2833–43.
Hartigan JA, Wong MA. Algorithm AS 136: a K-means clustering algorithm. J Roy Stat Soc Ser C Appl Stat. 1979;28(1):100–8.
Wold S, Esbensen K, Geladi P. Principal component analysis. Chemom Intell Lab Syst. 1987;2(1–3):37–52.
Yang Z, Wang C, Oja E. Multiplicative updates for t-SNE. In: 2010 IEEE international workshop on machine learning for signal processing; 2010. pp. 19–23.
Mittal M, Goyal LM, Hemanth DJ, Sethi JK. Clustering approaches for high-dimensional databases: a review. Wiley Interdiscip Rev Data Min Knowl Discov. 2019;9(3):e1300.
Steinbach M, Ertöz L, Kumar V. The challenges of clustering high dimensional data. In: New directions in statistical physics: econophysics, bioinformatics, and pattern recognition. Springer; 2004. pp. 273–309.
Alibuhtto M, Mahat N. Distance-based k-means clustering algorithm for determining number of clusters for high dimensional data. Decis Sci Lett. 2020;9(1):51–8.
Yan J, Liu W. An ensemble clustering approach (consensus clustering) for high-dimensional data. Secur Commun Netw. 2022;2022(6):1–9.
Ikotun AM, Almutari MS, Ezugwu AE. K-means-based nature-inspired metaheuristic algorithms for automatic data clustering problems: recent advances and future directions. Appl Sci. 2021;11(23):11246.
Khan I, Luo Z, Shaikh AK, Hedjam R. Ensemble clustering using extended fuzzy k-means for cancer data analysis. Expert Syst Appl. 2021;172:114622.
Papalexi E, Satija R. Single-cell RNA sequencing to explore immune cell heterogeneity. Nat Rev Immunol. 2018;18(1):35–45.
Saviano A, Henderson NC, Baumert TF. Single-cell genomics and spatial transcriptomics: discovery of novel cell states and cellular interactions in liver physiology and disease biology. J Hepatol. 2020;73(5):1219–30.
Arshad I, Kanwal A, Zafar I, Unar A, Hanane M, Razia IT, Arif S, Ahsan M, Kamal MA, Rashid S. Multifunctional role of nanoparticles for the diagnosis and therapeutics of cardiovascular diseases. Environ Res. 2023;8:117795.
Zhu YZ, Liu Y, Liao XW, Luo SS. Identified a disintegrin and metalloproteinase with thrombospondin motifs 6 serve as a novel gastric cancer prognostic biomarker by bioinformatics analysis. Biosci Rep. 2021;41(4):4359.
Hasegawa S, Noda K, Maeda A, Matsuoka M, Yamasaki M, Fukui T. Acetoacetyl-CoA synthetase, a ketone body-utilizing enzyme, is controlled by SREBP-2 and affects serum cholesterol levels. Mol Genet Metab. 2012;107(3):553–60.
Funding
No funding.
Author information
Contributions
Gao-Fei Wang and Longying Shen contributed jointly to the research: Gao-Fei Wang focused on the main methods and the writing, while Longying Shen concentrated on the detailed methods and results. Both authors actively participated in the conception, design, data acquisition, analysis, and interpretation, and played a significant role in drafting and critically revising the manuscript. Both authors have read and approved the final version for submission.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests regarding the publication of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wang, GF., Shen, L. Cauchy hypergraph Laplacian nonnegative matrix factorization for single-cell RNA-sequencing data analysis. BMC Bioinformatics 25, 169 (2024). https://doi.org/10.1186/s12859-024-05797-4
DOI: https://doi.org/10.1186/s12859-024-05797-4