Skip to main content

Identification of pre-microRNAs by characterizing their sequence order evolution information and secondary structure graphs

Abstract

Background

Distinction between pre-microRNAs (precursor microRNAs) and length-similar pseudo pre-microRNAs can reveal more about the regulatory mechanism of RNA biological processes. Machine learning techniques have been widely applied to deal with this challenging problem. However, most of them mainly focus on secondary structure information of pre-microRNAs, while ignoring sequence-order information and sequence evolution information.

Results

We use new features for the machine learning algorithms to improve the classification performance by characterizing both sequence order evolution information and secondary structure graphs. We developed three steps to extract these features of pre-microRNAs. We first extract features from PSI-BLAST profiles and Hilbert-Huang transforms, which contain rich sequence evolution information and sequence-order information respectively. We then obtain properties of small molecular networks of pre-microRNAs, which contain refined secondary structure information. These structural features are carefully generated so that they can depict both global and local characteristics of pre-microRNAs. In total, our feature space covers 591 features. The maximum relevance and minimum redundancy (mRMR) feature selection method is adopted before support vector machine (SVM) is applied as our classifier. The constructed classification model is named MicroRNA −NHPred. The performance of MicroRNA −NHPred is high and stable, which is better than that of those state-of-the-art methods, achieving an accuracy of up to 94.83% on same benchmark datasets.

Conclusions

The high prediction accuracy achieved by our proposed method is attributed to the design of a comprehensive feature set on the sequences and secondary structures, which are capable of characterizing the sequence evolution information and sequence-order information, and global and local information of pre-microRNAs secondary structures. MicroRNA −NHPred is a valuable method for pre-microRNAs identification. The source codes of our method can be downloaded from https://github.com/myl446/MicroRNA-NHPred.

Background

Mature microRNAs (miRNAs) are small single-stranded, non-coding RNAs (about 22 nucleotides in length), which play significant regulatory roles in various biological processes of animals, plants and viruses [1, 2]. There are two other forms of miRNAs: primary miRNAs (pri-miRNAs) and precursor microRNAs (pre-miRNAs). Mature miRNAs are cleaved from 90nt pre-miRNAs which are derived from the processing of a long pri-miRNA by a ribonucluease [3]. Precursor miRNAs have been widely studied at the earliest time, and many commercialized miRNA libraries take this form. With the advent of the post genome era and the development of sequencing technology, how to find all forms of miRNAs from millions of reads has become one of the challenging topics in bioinformatics. It is also difficult to experimentally identify the lowly expressed miRNAs or the miRNAs that are expressed in the specific tissues or in the developmental stage. On the other hand, as mature miRNAs are very short, the traditional feature engineering approaches [4] are usually failed to extract effective features from their sequences and structures. Therefore, current computational methods are focusing on the identification of pre-miRNAs instead of mature miRNAs.

These methods to identify pre-microRNAs can be grouped into four categories. The first category contains the earliest methods which are based on searching homologous genes [5]. The search process is a typical alignment problem of sequences and structures. The main alignment algorithms include the Smith-Waterman algorithm [5], the FASTA algorithm, and the BLAST algorithm [69]. However, these methods can only find highly homologous miRNAs with known miRNA sequences and require a large amount of computational resource for whole genomes. The second category contains comparative genome methods which predict miRNAs in the study of species of early stages. These methods mainly utilize the conservation characteristics of miRNAs and their precursor sequences in multiple species to search for the conserved sequences in the intergenic region. These sequences have a better secondary structure of stem ring. Based on comparative genomics, the limitation of predicting miRNAs is that the predicted miRNA candidates are highly conserved in multiple species, and these methods cannot be used to predict miRNAs which are not conserved [1013]. At the same time, these methods are also subject to challenges of both time complexity and space complexity. The third category is based on conservation of binding sites of miRNAs which are the short sequences of miRNA binding the target mRNAs. These short sequences have conserved properties among multiple species [1416]. The miRNAs and the target mRNAs usually have perfect complementary features in plants, while it does not match well in animals. Therefore, this category of methods is usually used in plants. The fourth category is based on machine learning methods [1721].

Machine learning uses the information on sequences, structural and thermodynamic energy of pre-microRNAs. These methods can discover new, non-homologous pre-microRNAs. So, machine learning is the main approach for miRNA prediction and identification at present. The difficulty of the method is how to select the positive/negative samples which are able to describe sufficiently the whole sample space and how to find a better distinction between true/false pre-miRNAs. In addition, high false positive rates and computational complexity likely occur in the prediction of whole genome data. Thus, further improvement in sensitivity and specificity of the pre-miRNA classification is necessary. It is also a desirable task to explore a solution based on machine learning prediction.

By the problem of pre-microRNA identification, two major procedures are required: feature extraction and machine learning. In the past few decades, extracted features of pre-microRNAs are related to three sources: primary sequences, secondary structures and thermodynamical properties. Among them, the k-mer sequence composition (based on the primary sequence) is the most successful approach for the representation of pre-microRNAs [22]. Many studies have shown that most of pre-microRNAs have the properties of stem loop hair-pin structures [19]. Therefore, secondary structures can be predicted, and features derived from these structures, for instance, Xue et al. extracted 32 local structure features in triplet-SVM to predict human pre-microRNAs [19]. Energy characteristics are another kind of important features of pre-microRNAs [23, 24]. It is well studied that good features and positive / negative (real / pseudo pre-microRNA) datasets are the basis of constructing effective classification models.

In this study, we extract some novel features of pre-microRNAs for improving the current classification performance. To describe local or short-range sequence order information and evolution information of pre-microRNAs, we introduce PSI-BLAST profiles into the analysis of pre-microRNAs for the first time. And also, we introduce the Hilbert-Huang transform [25] for the first time, which is a time-frequency analysis method. Hilbert-Huang transforms are capable of capturing the local and long-range relationship between sequence bases. We obtain the topological parameters of small molecular networks constructed from the secondary structures of pre-microRNAs, which contain refined secondary structure information. These features are carefully selected so that they can depict both global and local characteristics of pre-microRNA structures. After these feature extraction, we apply support vector machine (SVM) as our classifier, and use the maximum relevance and minimum redundancy (mRMR) [26] method in the feature selection. Then, a new predictor MicroRNA −NHPred is constructed using the optimal feature set, which achieves an accuracy of up to 94.83% on a benchmark dataset. Our newly constructed predictor also improves the sensitivity and specificity of precursor miRNA identification.

Methods

Datasets

The benchmark dataset is adopted from [2731], which consists of positive samples (true pre-microRNAs) and negative samples (pseudo pre-microRNAs). The set of positive samples is originated from the miRBase (released on 20 June, 2013) [32], composed of 1872 experimentally confirmed pre-microRNA sequences of homo sapiens. These sequences were filtered by the CD-HIT software [33], and the redundant sequences were filtered out with a threshold of 80% sequence identity. Finally, we obtained 1612 true pre-microRNA sequences as positive samples. Exactly as done by the literature works [1719, 24], we used 8494 human pseudo pre-microRNAs. This set of negative samples collected from human protein coding regions was downloaded from [19]. These sequences are very similar to the real pre-microRNAs in the sequence length, the minimum base pair of their stem of hairpin structure and the maximum energy of secondary structure. In the same way as positive samples, we used the CD-HIT software to filter the sequences so that sequence similarity of the negative samples is kept below 80%. To overcome the sample imbalance problem [27, 28], 1612 sequences are selected randomly as negative samples from the filtered sequences.

The classification performance of our method in comparison with other methods was also tested on an independent test set. This test set comes from the latest released miRBase 22 [34] (released on March 2018) which contains 1917 pre-microRNA sequences of homo sapiens. Note that miRBase 20 (released on June 2013) contains only 1872 homo sapiens pre-microRNA sequences. The 78 new homo sapiens pre-microRNA sequences are used as the independent test set, which is named hsa dataset. We also used 410 non-coding datasets filtered out by us in Reference [18] as our negative test set (named ncRNA dataset). Meanwhile, we randomly selected 1000 human pseudo pre-microRNAs from the remaining 6882 sequences as our second negative test set (named human negative dataset).

Feature extraction

We take three steps to extract different features of pre-microRNAs from PSI-BLAST profiles [35, 36], parameters of networks [37] and spectrum analysis based on the Hilbert-Huang transform [25].

PSI −BLAST profile −based features

The PSI-BLAST profile is represented as a so-called position specific score matrix (PSSM), which is acquired through aligning a query amino acid sequence to the NCBI’s nonredundant (NR) database using PSI-BLAST [35]. In this work, we apply this idea to nucleotide sequences.

First, we build a new database, which is composed of all the pre-microRNA sequences in the miRBase (http://www.mirbase.org/) and the 8494 human pseudo pre-microRNAs [19] and 754 non-coding RNAs studied in [18].

Second, we use PSI-BLAST to align a query nucleotide sequence in the dataset to the newly built database and to get the PSSM for the sequence. The PSSM is a matrix of size L×5, where L is the length of the query sequence and 5 is due to the 4 nucleotide symbols (A,C,G,U) and the symbol −. Its elements are 10× loge of the ratios between the observed base frequencies and the background base frequencies, and rounded down to the nearest integer.

Third, our feature extraction method also starts by transforming each element sij of the PSSM into \(s_{ij}^{^{\prime }}\) using

$$ s_{ij}^{^{\prime}}=2^{0.1\times s_{ij}}. $$
(1)

The resulting value \(s_{ij}^{^{\prime }}\) is guaranteed to be non-negative even when sij is negative. We further apply normalization to the values \(s_{ij}^{^{\prime }}\) so that each row sums to one. Let fij denote the normalized value of \(s_{ij}^{^{\prime }}\). All the values fij form a matrix, which are called the frequency matrix (FM).

Fourth, to extract PSI-BLAST profile features, a so-called concensus sequence (CS) [38] is constructed from the FM as follows:

$$ \mu (i)=\mathrm{arg\ max}\{f_{ij}:\ 1\leq j\leq 4\},1\leq i\leq L. $$
(2)

The i-th base CS(i) of the consensus sequence is set to be the μ(i)-th nucleotide in the nucleotide alphabet. It can be seen that a consensus sequence retains the most valuable evolutionary information from the PSSM.

Fifth, we compute

$$ \text{NCCS}(j)=\frac{n(j)}{L},1\leq j\leq 4, $$
(3)

where n(j) is the number of the nucleotide j occurring in the CS. It gives 4 features corresponding to the nucleotide of the CS. Moreover, we include the entropy into our feature set, that is,

$$ \text{ECS}=-\sum\limits_{j=1}^{5}\text{NCCS}(j)\log_{e}\text{NCCS}(j). $$
(4)

Another entropy-based feature is directly computed from FM to reflect the global characteristic of the PSSM:

$$ \text{EFM}=-\frac{1}{L}\sum\limits_{i=1}^{L}\sum\limits_{j=1}^{5}f_{ij} \log_{e}f_{ij}. $$
(5)

Most of the extracted features of k-mer features shown in many articles are based on the original sequences. In this study, we extract their features from the CS of the original nucleotide sequences. Since a pre-microRNA sequence is too short (about 60bp-130bp), longer k are less likely to be exactly conserved among species. So, we computed k-mers with k=2,3 resulting in 80 (16+64) different features. At the same time, we calculate the content of GC from the consensus sequences.

In summary, for each query sequence, a total of 87 features are extracted from its PSI-BLAST profile. Our experimental results show that the features extracted from CS are more effective to discriminate between real pre-miRNAs and pseudo pre-miRNAs than those from the original nucleotide sequences.

Topological parameters of small molecular networks extracted from secondary structures

The pre-microRNA has a very significant secondary structure in the hairpin shape. There are many methods based machine-learning to identify pre-microRNAs which take advantage of the hairpin shape, so that the prediction accuracy has been greatly improved. There are more representative Triplet-SVM [19], iMiRNA-PseDpc [27], and properties based on networks [37] in these methods. In Refs. [39, 40], the authors have verified that the features based on networks have higher prediction accuracies. Meanwhile, in Ref. [37], Childs et al. further discussed the topological properties of the networks, which can reflect more essential characteristics of the pre-microRNAs. Therefore, in this work, we extract features based on networks constructed from the secondary structure, and the process is as follows:

Firstly, each nucleotide sequence of positive and negative samples is folded into a stem-loop secondary structure by RNAfold [41]. Secondly, we use a two-dimensional network (graph) to represent the RNA secondary structure, where all nucleotides are converted to nodes and all bonds between nucleotides are converted to edges. Network elements, including nodes and edges, can be defined by the network itself or parameters which may relate to limited or full knowledge of the network. According to [37], Childs et al. classified the network parameters into three types: local, local-global and global structural properties that can be used as a method in identification of RNA family. Here we use the summary statistics for the local-global properties, since they provide insight not only on the global level of the graph itself, but also on the level of its nodes and edges. Thirdly, all properties were calculated using the igraphR package (http://igraph.org) for complex networks. In this study, 24 network parameters are extracted to describe the stem-loop structure of pre-microRNAs based on previous works and experimental criteria [37] although a number of network parameters are available. We also choose the following features: degree, path length, shortest path, graph motifs, articulation point, modularity, graph density, coreness, closeness, centrality, bibliographic coupling, transitivity, cocitation coupling, diameter, node betweenness, edge betweenness, grith, constraint, hub score, and so on. A brief definition of all graph properties used in this study is provided in [37].

Extraction of sequence-order features based on the Hilbert-Huang transform

The features of the pre-microRNAs based on k-mers, with k small, they can only describe the short-range relationship between the nucleotide sequences. When k is larger, they can describe the long-range relationship of the nucleotide sequences, but the dimension of extracted feature vector is too large, which leads to the curse of dimensionality, and the classifier’s performance will be reduced. Since most of the previous methods extracted k-mer composition information from a nucleotide sequence (for pre-microRNAs, k generally takes the values 2, 3, 4), the sequence-order information is missing. Although Chen and Li [42] considered local sequence-order information based on Chou’s concept of pseudo amino acid composition, the overall prediction accuracy was not significantly improved. In order to depict the long range relationship and order information of the sequence, we introduce the Hilbert-Huang transform [25] based on the physical and chemical properties of the known dinucleotides.

The Hilbert-Huang transformation consists of two parts: empirical mode decomposition (EMD) and Hilbert spectral analysis (HSA). The EMD method, which was originally proposed by Huang et al. [25] for the study of ocean waves, is a time-frequency analysis, and has been used by our group to simulate geomagnetic field data [43] and to predict protein subnuclear localization [44]. In EMD, the base functions, which are called intrinsic mode functions (IMFs), are obtained adaptively from the original signal. The principle and details of Hilbert spectral analysis can be found in [25, 44]. Combining the sequences of the pre-microRNAs and the physical and chemical characteristics of the dinucleotides, the feature extraction method based on the Hilbert-Huang transform is described as follows:

  1. 1

    According to the physical and chemical properties of dinucleotides and the intrinsic characteristics of Hilbert-Huang transform, we selected 15 physical and chemical properties for RNAs from the database [45], including: enthalpy, enthalpy2, entropy, entropy2, free energy, free energy2, hydrophilicity, hydrophilicity2, rise, roll, shift, slide, stackingenergy, tilt, twist.

  2. 2

    According to the physical and chemical properties of dinucleotides, the sequence of each pre-microRNA was converted into 15 time series by sliding a window along the sequence.

  3. 3

    At first, we got the intrinsic mode functions of each time series by EMD. The EMD for a hydrophilicity2 time series of the pre-microRNA hsa-mir-6843 is shown in Fig. 1. And then we applied HSA to every intrinsic mode function to obtain the analysis signals. Finally, we obtained 32 features for each time series. The specific signal analysis process can be found in [44].

    Fig. 1
    figure 1

    Six IMF components and the residual obtained by EMD of hydrophilicity2 time series of the pre-microRNA hsa-mir-6843

In this study, we transformed all the RNA sequences into time series according to the 15 physical and chemical properties of dinucleotides. In total, we extracted 480 Hilbert-Huang features.

Feature selection method

After the feature extraction, some extracted features may be redundant and some may not be related to any class. There are many ways to remove redundant or useless features (in the sense that they have no significant relation to a class), such as mRMR [26], FOCUS [46], Wrapper [47], and so on. In this work, we choose the mRMR method as our feature selection method.

Let Ω be the whole feature space which contains all of the aforementioned 591 features in this work; each sequence is represented by a vector consisting of the values of these 591 features. We assume that E and F are two disjoint subsets of Ω and Ω=EF. In order to select a feature fj in E with maximum relevance and minimum redundancy in F, we use the following formula:

$$ \max\limits_{f_{j}\in E}\left[I(f_{j},\theta)-D(f_{j},F)\right],\ j=1,2,\ldots,\sharp E, $$
(6)

where θ is a vector characterizing the class of all nucleotide sequences in the sample set, E denotes the cardinality of the subset E, I(fj,θ) measures the relevance of characteristic fj and class vectors θ, D(fj,F) measures the redundancy of characteristic fj and the feature subset F. The definitions of I(fj,θ) and D(fj,F) are given in Ref. [26].

In the actual computation process, we regard E as a feature set to be selected, and F as an already selected feature set. At the beginning, E is the feature space, F is the null space, the process of the mRMR method is as follows: First, we select a feature that is most relevant to the class vector in E, then remove it from E and add it to F. Second, according to the mRMR function, repeat the first step. After Ω cycles, E is null and F is the entire feature set. According to the order in which the feature is added to F, the features in the whole feature set are reordered, and we use S to represent the ordered feature set:

$$ S=\left\{f_{i_{1}},f_{i_{2}},f_{i_{3}},\ldots,f_{i_{\sharp \Omega }}\right\}. $$
(7)

After all features are ranked, we can determine the optimal feature components by an incremental feature selection (IFS) method [48]. For the ranked feature set S, we can construct the feature component sets by adding one component at a time in an ascending order as follows:

$$ S_{k}=\{f_{i_{1}},f_{i_{2}},f_{i_{3}},\ldots,f_{i_{k}}\}\ (1\leqq k\leqq \sharp \Omega). $$
(8)

For each feature component set, a predictor is constructed and the accuracy is obtained by the rigorous jackknife validation. Finally, we choose the feature component set for the best jackknife validation accuracy as the optimal feature set.

Support vector machine

A Support Vector Machine (SVM) is a class of supervised learning algorithms first introduced in [49]. It is based on statistical theory, and has a good general application. In this work, we use an SVM as a classifier to identify the real and pseudo pre-microRNAs.

Given a set of labelled training vectors (positive and negative input samples), SVM learns a linear decision boundary from both positive and negative training samples to discriminate between the unknown RNA sequences. The RNA sequence in the training set and the test set are transformed into fixed-dimension feature vectors following the process introduced above, and then the training vectors are input into SVM to construct the classifier. The SVM gives a predicted class for each RNA sequence in the test set.

The LIBSVM algorithm [50] was employed, which is a type of software for SVM classification and regression. The radial basis function (RBF) defined as

$$ k(\mathbf{x}_{i},\mathbf{x}_{j})=\exp\left(-\gamma (\parallel \mathbf{x}_{i}- \mathbf{x}_{j}\parallel)^{2}\right), \gamma >0 $$
(9)

is used as the kernel function k(x,y) in the SVM. Here, {x1,…,xn} is a given dataset. For a Gaussian RBF, γ is parametrized as \(\gamma =\frac {1}{2\sigma ^{2}}\). The parameter γ and the soft margin parameter C are optimized on the benchmark dataset by adopting the grid tool provided by LIBSVM [50]. The parameters of the predictor constructed by different feature sets are shown in Table 1. More details are provided in [51].

Table 1 The performance of different feature sets

The proposed identification method

Figure 2 illustrates the overall architecture of our proposed method which is called MicroRNA-NHPred. Firstly, the query nucleotide (RNA) sequences are input into PSI-BLAST to obtain PSSM, and entropy of sequences and consensus sequences (CS) [38]. We then obtain k-mer composition of CS. The query nucleotide sequence is submitted to RNAfold software to generate a secondary structure.

Fig. 2
figure 2

Flow chart of the identification method in this study

We build a single molecule network from the secondary structure, then extract network topological parameters. Each RNA molecule is represented by the topological parameters of a single molecule network.

On the other hand, the query nucleotide sequence is converted into a time series based on the physicochemical properties of the RNA. The obtained time series are transformed and 480 characteristics are obtained. Ultimately, we get 591 features in total. These features are finally put into an SVM-based classifier for pre-microRNA classifier recognition.

Performance evaluation

The performance of the predictor should be objectively evaluated. In statistical prediction, three cross-validation tests are often used to evaluate the prediction performance: independent dataset test, sub-sampling (or K-fold crossover validation) test and jackknife test. Only the jackknife test is the least arbitrary that can always yield a unique result for a given benchmark dataset [52, 53]. That is why researchers have a preference for the jackknife test for examining the quality of various machine learning based predictors such as [30, 31, 44]. Hence, in this paper we also use the jackknife test to evaluate the accuracy of the current predictor, and use independent test samples to further verify the reliability of our method. In the jackknife test, each sequence in the samples is singled out in turn as a test sample and the remaining sequences are used as training samples. Although the jackknife test consumes more computing resources, it is worthwhile to have a single output for a given set of samples.

When the cross-validation method is selected, we need to choose the performance metrics of the predictor. The identification of pre-microRNAs is a binary classification problem. For this problem, we select the following indicators to evaluate our predictor: Sn (sensitivity), Sp (specificity), Acc (overall accuracy), Mcc (Mathew correlation coefficient) [54], calculated by Sn=TP/(TP+FN), Sp=TN/(TN+FP), Acc=(TP+TN)/(TP+TN+FP+FN), and

$$Mcc=\frac{(TP\times TN)-(FP\times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}. $$

In the above equations, TP means the true positive, TN the true negative, FP the false positive and FN the false negative. The sensitivity denotes correct identification of positive pre-microRNAs by avoiding false negative, while the specificity denotes correct identification of negative pre-microRNAs by avoiding false positive. The sensitivity and the specificity range between 0 and 1, the bigger the value, the better the predictor. The Mathew correlation coefficient (Mcc) ranges between -1 and 1, the overall accuracy (Acc) ranges between 0 and 1.

Discussion and results

Parameter selection by mRMR

We develop three steps to extract 591 features, and those features are shown in Additional file 1. Since some of these features are not essential and may not be significantly related to the classes of pre-microRNAs, we used the method in subsection “Feature selection method” to sort the features first and used the increment feature selection method to select the optimal feature set. For each feature subset, we constructed a classifier and derived its jackknife validation accuracy. Finally, we obtained the best feature subset corresponding to the best jackknife validation accuracy as the optimal feature subset. And the optimal feature set is shown in Additional file 2. We used all the feature sets to construct the predictor, whose jackknife validation accuracy turns out to be 89.73%. We used the optimal feature subset to construct a predictor with a jackknife validation accuracy of 94.83% being achieved.

At the same time, we also enumerate the top 30 features in the optimal feature set, as shown in Table 2. We can see from the Table 2 that the most relevant category of true / false pre-microRNAs is the Efm (the entropy of the frequency matrix) feature which is extracted by PSI-BLAST profiles. The average degree of node which can portray the base pairing property of RNA sequence is the second most relevant feature. In addition, we can also see that 11 features of the top 30 come from network features, 13 from HHT features, and 6 from PSI-BLAST profiles. This shows that we use three different methods to extract different levels of pre-microRNA features, which are informative and complementary.

Table 2 The top 30 features by feature selection

Performance of predictor on different feature sets

As shown in subsection “Feature extraction methods”, we used 3 different methods to extract 3 different feature sets. In order to study the effect of different feature sets on the performance of the predictor, we tested the single feature set and different feature combinations respectively on prediction performance, as shown in Table 1. We can see that the three feature sets have different contributions to the recognition of pre-microRNAs, of which the contribution of the network feature set is the most significant and the accuracy of the predictor is 87.85%.

We firstly introduced PSI-BLAST to the prediction of pre-microRNAs. In order to verify the performance contribution of the k-mers from CS, we separately extracted k-mers (k=2, 3) from the original sequence and the CS for jackknife test verification. The result of the test is shown in Table 3. The accuracy of jackknife test validation shows that the consensus sequences contain much more evolution information than the nucleotide sequences, thereby leading to more accurate pre-microRNA identification.

Table 3 The performance of different k-mers: (k=2,3)

Secondary structure features have a variety of different representations, e.g, triplet-SVM [19], iMcRNA-PseSSC [27], network [37], and so on. To verify the effect of three secondary structure features on the problem of pre-microRNA classification, we used the jackknife test on the same benchmark dataset. As shown in Table 4, we found that the parameters of networks reflect the pre-microRNA secondary structure. So, we used the parameters of networks to depict the secondary structure of pre-microRNAs in this work.

Table 4 The performance of different features of secondary structure

Comparison with other methods

We compared our predictor with the best and most accurate predictors in this field, triplet-SVM [19], miPred [24], iMcRNA-EXPseSSC [27], microR-Pred (SVM) [31]. The comparison indicates that the accuracy of our predictor is higher than other predictors in the same larger and more stringent benchmark dataset using rigorous jackknife tests. As can be seen from Table 5, we have the highest prediction accuracy on Mcc, Accuracy and Sn, and only Sp is lower than miPred [24] and microR-Pred (SVM) [31], but also higher than 90%.

Table 5 The performance of different methods on the same benchmark dataset

Performance evaluation on an independent test set

The benchmark dataset was constructed based on miRBase released 20 (June 2013). At present, compared with miRBase released 20, the latest miRBase released 22 reports 78 new homo sapiens pre-microRNAs, which were treated as an independent test set to further evaluate the performance of the proposed MicroRNA-NHPred. The test results are shown in Table 6. This method trained with the benchmark dataset can correctly predict 75 testing samples in the independent dataset as true sapiens pre-microRNAs. The accuracy of the proposed method can reach 96.15%, which demonstrates the stable prediction performance of microRNA-NHPred for predicting sapiens pre-microRNAs.

Table 6 The result of different methods on an independent test set

MicroR-Pred (SVM) [31] and iMcRNA-EXPseSSC [27], which are the most accurate predictors in this field as we know, were also tested on the same independent test set. It is worth noting that microR-Pred (SVM) [31] and iMcRNA-EXPseSSC [27] correctly identified 71 and 67 homo spaeins pre-microRNAs with an accuracy of 91.03% (71/78) and 85.90% (67/78) respectively. Our method is more accurate on these two negative independent test datasets than IMcRNA-EXPseSSC [27], but slightly less accurate than MicroR-Pred (SVM) [31], as shown in Table 7. This further confirms the reliability and validity of our method.

Table 7 Classification accuracy of different methods on independent test sets

Conclusion

Distinction between pre-microRNAs and length-similar pseudo pre-microRNAs is a biologically important problem which can help understand more about RNA regulatory mechanisms. In this study, we have developed a new classification method called MicroRNA-NHPred for pre-microRNA identification. It exploits the sequence evolution information extensively from PSI-BLAST profiles, the sequence order information from Hilbert-Huang transforms and the secondary structure information from small molecule networks. A comprehensive set of 591 features is thus constructed, which depicts both global and local characteristics of sequence and secondary structure. An optimal set of 268 selected features is used by our MicroRNA −NHPred for the classification, and it has achieved an accuracy of up to 94.83% on same benchmark datasets.

Literature works have also used machine learning techniques to identify pre-microRNA. Our research is different in several ways, as summarized below:

  1. 1

    We introduced PSI-BLAST into the analysis of pre-microRNAs. We extracted features from the consensus sequences constructed from PSSMs rather than from their respective nucleotide sequences. The former retains richer sequence evolution information. To our best knowledge, this is the first attempt to extract features from the consensus sequences.

  2. 2

    We introduced the Hilbert-Huang transform into pre-microRNA identification for the first time, and used it to describe the local and long-range relationships between sequence bases.

  3. 3

    We used the network parameters from the single molecule network of a pre-microRNA rather than use the triplet structure to represent the secondary structure of the pre-microRNA. These network parameters can describe more completely the local and global characteristics of RNAs. Under the same benchmark dataset, the accuracy of network parameters can reach 87.85%, while the well-known triplet-SVM reaches only 81.85%.

  4. 4

    We introduced feature selection method, mRMR, into pre-microRNA identification for the first time, which yields that the accuracy of the predictor achieves 94.83% while it is 89.73% before the feature selection. The obtained results verify the significance of the feature selection.

It was observed via the rigorous cross-validation on a larger and more stringent benchmark dataset that the new predictor outperformed or was highly comparable with the best existing predictor in this area. We also performed test on an independent dataset, the results indicate that the new predictor outperforms the two best predictors for the identification of miRNAs precursor [27, 31]. This implies that the feature set obtained in this paper is highly beneficial to pre-microRNA identification. At the same time, we can conclude that hybrid features (both the primary and secondary structural features) as well as mRMR have a key role in performance improvement. If the method proposed in this paper is only used for human pre-microRNA identification, its value is limited. So, our further work is to extend it to identify pre-microRNA for cross species, and further adds some energy features to the features set.

References

  1. Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004; 116(2):281–97.

    CAS  PubMed  Google Scholar 

  2. Chatterjee S, Grobhans H. Active turnover modulates mature microRNA activity in Caenorhabditis elegans. Nature. 2009; 461(7263):546–9.

    Article  CAS  PubMed  Google Scholar 

  3. Wang Y, Chen X, Jiang W. Predicting human microRNA precursors based on an optimized feature subset generated by GA-SVM. Genomics. 2011; 98(2):73–8.

    Article  CAS  PubMed  Google Scholar 

  4. Cai R, Zhang Z, Hao Z. BASSUM. A Bayesian semi-supervised method for classification feature selection. Pattern Recog. 2011; 44(4):811–20.

    Article  Google Scholar 

  5. Weber MJ. New human and mouse microRNA genes found by homology search. Febs J. 2005; 272(1):59–73.

    Article  CAS  PubMed  Google Scholar 

  6. Dezulian T, Remmert M, Palatnik JF, Huson DH. Identification of plant microRNA homologs. Bioinformatics. 2006; 22(3):359–60.

    Article  CAS  PubMed  Google Scholar 

  7. Legendre M, Lambert A, Gautheret D. Profile-based detection of microRNA precursors in animal genomes. Bioinformatics. 2005; 21(7):841–5.

    Article  CAS  PubMed  Google Scholar 

  8. Gautheret D, Lambert A. Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J Mol Biol. 2001; 313(5):1003.

    Article  CAS  PubMed  Google Scholar 

  9. Wang X, Zhang J, Li F, Gu J, He T, Zhang X, Li Y.MicroRNA identification based on sequence and structure alignment. Bioinformatics. 2005; 21(18):3610–4.

    Article  CAS  PubMed  Google Scholar 

  10. Lim LP, Lau NC, Weinstein EG, Abdelhakim A, Yekta S, Rhoades MW, Burge CB, Bartel DP. The microRNAs of Caenorhabditis elegans. Genes Dev. 2003; 17(8):991–1008.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Ohler U, Yekta S, Lim LP, Bartel DP, Burge CB. Patterns of flanking sequence conservation and a characteristic upstream motif for microRNA gene identification. Rna-a Publ Rna Soc. 2004; 10(9):1309–22.

    Article  CAS  Google Scholar 

  12. Lai EC, Tomancak P, Williams RW, Rubin GM. Computational identification of Drosphila microRNA genes. Genome Biol. 2003; 4(7):R42.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Wang XJ, Reyes JL, Chua NH, Gaasterland T. Prediction and identification of Arabidopsis thaliana microRNAs and their mRNA targets. Genome Biol. 2004; 5(9):R65.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Jonesrhoades MW, Bartel DP. Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol Cell. 2004; 14(6):787–99.

    Article  CAS  Google Scholar 

  15. Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, Lander ES, Kellis M.Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals. Nature. 2005; 434(7031):338–45.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Adai A, Johnson C, Mlotshwa S, Sundaresan V. Computational prediction of miRNAs in Arabidopsis thaliana. Genome Res. 2005; 15(1):78–91.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Ng KL, Mishra SK. De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures. Bioinformatics. 2007; 23(11):1321–30.

    Article  CAS  PubMed  Google Scholar 

  18. Batuwita R, Palade V. microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics. 2009; 25(8):989–95.

    Article  CAS  PubMed  Google Scholar 

  19. Xue C, Li F, He T, Liu GP, Li Y, Zhang X. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics. 2005; 6(1):310.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  20. Ding J, Zhou S, Guan J. MiRenSVM: towards better prediction of microRNA precursors using an ensemble SVM classifier with multi-loop features. BMC Bioinformatics. 2010; 11Suppl 11(Suppl 11):S11.

    Article  Google Scholar 

  21. Nam JW, Shin KR, Han J, Lee Y, Kim VN, Zhang BT. Human microRNA prediction through a probabilistic co-learning model of sequence and structure. Nucleic Acids Res. 2005; 33(11):3570–81.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Yousef M, Khalifa W, Acar İE, Allmer J. MicroRNA categorization using sequence motifs and k-mers. BMC Bioinformatics. 2017; 18(1):170.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  23. Lopes IDO, Schliep A, Carvalho ACDLD. The discriminant power of RNA features for pre-miRNA recognition. BMC Bioinformatics. 2014; 15(1):1–11.

    Article  CAS  Google Scholar 

  24. Jiang P, Wu H, Wang W, Ma W, Sun X, Lu Z. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res. 2007; 35(Web Server issue):W339–344.

    Article  PubMed  PubMed Central  Google Scholar 

  25. Huang NE, Shen Z, Long SR, Wu M, Shih HH, Zheng Q, Yen NC, Tung CC, Liu HH. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc Math Phys Eng Sci. 1998; 454(1971):903–95.

    Article  Google Scholar 

  26. Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005; 27(8):1226–38.

    Article  PubMed  Google Scholar 

  27. Liu B, Fang L, Liu F, Wang X, Chen J, Chou KC. Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS ONE. 2015; 10(3):e0121501.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  28. Liu B, Fang L, Chen J, Liu F, Wang X. miRNA-dis: microRNA precursor identification based on distance structure status pairs. Mol Biosyst. 2015; 11(4):1194–204.

    Article  CAS  PubMed  Google Scholar 

  29. Liu B, Fang L, Wang S, Wang X, Li H, Chou KC. Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy. J Theor Biol. 2015; 385(21):153–9.

    Article  CAS  PubMed  Google Scholar 

  30. Liu B, Fang L, Liu F, Wang X, Chou KC. iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach. J Biomol Struct Dyn. 2016; 34(1):223–35.

    Article  CAS  PubMed  Google Scholar 

  31. Khan A, Shah S, Wahid F, Khan FG, Jabeen S. Identification of microRNA precursors using reduced and hybrid features. Mol Biosyst. 2017; 13(8):1640–5.

    Article  CAS  PubMed  Google Scholar 

  32. Kozomara A, Griffiths-Jones S. miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2011; 39(Database issue):D152–7.

    Article  CAS  PubMed  Google Scholar 

  33. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23):3150–2.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence miRNAs using deep sequencing data. Nucleic Acids Res. 2014; 42(Database issue):D68–73.

    Article  CAS  PubMed  Google Scholar 

  35. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25(17):3389–402.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Yang JY, Chen X. Improving taxonomy-based protein fold recognition by using global and local features. Proteins Struct Funct Bioinforma. 2011; 79(7):2053–64.

    Article  CAS  Google Scholar 

  37. Childs L, Nikoloski Z, May P, Walther D. Identification and classification of ncRNA molecules using graph properties. Nucleic Acids Res. 2009; 37(9):e66.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  38. Patthy L. Detecting homology of distantly related proteins with consensus sequences. J Mol Biol. 1987; 198(4):567–77.

    Article  CAS  PubMed  Google Scholar 

  39. Fera D, Kim N, Shiffeldrim N, Zorn J, Laserson U, Gan HH, Schlick T. RAG: RNA-As-Graphs web resource. BMC Bioinformatics. 2004; 5(1):1–9.

    Article  CAS  Google Scholar 

  40. Gan HH, Fera D, Zorn J. RAG: RNA-As-Graphs database-concepts, analysis, and features. Bioinformatics. 2004; 20(8):1285–91.

    Article  CAS  PubMed  Google Scholar 

  41. Lorenz R, Bernhart SH, Zu Siederdissen CH, Tafer H, Flamm C, Stadler PF, Hofacker IL, Siederdissen C. ViennaRNA Package 2.0. Algoritm Mol Biol. 2011; 6(1):26.

    Article  Google Scholar 

  42. Chen YL, Li QZ. Prediction of apoptosis protein subcellular location using improved hybrid approach and pseudo-amino acid composition. J Theor Biol. 2007; 248(2):377–81.

    Article  CAS  PubMed  Google Scholar 

  43. Yu ZG, Anh V, Wang Y, Mao D, Wanliss J. Modeling and simulation of the horizontal component of the geomagnetic field by fractional stochastic differential equations in conjunction with empirical mode decomposition. J Geophys Res. 2010; 115:A10219.

    Article  Google Scholar 

  44. Han GS, Yu ZG, Anh V, Krishnajith D, Tian YC. An ensemble method for predicting subnuclear localizations from primary protein structures. PLoS ONE. 2013; 8(2):e57225.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Friedel M, Nikolajewa S, Suhnel J, Wilhelm T. DiProDB: a database for dinucleotide properties. Nucleic Acids Res. 2009; 37(Database issue):D37–40.

    Article  CAS  PubMed  Google Scholar 

  46. Almuallim H, Dietterich TG. Learning with many irrelevant features. In: AAAI’91 Proceedings of the ninth National conference on Artificial intelligence. Anaheim: AAAI Press: 1991. p. 547–52.

    Google Scholar 

  47. John GH, Kohavi R, Pfleger K. Irrelevant Features and the Subset Selection Problem. Eleventh International Conference on International Conference on Machine Learning. New Brunswick: Morgan Kaufmann Publishers Inc.; 1994, pp. 121–9.

    Google Scholar 

  48. Huang T, Shi XH, Wang P, He Z, Feng KY, Hu L, Kong X, Li YX, Cai YD, Chou KC. Analysis and Prediction of the Metabolic Stability of Proteins Based on Their Sequential Features, Subcellular Locations and Interaction Networks. PLoS ONE. 2010; 5(6):e10972.

    Article  PubMed  PubMed Central  CAS  Google Scholar 

  49. Vapnik VN, Vapnik V. Statistical learning theory. New York: Wiley; 1998.

    Google Scholar 

  50. Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011; 2(3):1–27.

    Article  Google Scholar 

  51. Cristianini N, Taylor JS. An introduction to support vector machines and other kernel-based methods. Cambridge: Cambridge University Press; 2000.

    Book  Google Scholar 

  52. Chou KC. A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins Struct Funct Bioinforma. 1995; 21(4):319–44.

    Article  CAS  Google Scholar 

  53. Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol. 2011; 273(1):236–47.

    Article  CAS  PubMed  Google Scholar 

  54. Chen J, Liu H, Yang J, Chou KC. Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids. 2007; 33(3):423–8.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

Not applicable.

Funding

This project was supported by National Natural Science Foundation of China (Grant Nos. 11871061 and 11401503); Chinese Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT)(Grant No. IRT_15R58); Research Foundation of Education Commission of Hunan Province of China (Grant No. 17K090); Innovation project of Hunan Province of China (Grant No. Cx2016B252); Collaborative Research project for Overseas Scholars (including Hong Kong and Macau) of National Natural Science Foundation of China (Grant No. 61828203)and partially by the Australian Research Council Grant DP160101366.

Availability of data and materials

All of the sequence data was obtained from www.mirbase.org.

About this supplement

This article has been published as part of BMC Bioinformatics Volume 19 Supplement 19, 2018: Proceedings of the 29th International Conference on Genome Informatics (GIW 2018): bioinformatics. The full contents of the supplement are available online at https://0-bmcbioinformatics-biomedcentral-com.brum.beds.ac.uk/articles/supplements/volume-19-supplement-19.

Author information

Authors and Affiliations

Authors

Contributions

YM contributed to the conception and design of the study, downloaded the datasets, analyzed the results and has been involved in programming. ZY gave the ideas and supervised the project. GH has been discussing on the results. JL gave the ideas and revised the manuscript. VA revised the manuscript. All authors contributed to the writing. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zuguo Yu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1

Describes the set of all the features extracted, which is 591. (XLSX 19.7 kb)

Additional file 2

Describes the optimal feature set, which is 268. (XLSX 15.4 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Ma, Y., Yu, Z., Han, G. et al. Identification of pre-microRNAs by characterizing their sequence order evolution information and secondary structure graphs. BMC Bioinformatics 19 (Suppl 19), 521 (2018). https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-018-2518-2

Download citation

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12859-018-2518-2

Keywords