Skip to main content

Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling



Nowadays, many public repositories containing large microarray gene expression datasets are available. However, the problem lies in the fact that microarray technology are less powerful and accurate than more recent Next Generation Sequencing technologies, such as RNA-Seq. In any case, information from microarrays is truthful and robust, thus it can be exploited through the integration of microarray data with RNA-Seq data. Additionally, information extraction and acquisition of large number of samples in RNA-Seq still entails very high costs in terms of time and computational resources.This paper proposes a new model to find the gene signature of breast cancer cell lines through the integration of heterogeneous data from different breast cancer datasets, obtained from microarray and RNA-Seq technologies. Consequently, data integration is expected to provide a more robust statistical significance to the results obtained. Finally, a classification method is proposed in order to test the robustness of the Differentially Expressed Genes when unseen data is presented for diagnosis.


The proposed data integration allows analyzing gene expression samples coming from different technologies. The most significant genes of the whole integrated data were obtained through the intersection of the three gene sets, corresponding to the identified expressed genes within the microarray data itself, within the RNA-Seq data itself, and within the integrated data from both technologies. This intersection reveals 98 possible technology-independent biomarkers. Two different heterogeneous datasets were distinguished for the classification tasks: a training dataset for gene expression identification and classifier validation, and a test dataset with unseen data for testing the classifier. Both of them achieved great classification accuracies, therefore confirming the validity of the obtained set of genes as possible biomarkers for breast cancer. Through a feature selection process, a final small subset made up by six genes was considered for breast cancer diagnosis.


This work proposes a novel data integration stage in the traditional gene expression analysis pipeline through the combination of heterogeneous data from microarrays and RNA-Seq technologies. Available samples have been successfully classified using a subset of six genes obtained by a feature selection method. Consequently, a new classification and diagnosis tool was built and its performance was validated using previously unseen samples.


Cancer is the second leading cause of death worldwide, just behind cardiovascular disease. Specifically, breast cancer is one of the five most dangerous cancers in the world, showing a high mortality rate according to World Health Organization (WHO), and being the cancer with the highest impact among the female population [1]. Nowadays, many breast cancer diagnoses are performed when a patient presents several related symptoms, thus increasing the mortality risk. If the cancer has spread, treatment becomes more difficult, and generally the chances of surviving are significantly lower. However, cancers that are diagnosed at an early stage are more likely to be treated successfully. Therefore, it is primordial to find biomarkers that allow an early diagnosis of breast cancer. Two sequencing technologies, microarray and RNA-Seq, have been used for obtaining gene expression. They are briefly described next.

Microarray technology

Microarray has been the main sequencing technology used in the last two decades until the arrival of Next Generation Sequencing techniques. The most extended microarray platforms are Affymetrix [2] and Illumina [3], leading the second one the RNA-Seq sequencing technology market. Nevertheless, there are other very important microarray manufacturers such as Agilent [4], Exiqon [5] or Taqman [6]. A high simultaneous number of genes can be measured at expression level from the use of microarrays. The expression values are achieved by means of microscopic DNA spots attached to a solid surface which have followed a hybridization process. Once this process is completed, it is possible to read the expression values with a laser, and consequently store the quantification levels in a.CEL file [7].

RNA-Seq technology

As a natural evolutionary step in the treatment of biological information from DNA, RNA-Seq is gradually replacing the widespread use of microarrays. Although its application was originally intended for genomic transcription study, it also allows achieving a mapping between the levels of transcription and gene expression [8]. In this sense, its combination with other functional genomics methods allows enhancing the analysis of gene expression. This is achieved through the quantification of the total number of reads that are mapped to each locus in the transcriptome assembly step. RNA-Seq read counts robustness has been validated against predecessor technologies such as microarrays or quantitative polymerase chain reaction (qPCR) [9].

Comparison between both technologies

RNA-Seq offers an important number of advantages over microarrays, although the cost of RNA-Seq experiments is also higher than in microarray technology nowadays:

  • RNA-Seq allows detecting the variation of a single nucleotide.

  • RNA-Seq does not require genomic sequence knowledgement.

  • RNA-Seq provides quantitative expression levels.

  • RNA-Seq provides isoform-level expression measurements.

  • RNA-Seq offers a broader dynamic range than microarrays.

In spite of these advantages, microarrays are still used due to their lower costs. Besides, as microarrays have been used for a longer period, there exist many robust statistical and operational methods for their processing [1015].

There are many significant microarray experiments already available to the research community, and there is also even a high number of microarray datasets that have not been analyzed so far. These datasets might have information that could reveal important facts and candidate biomarkers. In any case, there is no doubt that RNA-Seq is the present technology, but it can also take advantage of the available data from microarray technology. As Nookaew et al. explained [16], there is a high consistency between RNA-Seq and microarray, which encourages to continue using microarray as a versatile tool for gene expression analysis.

The main objective of this work is to find possible breast cancer biomarkers from patient and control samples acquired via NCBI GEO web platform [17]. To this end, an exhaustive search has been done in order to obtain statistically significant samples from both microarray and RNA-Seq series. Two datasets have been considered in this study, one for training and one for testing. The training dataset has been used to extract the Differentially Expressed Genes (DEGs), and to design a classifier. The test dataset has been considered for the assessment of the DEGs selection and classification processes.

In the case of RNA-Seq samples, cqn package [18] has been used to calculate the expression values from the BAM/SAM file. Once the expression values were available, they were merged and normalized with the microarray data. Gene expression was achieved through a joint study of all series that allowed integration among microarrays and RNA-Seq data.

Most of the previous studies in the selection of biomarkers perform this process through statistical tools over a given dataset and a given technology. However, this work takes an innovative step forward by combining different datasets and microarray technologies together with RNA-Seq data. Furthermore, this research also builds an smart breast cancer classifier with the aim of achieving early diagnosis when unlabeled samples are presented. To this end, the minimum-Redundancy Maximum-Relevance (mRMR) [19] feature selection algorithm was applied in order to select the most relevant genes to perform the classification. Also, three different classification algorithms have been implemented and their results compared. The first classifier makes use of Support Vector Machines (SVM) [20, 21]. Alternatively, Random Forest (RF) [22] and k-Nearest Neighbor (k-NN) [23] classifiers have also been designed.

This paper has been structured as follows. This section has shown the introduction and state of the art of this work. Next section explains the methodology followed in this study. It begins by describing the available data series that have been used for this research. Later, the pipeline for processing and classifying the data is shown. An innovative step for automatic sample classification is described using machine learning techniques. The results and discussion section shows the integrated gene expression, revealing those genes that remain unchanged regardless of the technology used in the gene expression identification process. Furthermore, this section underlines the validity of the proposed approach and its utility in breast cancer early diagnosis using the developed classification tool. Finally, the conclusions section summarizes the most important contributions of this study for breast cancer diagnosis and genetic profiling.


Microarray and RNA-Seq series

The first issue that must be addressed concerns the definition of the kind of sample that is going to be used, along with the determination of the tissue or cell that the sample comes from. As a result, a wide search through the NCBI-GEO platform has been done with the objective of finding datasets belonging to both the selected cell lines and the considered technologies. In this study, control samples have been selected from the MCF10A cell line [24]. This cell line is classified as a healthy non-tumorigenic epithelial cell line. Various breast cancer cell lines were selected as cancer samples (MCF7 and HS578T) [25, 26]. Besides, not every sample from each of the series has been selected, as there are samples that do not belong to the cell lines required, or they have been treated with some kind of drug that could produce some noise in the final results.

Once the requirements for selecting the desired samples were established, an exhaustive search of Affymetrix and Illumina series was carried out for microarray data. On the other hand, RNA-Seq data was selected from Illumina HiSeq technology. Only datasets containing the above-mentioned cell lines were selected. Table 1 summarizes the selected series for this study. As it can be seen, the NCBI GEO database offers a larger availability of microarray data when compared with the number of RNA-Seq samples. Two separated supersets have been created, one for training predictive models, and the other for their testing, both containing microarray as well as RNA-Seq samples. The training dataset is made up of 108 microarray samples: 65 samples from Affymetrix, 43 from Illumina, and 24 RNA-Seq samples. On the other hand, the test set is made up of 120 samples of microarray (108 of Illumina and 12 of Affymetrix) as well as 6 samples of RNA-Seq. These series are publicly available at where S.NAME is the name of each series at NCBI GEO.

Table 1 Description of the training and test series considered with number of samples/outliers

Microarray pipeline

The first step in the methodology for microarray data is to put together all the selected series, independently of their technology (Affimetrix or Illumina). Consequently, a quality analysis assessment was performed across the series, in order to detect and consequently remove any possible outlier. This outliers detection and removal was performed through arrayQualityMetrics R package [27], which computes the Kolmogorov-Smirnov statistic K a between the distribution of each array and the distribution of the pooled data. Next, sample normalization was performed using the limma R package normalizedBetweenArrays function [10], in order to remove dynamic expression variability between samples. Once the samples were normalized, the expressed gene values were obtained. Figure 1 outlines the microarray data analysis pipeline.

Fig. 1

Microarray gene expression pipeline

RNA-Seq pipeline

The pipeline proposed by Anders et al. [28] has been followed for the extraction of RNA-Seq data as it is shown in Fig. 2. Starting from the SRA original files, several tools like sra-toolkit [29], tophat2 [30], bowtie2 [31], samtools [32] and htseq [33] have been used to obtain the read count for each gene. Once the read count files were obtained, the expression values were calculated using the cqn and the NOISeq R packages [34].

Fig. 2

RNA-Seq gene expression integration pipeline

Integrated pipeline

A new data processing pipeline is proposed in this work which extends the classical gene expression data analysis pipeline in two ways. On one hand, this pipeline integrates data from both microarray and RNA-Seq technologies. Furthermore, once the integration has been carried out, a gene selection process and an assessment through a classification process were performed, using separated training and test datasets. The workflow of the entire pipeline is shown in Fig. 3.

Fig. 3

Integrated pipeline followed for this study

In a first step, sample integration of data from both microarrays and RNA-Seq technologies has been carried out using the merge function from base R package. Once the gene expression values have been obtained for each technology separately, a normalization of all joint technologies was applied using the normalizedBetweenArrays function cited before over all datasets available (see Table 1). These tasks are essential in order to have available a right normalization of the biological data and its subsequent processing [35, 36]. We have to note that each of the series in Table 1 were originally differently quantified depending on the respective technology and manufacturer.

The next steps in the pipeline for gene expression levels calculation and extraction of DEGs, were made only over the training dataset, thus leaving the test dataset for later assessment.

Gene extraction was performed at different levels using the limma R package, both at individual levels (microarray data and RNA-Seq data separately) and at integrated level (joined microarray and RNA-Seq data).


Once a set of possible target genes which can be considered as biomarkers for breast cancer were identified, we proceeded to assess the results through three different classification technologies: SVM, RF and k-NN. The main objective of this stage is the validation of the behavior of the selected genes at the arrival of new unseen samples. The selected genes and the training dataset were used for designing the classification models, which were later evaluated over the test dataset.

  • SVM: These models are supervised learning algorithms which assign categories to new samples. This algorithm is based on the idea of separating data from different categories through a hyperplane. The algorithm calculates the maximum-margin hyperplane that maximizes the distance between different classes. For overlapped data, this type of models turn a reduced space into a higher dimensional space using a kernel function, in order to perform the classification in this new space. Moreover, the algorithm tolerates making classification errors, which are controlled by the γ hyperparameter, in order to improve the generalization capability of the model [20, 21].

  • RF: This method grows many single classification trees with the purpose of building a forest of classification trees. For the classification, the algorithm assigns the input vector to be classified to each tree of the forest. Once that each individual tree performs classification, the forest chooses the class having the largest number of votes over all the trees. After each tree is built, all of the data are run down the tree, and proximities are computed for each pair of cases. If two cases occupy the same terminal node, their proximity is increased by one. At the end of the run, the proximities are normalized by dividing by the number of trees. Proximities are used with the aim of replacing missing data, locating outliers and producing illuminating low-dimensional views of the data [22, 37].

  • k-NN: This supervised method is based on assigning to a new unseen sample, the class corresponding to the predominant one in the k nearest neighbors (most similar samples) from the known labeled data. It is a well-known fast and easy-to-use technique which however provides a comparable performance to other well-known more complex techniques [23, 38].

Ten-fold cross-validation was used over the training dataset to obtain the optimal hyperparameters for these methodologies: σ (kernel width) and γ for SVMs, number of trees for RF and k for k-NN.

Gene ranking: mRMR

Additionally, a feature selection process was performed through the mRMR [19] algorithm over the candidate biomarkers, with the objective of finding a reduced subset of genes that gives similar classification accuracy than the initial complete set of genes. In this way, the reduction of the number of genes allows the creation of a more simple and interpretable classifier, as well as more computationally efficient, while maintaining the robustness of the method. This algorithm creates a ranking of features, DEGs in our case. mRMR algorithm uses mutual information as the criterion for variables relevance, computing relevance and redundancy among variables (i.e. genes), and sorting them so that they bring largest relevance with respect to the class (cancer/no cancer) and, at the same time, they have lowest redundancy among themselves. Therefore, this algorithm will rank in first position the gene that contains the maximum relevance information, but the following genes will provide also minimum redundant information (apart from maximum relevance as regards to the class) with respect to the already selected genes, and so forth.

Results and discussion

This section will focus on presenting and discussing the obtained results coming from the experimentation process followed in this study. It is divided into two subsections: first subsection shows the results for the process of obtaining the set of DEGs; while second subsection will show the results of the classification process making use of the former set of genes.

Integrated gene expression

This subsection describes the process and results of the DEGs extraction. As it was previously stated in the methods section, series belonging to different technologies and platforms have been integrated. The objective of this integration is twofold: first, to increase the number of samples that will be used as input to our method, thus improving the robustness and stability of the results. Second, the obtained results will be independent of a single technology, as they proceed from different sources. The presence of RNA-Seq samples increases the dynamical midrange of the genes, making the results more accurate and robust. Furthermore, the number of available samples is greatly increased thanks to the availability of microarray data stored in public repositories.

When working with heterogeneous data, normalization is one of the most sensitive steps in the whole process, as a mistake in this step could cause interpretation errors, and could lead to a false set of expressed genes. Figure 4 shows the need of normalization for both training and test datasets together due to the difference of the dynamic range between samples. To this end, both training and test datasets have been subjected to a joint normalization using the normalizeBetweenArrays function from the limma R package, thus achieving the same dynamic range for all the samples. Figure 5 shows the results once the joint normalization was applied. As it can be seen, the dynamic range between samples has been corrected. In the next step, only the training dataset will be used in the process for identifying the DEGs.

Fig. 4

Expression profile of training and test datasets before normalization

Fig. 5

Expression profile of training and test datasets after normalization

We therefore proceeded to identify the DEGs both for each technology separately (microarray & RNA-Seq) and for the integrated dataset. Several restrictions were imposed in order to determine the expressed genes: the fold change in the expression values of the selected genes was set to be greater or equal than 2 and the p-value was set to be less or equal than 0.001. These constraints ensure that the chosen expressed genes are statistically significant, therefore showing different behavior between patient and healthy samples. These restrictions were applied to the three microarray, RNA-Seq and integrated datasets, so that three sets with different expressed genes were obtained. Finally, through the intersection of the three groups of expressed genes, a total of common 98 DEGs were found. These genes comply with the restrictions and they are differentially expressed in all datasets as the intersection shows (Fig. 6). Consequently, the obtained genes are differentially expressed independently of the gene expression technology, excluding possible noisy genes.

Fig. 6

Intersection of expressed genes in RNA-Seq, microarray and the integrated dataset

A boxplot of the mean gene expression values of the 98 DEGs for the samples in the training dataset is shown in Fig. 7. It shows a clear differentiation between the average value of the cancer cell lines samples and the average value of the MCF10A non-cancer cell line samples. Furthermore, the statistical information of the intersection set of 98 DEGs is shown in Table 2.

Fig. 7

Gene expression values boxplot for the set of 98 expressed genes. Figure shows significant differences between expression values for MCF7 and HS578T cancer cell lines and MCF10A non-cancer cell line

Table 2 List of 98 expressed genes obtained with limma as the intersection of microarray, RNA-Seq and integrated dataset

Table 2 shows five statistics values computed by the limma package (logFC, t-statistic, p-value, adj.p.val. and B). The log-fold change (logFC) represents the difference between breast cancer and control expressed values. If l o g F C≥2 it means that there exists significant differences between cancer and control values. The second value in Table 2 is the moderated t-statistic, which is the ratio between the log2-fold change value for each gene and it respective standard error. The next value is the p-value (p-val) which represents the probability of obtaining a result equal or higher than what it was observed when the null hypothesis is true. The adjusted p-value indicates which proportion of comparisons within a family of comparisons (hypothesis tests) are significantly different. The B-statistic (B) is the log-odds that a given gene is differentially expressed.

Figure 8 depicts a hierarchical clustering using the list of 98 invariant expressed genes. As it can be seen, the cluster is split into two group of samples, one belonging to control samples and the other to breast cancer samples. Thus verifying that the obtained genes are robust and clearly differentiating.

Fig. 8

Hierarchical cluster using the 98 invariant expressed genes

Classification results

Once the DEGs were identified in the previous subsection, this subsection assesses the performance of these genes through a classification process when new samples are presented. For that purpose, the classification algorithms SVM, RF and k-NN have been implemented. The whole training dataset formed by 132 samples has been used as the input data for the classifier (Table 1). The 98 DEGs values were normalized to range between [-1,1], and have been chosen as classification features, ordered by a mutual information-based ranking provided by the mRMR algorithm. Moreover, for a further assessment of the classifier against new unseen samples, a test dataset made up of 126 samples has been equally normalized and used for testing (Table 1).

Following the proposed integrated pipeline in this work (see Fig. 3), once the samples were correctly integrated and the 98 DEGs were found, a classification method using these genes has been applied. Results for all the algorithms in the validation stage using the 98 genes reached an accuracy equal to 100%. Therefore, all samples belonging to the training dataset were successfully classified. When the classifier using 98 genes was applied to test samples, an accuracy above 95% was reached by the three algorithms, rising up to a 97% in the case of SVMs and RFs, thus confirming the robustness of the proposed pipeline approach (see Table 3).

Table 3 Training and test classification accuracies for SVMs, RFs and k-NN algorithms

Afterwards, a feature selection process has been applied in order to reduce the cardinality of the 98 DEGs. As a result, the mRMR algorithm returned a gene ranking based on mutual information. Figure 9 shows the validation (10-CV over the training dataset) and test results using the three algorithms: SVMs, RFs and k-NN. These validation results are above 98% using only the first gene of the ranking for classification for the three algorithms, and above 99.2% using a reduced set of the first six genes in the ranking. Moreover, classification results when using the new 126 unseen samples of the test set and the three methods, rose up to coherent results with an accuracy of 96.8% using SVMs, 94.1% using k-NN, nevertheless lower for RFs with a 87.4%. Therefore, the classifier performs in a similar way to the behavior observed in the validation results for two of the classifiers. Consequently, the main set of 98 DEGs was reduced to the later six genes set, which allow discerning if new samples are cancerous or not, with an expected error around a 3.2% when using a SVM classifier.

Fig. 9

Validation and test classification results with SVM, RF and k-NN using the most relevant genes obtained by mRMR

These differences in performance among classification techniques are usual in this type of problems, and a number of papers comparing classification techniques for biological data can be found in the literature [37, 3941]. In the results above-mentioned, using only 6 genes, SVMs attains an optimal performance near that attained using the complete set of 98 genes. This behavior is also seen in the k-NN technique, although with a lower performance. RF on the other hand obtains similar results than SVMs when the complete set of 98 genes are used, but fails to design a simpler classifier with a low number of genes with optimal performance [39, 40]. Thus, these results support the design of an optimal classifier based on SVMs with only six genes attaining the excellent aforementioned results.

Finally, once the potential biomarker genes were identified as the reduced subset of six genes, a literature review and biological study was done in order to reveal the relation between those genes and their involvement in breast cancer (Table 4). The first five of these six genes have been formerly reported as genes involved in breast cancer, whilst the sixth gene is present in breast cancerous tissue, although with no evidence of a direct implication with breast cancer development. This means that the results attained by the proposed integrated pipeline are coherent, as the reduced subset of six genes is formed by genes related with breast cancer. Furthermore, these genes can be used for classification and diagnosis purposes over new unseen samples. They can be designated as a new breast cancer biomarker signature when these types of cell lines data are present.

Table 4 Relationship of the top 6 expressed genes with breast cancer

Figure 10 shows a hierarchical cluster built with the small six genes subset. Two distinct groups are clearly identified, as it also happened in Fig. 8: one matching control samples and the other matching breast cancer samples. Therefore, this indicates that the expression profiles of these genes constitute a possible diagnosis criteria for breast cancer.

Fig. 10

Hierarchical cluster over healthy and breast cancer samples using the top 6 genes

Figure 11 shows a boxplot for each of the six genes representing the average expression value for the cancerous samples (red), and control samples (green). As it can be seen, average expression values between cancerous and control samples are clearly differentiated, thus reaffirming their potential as breast cancer biomarkers.

Fig. 11

Average expression value boxplots of the six most relevant genes obtained in this study


This work has presented the possibility of integrating data from different gene expression analysis technologies. On the one hand, microarrays, which have been widely used in the last two decades and, on the other hand, RNA-Seq that is the technology meant to replace microarrays definitely.

An exhaustive search from the NCBI-GEO public repository has been performed in order to collect breast cancer samples from both technologies. The intersection of DEGs in microarray, RNA-Seq, and the integrated dataset, has allowed identifying a set of candidates biomarkers for diagnosis of this disease.

Thereafter, feature selection through mRMR was applied in order to select the most relevant biomarkers subset. Three different classification models (SVMs, RFs, and k-NN) were designed from the training dataset and the selected DEGs and compared. These classifier were validated with the test dataset achieving outstanding results for the three algorithms when the complete set of 98 DEGs were used.

In conclusion, results show that the expressed genes can be designated as robust biomarkers for breast cancer diagnosis when specific cell lines samples are used. Furthermore, even with a small subset of six of those genes, a great validation accuracy was reached (99%). Also, classification results over new unseen data show great accuracy, specially over SVM classification (96.8%). Five of these top six genes have been formerly reported as genes that show biological relation with breast cancer, which reinforce the designation of the expression profiles of these genes for breast cancer diagnosis.



Differentially expressed genes


k-nearest neighbor


Minimum-redundancy maximum-relevance


Random forest


Support vector machines


  1. 1

    OMS. Women’s health. 2013.

  2. 2

    Gohlmann H, Talloen W. Gene Expression Studies Using Affymetrix Microarrays: CRC Press.

  3. 3

    Illumina. Illumina Genes Expression arrays. 2009.

  4. 4

    Zahurak M, Parmigiani G, Yu W, Scharpf RB, Berman D, Schaeffer E, Shabbeer S, Cope L. Pre-processing agilent microarray data. BMC Bioinformatics. 2007; 8(1):142.

    Article  PubMed  PubMed Central  Google Scholar 

  5. 5

    Exiqon. Exiqon Genes Expression arrays. 2009.

  6. 6

    Taqman. Taqman Genes Expression arrays. 2009.

  7. 7

    Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary dna microarray. Science. 1995; 270(5235):467.

    CAS  Article  PubMed  Google Scholar 

  8. 8

    Wang Z, Gerstein M, Snyder M. Rna-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009; 10(1):57–63.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  9. 9

    Peirson SN, Butler JN. Quantitative polymerase chain reaction. Methods Mol Biol. 2007; 362:349–362. doi:10.1385/1-59745-257-2:349.

    CAS  Article  PubMed  Google Scholar 

  10. 10

    Smyth GK. Limma: linear models for microarray data. In: Bioinformatics and computational biology solutions using R and Bioconductor. Statistics for Biology and Health. New York: Springer. p. 397–420.

  11. 11

    Kerr MK, Churchill GA. Statistical design and the analysis of gene expression microarray data. Genet Res. 2001; 77(2):123–8.

    CAS  PubMed  Google Scholar 

  12. 12

    Sturn A, Quackenbush J, Trajanoski Z. Genesis: cluster analysis of microarray data. Bioinformatics. 2002; 18(1):207–8.

    CAS  Article  PubMed  Google Scholar 

  13. 13

    Hong F, Breitling R, McEntee CW, Wittner BS, Nemhauser JL, Chory J. Rankprod: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics. 2006; 22(22):2825–7.

    CAS  Article  PubMed  Google Scholar 

  14. 14

    Parmigiani G, Garrett ES, Irizarry RA, Zeger SL. The analysis of gene expression data: an overview of methods and software. In: The analysis of gene expression data. New York: Springer: 2003. p. 1–45.

    Google Scholar 

  15. 15

    Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W. Biomart and bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005; 21(16):3439–40.

    CAS  Article  PubMed  Google Scholar 

  16. 16

    Nookaew I, Papini M, Pornputtapong N, Scalcinati G, Fagerberg L, Uhlén M, Nielsen J. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: A case study in Saccharomyces cerevisiae. Nucleic Acids Res. 2012; 40(20):10084–10097. doi:10.1093/nar/gks804.

  17. 17

    Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R. Ncbi geo: mining tens of millions of expression profiles—database and tools update. Nucleic Acids Res. 2007; 35(suppl 1):760–5.

    Article  Google Scholar 

  18. 18

    Hansen KD, Irizarry RA, Zhijin W. Removing technical variability in rna-seq data using conditional quantile normalization. Biostatistics. 2012; 13(2):204–16.

    Article  PubMed  PubMed Central  Google Scholar 

  19. 19

    Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. Proceedings of the 2003 IEEE Bioinformatics Conference, CSB 2003. 2003:523–528. doi:10.1109/CSB.2003.1227396.

  20. 20

    Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995; 20(3):273–97.

    Google Scholar 

  21. 21

    Noble WS. What is a support vector machine?Nat Biotechnol. 2006; 24:1565–7.

    CAS  Article  PubMed  Google Scholar 

  22. 22

    Ho TK. Random decision forests. In: Document Analysis and Recognition, 1995., Proceedings of the Third International Conference On. vol. 1. IEEE: 1995. p. 278–282.

  23. 23

    Parry R, Jones W, Stokes T, Phan J, Moffitt R, Fang H, Shi L, Oberthuer A, Fischer M, Tong W, et al.k-nearest neighbor models for microarray gene expression analysis and clinical outcome prediction. Pharmacogenomics J. 2010; 10(4):292.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  24. 24

    Soule HD, Maloney TM, Wolman SR, Peterson WD, Brenz R, McGrath CM, Russo J, Pauley RJ, Jones RF, Brooks S. Isolation and characterization of a spontaneously immortalized human breast epithelial cell line, mcf-10. Cancer Res. 1990; 50(18):6075–86.

    CAS  PubMed  Google Scholar 

  25. 25

    Soule H, Vazquez J, Long A, Albert S, Brennan M. A human cell line from a pleural effusion derived from a breast carcinoma. J Natl Cancer Inst. 1973; 51(5):1409–16.

    CAS  Article  PubMed  Google Scholar 

  26. 26

    Hackett AJ, Smith HS, Springer EL, Owens RB, Nelson-Rees WA, Riggs JL, Gardner MB. Two syngeneic cell lines from human breast tissue: the aneuploid mammary epithelial (hs578t) and the diploid myoepithelial (hs578bst) cell lines. J Natl Cancer Inst. 1977; 58(6):1795–806.

    CAS  Article  PubMed  Google Scholar 

  27. 27

    Kauffmann A, Gentleman R, Huber W. arrayqualitymetrics - a bioconductor package for quality assessment of microarray data. Bioinformatics. 2009; 25(3):415–6.

    CAS  Article  PubMed  Google Scholar 

  28. 28

    Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, Robinson MD. Count-based differential expression analysis of rna sequencing data using r and bioconductor. Nat Protoc. 2013; 8(9):1765–86.

    Article  PubMed  Google Scholar 

  29. 29

    Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2011; 39(SUPPL. 1):D19–D21. doi:10.1093/nar/gkq1019.

  30. 30

    Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. Tophat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013; 14(4):36.

    Article  Google Scholar 

  31. 31

    Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods. 2012; 9(4):357–9.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  32. 32

    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, et al.The sequence alignment/map format and samtools. Bioinformatics. 2009; 25(16):2078–9.

    Article  PubMed  PubMed Central  Google Scholar 

  33. 33

    Anders S, Pyl PT, Huber W. HTSeq–A Python framework to work with high-throughput sequencing data. Bioinformatics. 2015; 31(2):166–169. doi:10.1093/bioinformatics/btu638.

  34. 34

    Tarazona S, García F, Ferrer A, Dopazo J, Conesa A. Noiseq: a rna-seq differential expression method robust for sequencing depth biases. EMBnet J. 2012; 17(B):18.

    Article  Google Scholar 

  35. 35

    Dobbin KK, Simon RM. Optimally splitting cases for training and testing high dimensional classifiers. BMC Med Genet. 2011; 4(1):31.

    Google Scholar 

  36. 36

    Önskog J, Freyhult E, Landfors M, Rydén P, Hvidsten TR. Classification of microarrays; synergistic effects between normalization, gene selection and machine learning. BMC Bioinformatics. 2011; 12(1):390.

    Article  PubMed  PubMed Central  Google Scholar 

  37. 37

    Díaz-Uriarte R, De Andres SA. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006; 7(1):3.

    Article  PubMed  PubMed Central  Google Scholar 

  38. 38

    Wu W, Xing EP, Myers C, Mian IS, Bissell MJ. Evaluation of normalization methods for cdna microarray data by k-nn classification. BMC Bioinformatics. 2005; 6(1):191.

    Article  PubMed  PubMed Central  Google Scholar 

  39. 39

    Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008; 9(1):319.

    Article  PubMed  PubMed Central  Google Scholar 

  40. 40

    Statnikov A, Aliferis CF. Are random forests better than support vector machines for microarray-based cancer classification? In: AMIA annual symposium proceedings, vol. 2007. Chicago: American Medical Informatics Association: 2007. p. 686.

    Google Scholar 

  41. 41

    Cho S-B, Won H-H. Machine learning in DNA microarray analysis for cancer classification. In: Proceedings of the First Asia-Pacific Bioinformatics Conference on Bioinformatics 2003-Volume 19. Australia: Australian Computer Society, Inc.: 2003. p. 189–98.

    Google Scholar 

  42. 42

    Kim TH, Chang JS, Park KS, Park J, Kim N, Lee JI, Kong ID. Effects of exercise training on circulating levels of dickkpof-1 and secreted frizzled-related protein-1 in breast cancer survivors: A pilot single-blind randomized controlled trial. PLoS One. 2017; 12(2):0171771. doi:10.1371/journal.pone.0171771.

    Google Scholar 

  43. 43

    Kong LY, Xue M, Zhang QC, Su CF. In vivo and in vitro effects of microrna-27a on proliferation, migration and invasion of breast cancer cells through targeting of sfrp1 gene via wnt/beta-catenin signaling pathway. Oncotarget. 2017. doi:10.18632/oncotarget.14662.

  44. 44

    Mitrunen K, Jourenkova N, Kataja V, Eskelinen M, Kosma VM, Benhamou S, Vainio H, Uusitupa M, Hirvonen A. Glutathione s-transferase m1, m3, p1, and t1 genetic polymorphisms and susceptibility to breast cancer. Cancer Epidemiol Biomarkers Prev. 2001; 10(3):229–36.

    CAS  PubMed  Google Scholar 

  45. 45

    Choi JY, Lee KM, Park SK, Noh DY, Ahn SH, Chung HW, Han W, Kim JS, Shin SG, Jang IJ, Yoo KY, Hirvonen A, Kang D. Genetic polymorphisms of sult1a1 and sult1e1 and the risk and survival of breast cancer. Cancer Epidemiol Biomarkers Prev. 2005; 14(5):1090–5. doi:10.1158/1055-9965.EPI-04-0688.

    CAS  Article  PubMed  Google Scholar 

  46. 46

    Xu Y, Liu X, Guo F, Ning Y, Zhi X, Wang X, Chen S, Yin L, Li X. Effect of estrogen sulfation by sult1e1 and papss on the development of estrogen-dependent cancers. Cancer Sci. 2012; 103(6):1000–9. doi:10.1111/j.1349-7006.2012.02258.x.

    CAS  Article  PubMed  Google Scholar 

  47. 47

    Flonta SE, Arena S, Pisacane A, Michieli P, Bardelli A. Expression and functional regulation of myoglobin in epithelial cancers. Am J Pathol. 2009; 175(1):201–6. doi:10.2353/ajpath.2009.081124.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  48. 48

    Kristiansen G, Hu J, Wichmann D, Stiehl DP, Rose M, Gerhardt J, Bohnert A, ten Haaf A, Moch H, Raleigh J, Varia MA, Subarsky P, Scandurra FM, Gnaiger E, Gleixner E, Bicker A, Gassmann M, Hankeln T, Dahl E, Gorr TA. Endogenous myoglobin in breast cancer is hypoxia-inducible by alternative transcription and functions to impair mitochondrial activity: a role in tumor suppression?J Biol Chem. 2011; 286(50):43417–28. doi:10.1074/jbc.M111.227553.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  49. 49

    Bicker A, Brahmer AM, Meller S, Kristiansen G, Gorr TA, Hankeln T. The distinct gene regulatory network of myoglobin in prostate and breast cancer. PLoS One. 2015; 10(11):0142662. doi:10.1371/journal.pone.0142662.

    Article  Google Scholar 

  50. 50

    Ai L, Kim WJ, Alpay M, Tang M, Pardo CE, Hatakeyama S, May WS, Kladde MP, Heldermon CD, Siegel EM, Brown KD. Trim29 suppresses twist1 and invasive breast cancer behavior. Cancer Res. 2014; 74(17):4875–87. doi:10.1158/0008-5472.CAN-13-3579.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

Download references


Not applicable.


This work was supported by Project TIN2015-71873-R (Spanish Ministry of Economy and Competitiveness -MINECO- and the European Regional Development Fund -ERDF). The funding body didn’t play any role in the design or conclusion of this study.

Availability of data and materials

All data generated or analyzed during this study are included in this published article and its supplementary information files.

Ethics and consent to participate

Not applicable.

Author information




DCS is the main author of this research and the manuscript. JMGG and DCS analyzed the data. BSRA did the study about the 6 top genes. LJHM, FRR and IRR conducted the experiments. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Daniel Castillo.

Ethics declarations

Consent for publication

Not applicable.

Competing interests

Not applicable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Castillo, D., Gálvez, J.M., Herrera, L.J. et al. Integration of RNA-Seq data with heterogeneous microarray data for breast cancer profiling. BMC Bioinformatics 18, 506 (2017).

Download citation


  • RNA-Seq
  • Microarray
  • Breast cancer
  • Cancer
  • SVM
  • Random Forest
  • k-NN
  • Gene expression
  • Classification
  • Integration