- Open Access
PASTA: splice junction identification from RNA-Sequencing data
BMC Bioinformatics volume 14, Article number: 116 (2013)
Next generation transcriptome sequencing (RNA-Seq) is emerging as a powerful experimental tool for the study of alternative splicing and its regulation, but requires ad-hoc analysis methods and tools. PASTA (Patterned Alignments for Splicing and Transcriptome Analysis) is a splice junction detection algorithm specifically designed for RNA-Seq data, relying on a highly accurate alignment strategy and on a combination of heuristic and statistical methods to identify exon-intron junctions with high accuracy.
Comparisons against TopHat and other splice junction prediction software on real and simulated datasets show that PASTA exhibits high specificity and sensitivity, especially at lower coverage levels. Moreover, PASTA is highly configurable and flexible, and can therefore be applied in a wide range of analysis scenarios: it is able to handle both single-end and paired-end reads, it does not rely on the presence of canonical splicing signals, and it uses organism-specific regression models to accurately identify junctions.
PASTA is a highly efficient and sensitive tool to identify splicing junctions from RNA-Seq data. Compared to similar programs, it has the ability to identify a higher number of real splicing junctions, and provides highly annotated output files containing detailed information about their location and characteristics. Accurate junction data in turn facilitates the reconstruction of the splicing isoforms and the analysis of their expression levels, which will be performed by the remaining modules of the PASTA pipeline, still under development. Use of PASTA can therefore enable the large-scale investigation of transcription and alternative splicing.
Alternative splicing (AS) is the process by which a single gene can generate multiple transcripts, and therefore different proteins, through the alternative use of exons. As our knowledge of the structure and organization of genomes increases, AS is being increasingly recognized as a fundamental process at the basis of the molecular diversity and complexity within the cell, of gene regulation, and of a number of critical biological processes ranging from development to disease . Alterations of AS are linked to human diseases ranging from cancer to muscular dystrophies, from neurodegenerative diseases to obesity . A better understanding of the mechanisms that regulate AS and of the relationships between AS and pathological states will provide new, important insights into these diseases, leading to advances in their diagnosis, and opening the way for the development of novel molecular therapies.
High-throughput RNA sequencing technology (RNA-Seq) provides a very large amount of information about transcriptional state, averaged across a population of cells, under the form of short reads that map to genomic regions corresponding to gene transcripts . Reads mapping to the exons of a gene indicate that the corresponding gene is likely expressed. Exons and junctions that are specific to a single isoform of a gene make it possible to identify the particular splicing isoform or isoforms that are expressed , while more accurate isoform reconstruction can be achieved by analyzing reads that map to exon-exon junctions. Finally, the number of reads aligning to a gene region, when normalized to the length of the region, provides an estimate of the expression level of that gene . The results of an RNA-Seq experiment can therefore provide a snapshot of the RNA landscape in a population of cells in a given state, ranging from a catalog of all RNA molecules represented in it, to their exact structure, to their relative expression levels .
We are developing an innovative computational pipeline for the analysis of alternative splicing through RNA-Seq, called PASTA (Patterned Alignments for Splicing and Transcriptome Analysis). The pipeline is composed of three modules, dealing with splice junction detection, isoform reconstruction, and expression level estimation, respectively. Here we describe the first module of the pipeline, whose purpose is to identify all exon-exon junctions that can be inferred from aligned short reads. This is accomplished through a highly accurate splice-junction identification algorithm combined with heuristic methods to score junctions on the basis of biologically relevant features.
The first step in running PASTA consists in aligning short reads obtained from RNA sequencing to the reference genome. This step is usually performed using an existing, fast alignment tool such as Bowtie or Bowtie2 . Since the reads are aligned against the genomic sequence, reads that are entirely contained within exons will align correctly, while reads falling over the junctions between two exons will, in general, fail to align. The main task performed by PASTA is to infer the exact location of exon-intron boundaries using the unaligned reads.
In contrast to the seed and extend method used by the majority of similar programs, PASTA relies on patterned alignments combined with a logistic regression model. Patterned alignments allow PASTA to identify the position of putative splice junctions with high accuracy. The logistic regression model is used to score putative splice junctions according to their biological “context”: for example, the presence of canonical splice signals  and of regulatory elements such as the Branch-Point Sequence (BPS), and the expected distribution of intron sizes.
The PASTA algorithm considers each unaligned read in turn and generates two sets of “patterned” subsequences from it, by splitting it at different cutoff points. If we denote the read length with n and we choose a stepping distance s and a minimum fragment size m, each patterned pair will consist of the sequence from the start of the read to position p and of the sequence from position p to the end of the read, where p ranges from m to n-m in steps of s. All these fragments then undergo a second round of alignment. Fragments longer than a set minimum size are again aligned to the reference sequence with Bowtie, while each remaining short fragment is aligned to the region around the other fragment in the pair, using a local alignment procedure. For example if n=36, the minimum alignment size is 14 and the step size is 4, patterned pairs (14, 22), (18, 18) and (22, 14), as well as single fragments (−, 30), (−, 26) and (26,-), (30,-) will be aligned to the reference genome (the two numbers in parenthesis represent the lengths of the left and right fragments respectively, while ‘-’ indicates a left or a right fragment too short to undergo whole-genome alignment). To identify exon-intron boundaries, PASTA first looks for alignment matches from all patterned pairs (14, 22), (18, 18) and (22, 14). If alignment matches are reported for both the left and right fragment in a given patterned pair (a “double match”), PASTA will optimize the exon-intron boundary adjusting the fragments by 1 or 2 nucleotides in both directions around the cutoff point. For example, if alignment matches are found for patterned pair (14, 22), PASTA will also test the alternative patterned pairs (12, 24), (13, 23) and (15, 21), (16, 20), to identify the patterned pair that minimizes sequence alignment mismatches. Its position will be used as the putative exon-intron boundary.
The alternative case occurs when there is no double match for a given read; in other words, there is no pair of fragments whose left and right components have a unique alignment to the genome. In this case PASTA will pick the fragment with the longest aligned match from individually aligned fragments, and perform a local alignment step in order to determine the location of the other half of the pair.
Suppose for example that the left fragment (26,-) is the longest fragment observed to have an alignment match. PASTA will search the chromosome region adjacent to its position to locate the optimal position of the remaining right fragment of length 10. The size of the region analyzed by the local alignment procedure can be configured by user, with a default of 100,000 base pairs. This step obviously increases the computational cost of the algorithm, since each short read gives rise to a large number of fragment pairs; on the other hand all these fragments are relatively short and can therefore be aligned very efficiently. On average, the second round of alignment takes 2 to 5 times longer than the initial one, but this performance penalty is compensated by a large gain in accuracy, as shown below.
A logistic regression model for splice junction prediction
Because of the uncertainty involved in identifying the precise location of splice junctions from short RNA-Seq reads, PASTA employs a logistic regression model to assign a score to each putative intron produced by a pair of junctions. The model takes into account several factors that characterize the intronic context, such as the presence or absence of canonical splicing signals, the posterior probability of the intron size from the Pareto distribution, the alignment mismatches [9, 10]. In particular, intron sizes can be modeled by a Pareto distribution whose parameters are organism-dependent, since the intron size distribution in a specific organism follows a characteristic curve (Additional file 1: Figure S1).
The logistic regression model can be written as f(Z) = ß 0 + ßZ + ε, where f(Z) is the logistic regression value, ß 0 and ß are coefficient vectors and Z is a vector containing the values of regression factors described above. The coefficients of the logistic regression model and of the Pareto distribution are estimated from existing splice junction annotations for the species under consideration [9, 11]. The PASTA package provides pre-computed models for many commonly studied organisms.
Individual explanatory variables contribute differently to the logistic regression model as shown by regression coefficients (Additional file 1: Table S1). In general, sequence alignment similarity and intron Pareto score contribute most significantly in the organisms under study. In addition, the coefficient of the same explanatory variable is also different by organisms. For example, intron size contributes more to the logistic regression model in maize compared with mouse. This is consistent with the fact that there is a more stringent preference for shorter introns in maize compared with mouse.
Whenever a set of reads gives rise to multiple putative junctions, the logistic regression model is applied to the resulting introns to generate a score for each. The score is computed as 1/(1 + e-f(Z)) which returns a value in the range 0 to 1, with large negative values of f(Z) producing probabilities close to 0, and large positive values producing probabilities close to 1. The putative junction that produces the highest-scoring intron is then chosen as the predicted junction.
The procedure described above is used to determine the location of a putative splice junction given a single short read. In general, a predicted junction will be supported by multiple short reads aligning to the same general region. Therefore, PASTA will cluster putative splice junctions based on their positions. If a cluster contains a single putative junction (i.e., generated by a single short read), PASTA will directly report its position as a predicted junction. If instead the cluster contains several putative junctions, the putative junction with the highest value of the logistic regression model will be used to determine the position of the junction, and the number of putative junctions in the cluster will be reported as the junction coverage.
It is important to note that, although the process here described does not explicitly look for canonical splicing signals, the presence or absence of the canonical splicing signal is one of the variables included in the logistic regression model. By assigning different weights to this variable, the user can therefore determine how strong the preference for canonical splicing junctions should be.
Paired-end RNA-Seq reads
When working with paired-end reads, PASTA applies the patterned alignment procedure described above to the reads in each pair for which no full alignments were found. Reads in a pair are aligned independently, but information regarding the position of one of the reads can be helpful in aligning the second one. For example, assume that the leftmost read in a pair (read A) has a full-length match to the genome, while the rightmost read (B) requires a patterned alignment. If it becomes necessary to perform local alignment on the left fragment of read B, the region to be scanned will be delimited by the position of read A.
PASTA is distributed as a command-line tool for GNU/Linux systems, and is specifically designed for inclusion in automated RNA-Seq analysis pipelines. PASTA can take advantage of multiple cores by parallelizing operations whenever possible. The program takes as input a file containing short reads in FASTQ format (or two files, in the case of paired-end sequencing), and an argument indicating the organism being analyzed, in order to load the correct parameters for the logistic regression model. The output consists of a file listing all identified junctions in BedGraph format, and a file containing the positions of all matched reads in SAM format. In addition, PASTA generates optional output files containing alignment details for each predicted junction and overall alignment statistics, such as the splice site signal and the probability score of the predicted junction from the statistical model. PASTA requires approximately 10 GB of RAM and a runtime of 5 hours to process 20 million paired-end RNA-Seq reads of 50 bp using 4 parallel threads. The program is highly configurable through command-line arguments or a configuration file, and an integrated help system provides extensive documentation for all available options. The download package provides sample files that can be used to test the program and experiment with the different options.
Results and discussion
We have tested PASTA both on simulated RNA-Seq data, to measure its sensitivity and specificity in identifying known splice junctions, and on two real mouse RNA-Seq datasets, in order to assess the program’s ability to identify splice junctions inferred from ENSEMBL annotations and to estimate the number of novel potential junctions discovered. In both cases, the performance of PASTA was compared to that of TopHat (version 1.1.0), one of the most widely-used programs for splice junction detection .
Comparison of PASTA and Tophat on single end simulated data
As a first test of the performance of PASTA, we compared its ability to detect known splice junctions against TopHat. We generated four simulated datasets of 50nt single-ended RNA-Seq reads from mouse transcripts appearing in ENSEMBL gene annotations, corresponding to average depths of coverage ranging from 1 to 8 reads per nucleotide, and we introduced random sequencing errors at a frequency of 1/1000 basepairs and Single Nucleotide Polymorphism (SNP) at a frequency of 5/1000 basepairs. After running PASTA and TopHat on the four datasets, we measured each program’s sensitivity and specificity. The results are reported in Figure 1. As read depth increases, sensitivity increases (since detecting junctions becomes easier) but specificity also decreases (since the number of false positives increases with the number of reads). The results show that PASTA consistently exhibits a lower false negative rate than TopHat, especially at low coverage level. Sensitivity is consistently higher than TopHat (on average, 20% to 40% higher), especially for transcripts expressed at a low level. PASTA is therefore well-suited for identifying “rare” splicing events, reducing the risk of missing splicing events critical for AS analysis.
This indicates that the use of PASTA may lead to a reduction in sequencing costs, for example by multiplexing more samples in the same run, since it is able to produce reliable results even at low sequencing depths.
Comparison between PASTA and other pipelines on paired-end simulated data
We also compared PASTA with several other splice junction detection pipelines using simulated datasets of 100 nt paired-end reads provided by Grant et al. . Two different datasets were generated on the basis of different polymorphism frequencies and error rates. The first datasets contains simulated reads with an indel rate of 0.05%, a sequencing error rate of 0.5%, and a substitution rate of 0.1%. The second datasets has an indel rate of 0.25%, a sequencing error range of 1%, and a substitution rate of 0.5%. In both cases, PASTA performed significantly better than MAPSPLICE, SPLICEMAP or Tophat, and achieved performance comparable to GSNAP or RUM, as shown in Figure 2.
Overall, simulations with various sequencing depths show that PASTA is highly sensitive in identifying splice junctions even when the sequencing depth is relatively low. PASTA is able to identify splice junctions with reads of small to medium size (30 to 70 basepairs) and rare transcripts that result in low coverage levels. On the other hand, PASTA achieves results similar to other comparable tools on longer reads and higher sequencing depths, as demonstrated by the second simulation study.
Analysis on mouse RNA-Seq data
We have performed two RNA-Sequencing runs on Illumina Genome Analyzer 2x using mouse samples in order to generate initial validation data for PASTA (called Run1 and Run2 respectively in the following). Run1 produced two lanes of RNA-Seq data from control mouse samples and four lanes from mutant mouse samples, while Run2 produced three lanes from control mouse samples and four lanes from mutant mouse samples. The mutant samples were obtained from Mbnl3 knockdown mice provided by the Swanson laboratory at the University of Florida. Mbnl3 encodes a protein belonging to the muscleblind family of Cys3His zinc finger proteins, and is known to have a widespread effect on splicing. The disruption of normal Mbnl3 function may therefore induce changes in the splicing pattern of many downstream genes regulated by it [13, 14]. In the following we provide a description of the experiments and we present our preliminary analysis of the results.
In order to evaluate the performance of PASTA in detecting splice junctions, we ran TopHat on the same data, and we compared the number of known junctions identified by the two programs. Since we do not have an independent way of confirming the predicted junctions, we used the set of junctions derived from the validated gene models in the ENSEMBL genes database as the “gold standard” against which to measure performance. Tables 1, 2 and 3 contain the results of the comparison between PASTA and TopHat on data from the two sequencing runs. The results show, first of all, that PASTA detects a higher number of junctions than TopHat, especially with shorter read lengths, and the difference increases with the number of reads. Table 2 focuses on known junctions: we computed the number of predicted junctions appearing in ENSEMBL that are identified by both programs, and reported the ratio between the common known junctions and the number of predictions from each program, respectively, as a percentage. The resulting value, that indicates the percentage of TopHat junctions that PASTA correctly identifies, is consistently at 96% or higher in both Run1 and Run 2. Additionally, PASTA identifies an extra 50% or higher of known junctions in Run 1, and 20% or higher of known junctions in Run 2 that are completely absent in TopHat predictions. The number of known junctions identified by TopHat is significantly lower in Run 1 in comparison with Run 2, as a result of the smaller number of reads and shorter read size in Run 1 compared with Run 2. In contrast, PASTA is able to recover a similar number of splice junctions in Run 1 and Run 2. Some of these extra known junctions identified by PASTA map to exons that are sequenced at low levels or are supported by very few junction-spanning reads; these reads are therefore crucial in reconstructing correct gene structures.
Table 3 reports the number of additional junctions (i.e., not found in ENSEMBL known genes) identified by both programs. Although it is impossible to know whether these junctions are real or are false positives without performing a large-scale validation experiment, a substantial number of them are identified by both programs, especially in the case of longer read lengths and higher sequencing depths. In addition, a significant fraction of the new junctions is covered by more than one read, which increases the likelihood that they are indeed biologically real. The average coverage of new junctions ranges from 5.3 in Run 1 to 8.2 in Run 2, and the percentage of junctions with coverage higher than the average ranges from 8% in Run 1 to 16% in Run 2.
Table 4 displays the average probability score and the frequency of canonical (GT-AG) junctions as a function of the coverage level (expressed as reads/junctions) on junctions identified by PASTA. The results show that higher coverage is usually a strong indicator of real junctions, characterized by high probability scores and presence of canonical splice signals. These results are further supported by Table 5. Using ENSEMBL known genes, we can see that PASTA predicted junctions that appear in the known genes exhibit higher average probability scores and higher coverage. In addition, Table 6 shows that canonical junctions have significantly higher average scores and coverage than non-canonical one. Finally, in order to see the average coverage for PASTA predicted junctions that appeared in ENSEMBL know genes, we compute the total number of reads that fall on these junctions. These results suggest that PASTA predicted junctions that appear on ENSEMBL genes are mostly canonical junctions (97% or more), and are normally supported by high probability scores and high coverage as shown by Table 5. Finally, we compared PASTA with RUM in splice junction prediction using the RNA-Seq datasets from Run 1 and Run 2. Results (see Additional file 1: Tables S2 and S3) shows that PASTA and RUM predictions are very closely correlated with each other in detecting known junctions.
PCR validations on minor splice sites
In order to validate PASTA’s ability to detect splice junctions including minor splice sites, we selected a total of nine splicing junctions in a third RNA-Seq dataset from the mouse forebrain and tested them experimentally. Five of these candidate targets contained the minor splice site signal AT-AC, while the other four candidate targets contained the minor splice site signal GC-AG (Additional file 1: Figure S2 and Table S4).
PCR assays confirmed the presence of all GC-AG signals and AT-AC signals tested, demonstrating PASTA’s potential to accurately discover minor splice sites. In particular, we investigated AT-AC signals, which are excised by a new class of splicesome, as they are known to be present in genes with critical cellular functions [2, 8]. ENSEMBL annotations only contain a total of 473 AT-AC signals, which may be an underestimation due to bias towards canonical junctions in gene-finding algorithms. PASTA identified on average 500 AT-AC signals per sequencing run in mouse RNA-Seq datasets, and approximately 30-50% of these signals are present in ENSEMBL annotated junctions (Table 7). The remaining 50% or more AT-AC signals may come from novel junctions.
PASTA is an easy to use and efficient tool to identify splice junctions from RNA-Seq data, intended as the first module in a complete computational pipeline for AS analysis. Compared to similar tools, PASTA offers an increased ability to detect real splice junctions especially at low coverage levels and short sequence size, due to several heuristic strategies it employs. It does not rely on the presence of canonical splice junctions, and it uses an organism-specific statistical model to evaluate predicted intron-exon junctions. Junction positions are determined through a highly accurate procedure based on patterned alignments. Moreover, PASTA enables the prediction of trans-splicing events from patterned alignments identified in different chromosomes. It allows prediction of splice junctions in less well-studied non-model organisms using information learned from closely-related model organisms. In addition, experimental validation demonstrates PASTA’s high sensitivity in discovering minor splice sites (Table 7, Additional file 1: Figure S2 and Table S4).
Finally, PASTA does not filter predicted junctions on the basis of their coverage, but retains high-scoring junctions even when they are supported by a low number of reads. The reason is that the final result we are interested in is not the presence or absence of an individual junction, but which isoform structures can be inferred from a set of junctions in the same locus. It is therefore a better strategy to retain low-coverage junctions (provided they have a high score) and evaluate the isoform(s) they participate in when information about all the other junctions in them is known. The resulting high sensitivity in discovering splice junctions, including minor splice sites, was demonstrated by experimental validation of a subset of PASTA predictions.
Availability and requirements
Project name: PASTA
Project home page: http://genome.ufl.edu/rivalab/PASTA
Operating system(s): Linux, OS X
Programming language: Common Lisp
Other requirements: Bowtie package
License: GNU GPL
Single nucleotide polymorphism
Barash Y, Calarco JA, Gao W, Pan Q, Wang X, Shai O, Blencowe BJ, Frey BJ: Deciphering the splicing code. Nature. 2010, 465: 53-59. 10.1038/nature09000.
Wang G-S, Cooper TA: Splicing in disease: disruption of the splicing code and the decoding machinery. Nat Rev Genet. 2007, 8: 749-761. 10.1038/nrg2164.
Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ: Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008, 40: 1413-1415. 10.1038/ng.259.
Grant GR, Farkas MH, Pizarro AD, Lahens NF, Schug J, Brunk BP, Stoeckert CJ, Hogenesch JB, Pierce EA: Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics. 2011, 27: 2518-2528.
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.
Shendure J: The beginning of the end for microarrays?. Nat Methods. 2008, 5: 585-587. 10.1038/nmeth0708-585.
Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25-10.1186/gb-2009-10-3-r25.
Burset M, Seledtsov IA, Solovyev VV: Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res. 2000, 28: 4364-4375. 10.1093/nar/28.21.4364.
Graham JH, Robb DT, Poe AR: Random phenotypic variation of yeast (Saccharomyces cerevisiae) single-gene knockouts fits a double pareto-lognormal distribution. PLoS One. 2012, 7 (11): 1-6.
Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras J-B, Stephens M, Gilad Y, Pritchard JK: Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010, 464: 768-772. 10.1038/nature08872.
Aban IB, Meerschaert MM, Panorska AK: Parameter estimation for the truncated pareto distribution. J Am Stat Assoc. 2006, 101: 270-277. 10.1198/016214505000000411.
Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009, 25: 1105-1111. 10.1093/bioinformatics/btp120.
Lee K-S, Smith K, Amieux PS, Wang EH: MBNL3/CHCR prevents myogenic differentiation by inhibiting MyoD-dependent gene transcription. Differentiation. 2008, 76: 299-309. 10.1111/j.1432-0436.2007.00209.x.
Squillace RM, Chenault DM, Wang EH: Inhibition of muscle differentiation by the novel muscleblind-related protein CHCR. Dev Biol. 2002, 250: 218-230. 10.1006/dbio.2002.0798.
The authors wish to thank Dr. Maurice Swanson for his help with the generation of the RNA-Seq datasets and Dr. Mark CK Young for his assistance on the statistical analysis.
This work was supported by the UF Genetics Institute, under the “Interdisciplinary RNA Research” project. Publication charges were paid by the UF Department of Molecular Genetics and Microbiology.
The authors declare that they have no competing interests.
ST conceived the algorithm, developed the PASTA software, and performed in-silico and experimental validation. AR supervised the entire project. Both authors read and approved the final manuscript.
Electronic supplementary material
About this article
Cite this article
Tang, S., Riva, A. PASTA: splice junction identification from RNA-Sequencing data. BMC Bioinformatics 14, 116 (2013). https://0-doi-org.brum.beds.ac.uk/10.1186/1471-2105-14-116
- Next-generation sequencing
- Alternative splicing
- Computational analysis of alternative splicing