K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity

Table 4 The number of similar Trinity transcripts between original Inchworm and MapReduce-Inchworm using the mouse RNA-seq data [22]

cutoff for transcript similarity (%)	number of similar transcripts
100	47,816
99	57,926
95	64,109
90	67,178
85	69,002
80	70,398
75	71,390
70	72,285

The two sets of transcripts, from original Inchworm and from MapReduce-Inchworm, were compared using BLAT [42]; transcripts from original Inchworm were used as the target and transcripts from MapReduce-Inchworm as the query. The Perl script blat_top_hit_extractor.pl, included in the Trinity pipeline, was used to extract the top hit in the target for each query transcript. The first column gives the cutoff for transcript similarity, which was quantified using two similarity scores defined as follows: 1) 1 - (query_sequence_size - number_of_matching_bases)/query_sequence_size; 2) 1 - (target_sequence_size - number_of_matching_bases)/target_sequence_size. If both similarity scores for a pair of transcripts, one from each method, were greater than or equal to the cutoff value, the pair was counted as similar transcripts. The second column gives the number of similar transcripts between original Inchworm and MapReduce-Inchworm at that cutoff. The total number of transcripts from each method can be found in Table 3.
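
The cutoff rule in this note can be illustrated with a minimal Python sketch. It is not the script used in the study: the function names, the (query_size, target_size, matching_bases) tuple layout, and the example hits are all hypothetical, and parsing of BLAT output is left out. The sketch relies on the fact that each score reduces algebraically to number_of_matching_bases divided by the corresponding sequence size, so the cutoff test can be done with integers.

```python
from typing import Iterable, Tuple

# Hypothetical record layout for one top hit:
# (query_sequence_size, target_sequence_size, number_of_matching_bases)
Hit = Tuple[int, int, int]


def similarity_scores(query_size: int, target_size: int,
                      matching_bases: int) -> Tuple[float, float]:
    """Return the two similarity scores defined in the table note."""
    query_score = 1 - (query_size - matching_bases) / query_size
    target_score = 1 - (target_size - matching_bases) / target_size
    return query_score, target_score


def is_similar(query_size: int, target_size: int,
               matching_bases: int, cutoff_percent: int) -> bool:
    """True if both scores are >= the cutoff (given in percent).

    Each score equals matching_bases / size, so the comparison is done
    with integers to avoid floating-point rounding at the boundary.
    """
    return (100 * matching_bases >= cutoff_percent * query_size
            and 100 * matching_bases >= cutoff_percent * target_size)


def count_similar(hits: Iterable[Hit], cutoff_percent: int) -> int:
    """Count the hits for which both scores reach the cutoff."""
    return sum(is_similar(q, t, m, cutoff_percent) for q, t, m in hits)


# Invented example hits, for illustration only (not data from the paper).
example_hits = [
    (1000, 1000, 1000),  # identical transcripts
    (1000, 1200, 960),   # query score 0.96, target score 0.80
    (500, 2000, 500),    # query fully covered, target only 25% covered
]

if __name__ == "__main__":
    for cutoff in (100, 95, 90, 85, 80, 75, 70):
        print(cutoff, count_similar(example_hits, cutoff))
```

Requiring both scores to pass means that a short transcript fully contained in a much longer one is not counted as similar, as the third example hit shows.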

ISSN: 1471-2105