TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets

BMC Bioinformatics

Table 4 Assembly results for three metagenomic datasets

Library	Assembly run	# Reads	# Contigs (> 500 bp)	Average contig length in bp (> 500 bp)	Contig N50¹ (bp)	# Concatenated tag sequences allowing 3 mismatches
LIB019	A	42,825	136 (25)	329.91 (703.08)	423	10
	B	34,778	73 (25)	390.04 (694.04)	605	5
	C	35,426²	50 (26)	510.94 (768.92)	663	0
LIB020	A	17,129	89 (6)	246.40 (557.33)	306	4
	B	14,208	55 (13)	292.85 (655.85)	510	3
	C	14,366²	52 (12)	312.54 (726.33)	547	0
LIB021	A	49,282	305 (15)	238.54 (682.00)	276	29
	B	41,126	186 (18)	264.12 (691.67)	302	16
	C	42,495²	165 (20)	282.39 (782.00)	303	0

The GS De Novo Assembler Software version 2.3 (Roche, Branford, CT) was used to assemble three metagenomic libraries (LIB019, LIB020 and LIB021) to illustrate how TagCleaner can improve metagenomic and other high-throughput studies. The assembly parameters were set to 95% identity over at least 35 bp. Each library was assembled three times, from differently preprocessed input: (A) raw data; (B) tag sequences trimmed allowing three mismatches; (C) tag sequences trimmed allowing three mismatches, with additional splitting of fragment-to-fragment concatenations and continuous end tag trimming. For B and C, the minimum sequence length was set to 40 bp, sequence duplicates were removed, and all other parameters were kept at their default values.
¹ The N50 contig size is a weighted median that is defined as the length of the smallest contig C in the sorted list of all contigs where the cumulative length from the largest contig to contig C is at least 50% of the total length (sum of contig lengths).
² Increased number of reads due to splitting of the fragment-to-fragment concatenations.
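
The N50 definition in footnote 1 can be illustrated with a short Python sketch. This is a minimal illustration of the definition only, not the code used by the assembler or by TagCleaner; the function name `n50` and the example contig lengths are hypothetical and are not taken from the libraries in the table.

```python
def n50(contig_lengths):
    """Return the N50 (in bp) for a collection of contig lengths.

    Follows the definition in footnote 1: sort contigs from largest to
    smallest and return the length of the contig at which the cumulative
    length first reaches at least 50% of the total length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("N50 is undefined for an empty set of contigs")

    total = sum(lengths)
    cumulative = 0
    for length in lengths:
        cumulative += length
        # Comparing 2 * cumulative with total avoids floating-point division.
        if 2 * cumulative >= total:
            return length


# Hypothetical contig lengths in bp (not from Table 4).
example_lengths = [100, 200, 300, 400, 500]
print(n50(example_lengths))  # prints 400
```

In this example the total length is 1,500 bp. The largest contig (500 bp) covers less than half of the total, while the two largest together (900 bp) cover at least half, so the N50 is 400 bp.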

ISSN: 1471-2105