Automated quality control for a molecular surveillance system

Sims, Seth; Longmire, Atkinson G.; Campo, David S.; Ramachandran, Sumathi; Medrzycki, Magdalena; Ganova-Raeva, Lilia; Lin, Yulin; Sue, Amanda; Thai, Hong; Zelikovsky, Alexander; Khudyakov, Yury

doi:10.1186/s12859-018-2329-5

BMC Bioinformatics

Table 1. GHOST QC filters, listed in order of execution. All filters except “Primer dimer” are discussed in detail in Longmire et al. [12].


Order	Filter name	Description	Position relative to N = 20,000 random sampling
1	Ambiguity	After standard demultiplexing, read pairs are filtered out if a read has more than three N’s.	Before
2	Primer dimer	Checks for primer dimers or non-specific product. For each read pair, this filter searches the forward read for the forward primer, using the same search parameters as the primer verification filter. Once the forward primer is found, the reverse complement is searched for the reverse primer. If the distance between the forward and reverse primers is less than the threshold set in the short read filter (185 bp), the pair is discarded. If both primers are not found, the process is repeated with the reverse read.	Before
3	Short read	Read pairs are filtered out if either read has a length less than 185 bp.	Before
4	MID mismatch	The identifiers on both the forward and reverse reads are examined, and the pair is discarded if either identifier is not an exact match to an entry in a given list of valid identifiers.	Before
5	Minority MID	Pairs containing valid identifiers are discarded if they are not a constituent of the majority identifier tuple. If 25% or more of the read pairs are found to contain valid identifiers that are not the majority tuple, the entire sample is discarded from analysis without further processing.	Before
6	Primer verification	Primer sequence patterns are searched for in the forward and reverse reads. Primers are located in each read using fuzzy matching that allows at most 2 substitutions, at most 1 insertion (relative to the reference), at most 1 deletion (relative to the reference), and at most 3 errors in total. Read pairs in which either primer cannot be found are discarded. The primer locations are used to orient the reads into a uniform orientation.	After
7	Casper mismatch	Read pairs are unified into a single error-corrected sequence using the Casper error correction method with a quality threshold of 15, k-mer length of 17, k-mer neighborhood of 8, and minimum match threshold of 95%. Overlap fitness is evaluated by the classical Hamming distance. The overlap with the highest ratio of correct positions to overlap length is selected, with the longest overlap preferred when more than one overlap has the same ratio.	After
8	Nonsense	Merged sequences are discarded if a nonsense-free reading frame cannot be found.	After
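
As an illustration of the simpler filters in Table 1 (Ambiguity, Short read, MID mismatch, Minority MID, and Nonsense), the following Python sketch shows one way the stated thresholds could be applied. It is not the GHOST implementation. All function and constant names are hypothetical, and two points are assumptions rather than details given in the table: the minority fraction is computed over the read pairs that carry valid identifiers, and the nonsense check inspects only the three forward reading frames, on the basis that reads have already been put into a uniform orientation.

```python
from collections import Counter

MAX_N = 3               # Ambiguity: a read with more than three N's fails
MIN_LENGTH = 185        # Short read: a read shorter than 185 bp fails
MINORITY_LIMIT = 0.25   # Minority MID: 25% or more discards the sample
STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_ambiguity(fwd, rev):
    """Keep the pair only if neither read has more than MAX_N N's."""
    return all(read.upper().count("N") <= MAX_N for read in (fwd, rev))


def passes_length(fwd, rev):
    """Keep the pair only if both reads are at least MIN_LENGTH bases long."""
    return len(fwd) >= MIN_LENGTH and len(rev) >= MIN_LENGTH


def filter_by_mid(pairs, valid_mids):
    """Apply the MID mismatch and Minority MID filters.

    pairs: list of (forward_mid, reverse_mid, record) tuples.
    Returns the records carrying the majority identifier tuple,
    or None when the whole sample should be discarded.
    """
    # MID mismatch: both identifiers must exactly match a valid identifier.
    valid = [p for p in pairs if p[0] in valid_mids and p[1] in valid_mids]
    if not valid:
        return []
    counts = Counter((f, r) for f, r, _ in valid)
    majority, majority_count = counts.most_common(1)[0]
    # Assumption: the fraction is taken over pairs with valid identifiers.
    minority_fraction = 1 - majority_count / len(valid)
    if minority_fraction >= MINORITY_LIMIT:
        return None
    return [rec for f, r, rec in valid if (f, r) == majority]


def has_nonsense_free_frame(seq):
    """Return True if at least one forward frame contains no stop codon."""
    seq = seq.upper()
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False
```

In this sketch, `filter_by_mid` returns `None` rather than an empty list when the sample is discarded, so a caller can tell "sample rejected" apart from "no pairs with valid identifiers".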


ISSN: 1471-2105
