From: Automated quality control for a molecular surveillance system
Order | Filter name | Description | Position relative to random sampling (N = 20,000) |
---|---|---|---|
1 | Ambiguity | After standard demultiplexing, a read pair is filtered out if either read contains more than three N’s. | Before |
2 | Primer dimer | Checks for primer dimers or non-specific product. For each read pair, this filter searches the forward read for the forward primer, using the same search parameters as the filter dedicated to primer verification and read orientation. Once the forward primer is found, the reverse complement is searched for the reverse primer. If the distance between the forward and reverse primers is less than the threshold set in the filter dedicated to read length (185 bp), the pair is discarded. If both primers are not found, the process is repeated with the reverse read. | Before |
3 | Short read | Read pairs are filtered out if either read has a length less than 185 bp. | Before |
4 | MID mismatch | The identifiers on both the forward and reverse reads are examined, and the pair is discarded if either identifier is not an exact match to an entry in a given list of valid identifiers. | Before |
5 | Minority MID | Pairs containing valid identifiers are discarded if their identifiers do not form the majority identifier tuple. If 25% or more of the read pairs contain valid identifiers that do not form the majority tuple, the entire sample is discarded from analysis without further processing. | Before |
6 | Primer verification | Primer sequence patterns are searched for in the forward and reverse reads. Primer sequences are located in each read using fuzzy matching that allows at most 2 substitutions, at most 1 insertion (relative to the reference), at most 1 deletion (relative to the reference), and at most 3 errors in total. Read pairs in which either primer cannot be found are discarded. The primer locations are used to orient the reads into a uniform orientation. | After |
7 | Casper mismatch | Read pairs are unified into a single error-corrected sequence using the Casper error correction method with a quality threshold of 15, k-mer length of 17, k-mer neighborhood of 8, and minimum match threshold of 95%. Overlap fitness is evaluated by the classical Hamming distance. The overlap with the highest ratio of correct positions to overlap length is selected; when more than one overlap has the same ratio, the longest overlap is preferred. | After |
8 | Nonsense | Merged sequences are discarded if a nonsense-free reading frame cannot be found. | After |
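
Several of the simpler filters in the table can be illustrated with a short Python sketch. This is not the pipeline's actual implementation: the function names are hypothetical, reads are assumed to be plain uppercase strings, the 25% minority fraction is assumed to be computed over the pairs that reach the Minority MID filter, and the Nonsense filter is assumed to scan the three forward reading frames of an already oriented sequence using the standard stop codons (TAA, TAG, TGA). Only the thresholds (three N’s, 185 bp, 25%) come from the table.

```python
from collections import Counter

MAX_N = 3              # table: more than three N's fails the Ambiguity filter
MIN_LENGTH = 185       # table: reads shorter than 185 bp fail the Short read filter
MINORITY_LIMIT = 0.25  # table: 25% or more minority pairs discards the sample
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def passes_ambiguity(fwd, rev):
    """Keep the pair only if neither read has more than MAX_N ambiguous bases."""
    return fwd.count("N") <= MAX_N and rev.count("N") <= MAX_N


def passes_length(fwd, rev):
    """Keep the pair only if both reads are at least MIN_LENGTH bases long."""
    return len(fwd) >= MIN_LENGTH and len(rev) >= MIN_LENGTH


def filter_minority_mids(mid_tuples):
    """Return indices of pairs carrying the majority identifier tuple.

    Returns None when the whole sample should be discarded, i.e. when the
    fraction of pairs outside the majority tuple reaches MINORITY_LIMIT.
    Assumption: every tuple passed in already consists of valid identifiers.
    """
    if not mid_tuples:
        return []
    majority, count = Counter(mid_tuples).most_common(1)[0]
    minority_count = len(mid_tuples) - count
    if minority_count / len(mid_tuples) >= MINORITY_LIMIT:
        return None
    return [i for i, t in enumerate(mid_tuples) if t == majority]


def has_nonsense_free_frame(seq):
    """True if at least one of the three forward frames has no stop codon."""
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False
```

In this sketch a read pair would pass through `passes_ambiguity` and `passes_length` before sampling, while `has_nonsense_free_frame` would be applied to the merged sequence afterwards, mirroring the "Before" and "After" positions in the table.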