From: MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads
Machine learning models | Study goals | Pre-processing stage | Trained features | Ref |
---|---|---|---|---|
ANNs, NB, SVM, and 20 tree-based machine learning models | Classify each base in metagenomic and polyploid samples as a true-variation base (signal) or an erroneous-variation base (noise) | Data clustering; multiple sequence alignment | The frequencies of the bases in each column of the resulting multiple sequence alignment matrix and in the cluster of locally similar sequences. Windows at different distance thresholds from the base being classified are constructed to incorporate the information available from its neighbors in the alignment matrix | [29] |
RF | Utilized in CARE 2.0 error correction to classify each base in the low-quality multiple sequence alignment | k-mer extraction and hashing; multiple sequence alignment | Local features: the relative frequency, the quality-weighted relative frequency, and the average quality score of the base being a candidate for correction; the relative frequency, the quality-weighted relative frequency, and the average quality score of the consensus base; the average quality score; the normalized coverage; the normalized average quality. Global features: the average and the minimum quality-weighted relative frequency of the consensus base; the normalized maximum and minimum coverage | [28] |
RNN | Choosing the best k-mer value for short-read error correction tools | Read tokenization into characters (i.e. bases) and words (k-mers), followed by one-hot encoding | The probability distribution of words (k-mers) and characters (bases) of the text (sequencing reads), considering their relative context | [30] |
Transformer network with self-attention | Choosing the best k-mer value for short- and long-read error correction tools | Read tokenization into characters (i.e. bases) and words (k-mers), followed by sine/cosine positional encoding | The probability distribution of words (k-mers) and characters (bases) of the text (sequencing reads), considering their relative context | [31] |
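The per-column base-frequency features used by the alignment-based classifiers in the first two rows ([28], [29]) can be sketched as follows. This is a minimal illustration, not the papers' actual implementations; the function name and the toy alignment are hypothetical, and quality weighting is omitted.

```python
from collections import Counter

def column_base_frequencies(msa):
    """Relative frequency of each base (and gap) in every column of a
    multiple sequence alignment, given as equal-length strings."""
    n_rows = len(msa)
    features = []
    for col in zip(*msa):  # iterate over alignment columns
        counts = Counter(col)
        features.append({base: counts.get(base, 0) / n_rows
                         for base in "ACGT-"})
    return features

# Toy alignment of three reads; column 4 has two reads supporting 'T'.
msa = ["ACGT-", "ACGA-", "ACGTT"]
freqs = column_base_frequencies(msa)
# freqs[3]["T"] -> 2/3: two of three aligned reads carry T in column 4
```

A consensus-base feature, as in the CARE 2.0 row, would simply take the argmax of each column's frequency dictionary.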
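The sine/cosine positional encoding in the Transformer row ([31]) is the standard fixed encoding from the Transformer literature: sine on even embedding dimensions and cosine on odd ones, at geometrically spaced frequencies. A self-contained sketch (the function name and dimensions are illustrative, not taken from [31]):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Fixed positional encoding: sin on even dims, cos on odd dims,
    with wavelengths forming a geometric progression up to 10000."""
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]   # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_encoding(seq_len=100, d_model=16)
# pe[0] alternates 0, 1, 0, 1, ... since sin(0) = 0 and cos(0) = 1
```

Because the encoding is additive and deterministic, it lets self-attention recover the relative positions of bases and k-mers without learned position embeddings.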