Table 1 Different machine learning approaches in sequencing data error detection and correction

From: MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads

Machine learning models: ANNs, NB, SVM, and 20 tree-based machine learning models

Study goals: Classify each base in metagenomic and polyploid samples as a true-variation base (signal) or an erroneous-variation base (noise)

Pre-processing stage: Data clustering; multiple sequence alignment

Trained features: The frequencies of the bases in each column of the resulting multiple sequence alignment matrix and in the cluster of locally similar sequences. Windows with different distance thresholds around the base being classified are constructed so that the information available from its neighbours in the alignment matrix is taken into account (see the sketch after this entry)

Ref: [29]
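For illustration, a minimal sketch of how per-column base frequencies within a window around the classified base might be derived from an MSA matrix. The function names, the string-based MSA representation, and the zero-padding outside the alignment are assumptions for this example, not taken from [29].

```python
# Illustrative sketch (not the authors' code): base-frequency features from an
# MSA column window centred on the base being classified.
from collections import Counter

BASES = "ACGT-"

def column_frequencies(msa, col):
    """Relative frequency of each base in one MSA column (msa: equal-length strings)."""
    counts = Counter(row[col] for row in msa if row[col] != " ")
    total = sum(counts.values()) or 1
    return [counts.get(b, 0) / total for b in BASES]

def window_features(msa, target_col, radius=2):
    """Concatenate column frequencies for all columns within `radius` of the target base."""
    n_cols = len(msa[0])
    feats = []
    for col in range(target_col - radius, target_col + radius + 1):
        if 0 <= col < n_cols:
            feats.extend(column_frequencies(msa, col))
        else:
            feats.extend([0.0] * len(BASES))  # pad positions outside the alignment
    return feats

# Toy example: three aligned reads, classify the base at column 2
msa = ["ACGTA", "ACCTA", "ACGTA"]
print(window_features(msa, target_col=2, radius=1))
```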

Machine learning models: RF

Study goals: Used in CARE 2.0 error correction to classify each base in the low-quality multiple sequence alignment

Pre-processing stage: k-mer extraction and hashing; multiple sequence alignment

Trained features (a sketch of how a subset of these could be computed follows this entry):

Local features:
- The relative frequency of the base that is a candidate for correction
- The quality-weighted relative frequency of the base that is a candidate for correction
- The average quality score of the base that is a candidate for correction
- The relative frequency of the consensus base
- The quality-weighted relative frequency of the consensus base
- The average quality score of the consensus base
- The average quality score
- The normalized coverage
- The normalized average quality

Global features:
- The average quality-weighted relative frequency of the consensus base
- The minimum quality-weighted relative frequency of the consensus base
- The normalized maximum coverage
- The normalized minimum coverage

Ref: [28]
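A minimal, hypothetical sketch of how a subset of the local features listed above (relative frequency, quality-weighted relative frequency, and average quality of the candidate and consensus bases) could be computed for one alignment column. This is not CARE 2.0's implementation; the function and field names are illustrative assumptions.

```python
# Illustrative sketch (assumed representation, not CARE 2.0's code): local features
# for one MSA column, given the column's bases and their Phred quality scores.
def local_column_features(bases, quals, candidate, consensus):
    """bases/quals: aligned lists for one column; candidate/consensus: single characters."""
    cov = len(bases) or 1
    qsum = sum(quals) or 1
    def rel_freq(b):     # relative frequency of base b in the column
        return sum(1 for x in bases if x == b) / cov
    def qw_rel_freq(b):  # quality-weighted relative frequency of base b
        return sum(q for x, q in zip(bases, quals) if x == b) / qsum
    def avg_qual(b):     # average quality score of base b
        qs = [q for x, q in zip(bases, quals) if x == b]
        return sum(qs) / len(qs) if qs else 0.0
    return {
        "cand_rel_freq": rel_freq(candidate),
        "cand_qw_rel_freq": qw_rel_freq(candidate),
        "cand_avg_qual": avg_qual(candidate),
        "cons_rel_freq": rel_freq(consensus),
        "cons_qw_rel_freq": qw_rel_freq(consensus),
        "cons_avg_qual": avg_qual(consensus),
        "avg_qual": sum(quals) / cov,
    }

# Toy column with coverage 5: consensus A, candidate correction G
print(local_column_features(list("AAGAA"), [30, 35, 12, 33, 31], candidate="G", consensus="A"))
```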

Machine learning models: RNN

Study goals: Choosing the best k-mer value for short-read error correction tools

Pre-processing stage: Tokenization of the reads into characters (i.e. bases) and words (k-mers), transformed using one-hot encoding (see the sketch after this entry)

Trained features: The probability distribution of the words (k-mers) and characters (bases) of the text (the sequencing reads), considering their relative context

Ref: [30]
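A short sketch, under assumed conventions, of this tokenization step: splitting a read into overlapping k-mer "words" and one-hot encoding them over the 4^k vocabulary. The helper names and the NumPy representation are illustrative, not taken from [30].

```python
# Illustrative sketch (assumed pipeline, not the authors' code): tokenize a read into
# overlapping k-mers ("words") and one-hot encode them as RNN input.
from itertools import product
import numpy as np

def kmer_tokens(read, k):
    """Overlapping k-mer 'words' of a read (characters/bases would be the k=1 case)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def one_hot(tokens, k):
    """One row per token, one column per possible k-mer over the A/C/G/T alphabet."""
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    mat = np.zeros((len(tokens), len(vocab)), dtype=np.float32)
    for row, tok in enumerate(tokens):
        mat[row, vocab[tok]] = 1.0
    return mat

read = "ACGTACGT"
tokens = kmer_tokens(read, k=3)     # ['ACG', 'CGT', 'GTA', ...]
print(one_hot(tokens, k=3).shape)   # (6, 64): 6 tokens, 4^3 possible 3-mers
```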

Machine learning models: Transformer network with self-attention

Study goals: Choosing the best k-mer value for short- and long-read error correction tools

Pre-processing stage: Tokenization of the reads into characters (i.e. bases) and words (k-mers), transformed using sine and cosine positional encoding (see the sketch after this entry)

Trained features: The probability distribution of the words (k-mers) and characters (bases) of the text (the sequencing reads), considering their relative context

Ref: [31]
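For reference, a small sketch of the standard sine/cosine positional encoding (the "Attention Is All You Need" formulation), which is assumed here to correspond to the encoding mentioned in this entry; the function name and dimensions are illustrative, not taken from [31].

```python
# Illustrative sketch of sinusoidal positional encoding for token sequences:
# enc[pos, 2i]   = sin(pos / 10000^(2i / d_model))
# enc[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angle = pos / np.power(10000.0, i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angle)               # even dimensions
    enc[:, 1::2] = np.cos(angle)               # odd dimensions
    return enc

# Encode positions for 6 k-mer tokens embedded in 8 dimensions
print(sinusoidal_positions(seq_len=6, d_model=8).shape)   # (6, 8)
```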