Table 1 Different machine learning approaches in sequencing data error detection and correction

From: MAC-ErrorReads: machine learning-assisted classifier for filtering erroneous NGS reads

Machine learning models: ANNs, NB, SVM, and 20 tree-based machine learning models

Study goals: Classify each base in metagenomic and polyploid samples as a true-variation base (signal) or an erroneous-variation base (noise)

Pre-processing stage: Data clustering; multiple sequence alignment

Trained features: The frequencies of the bases in each column of the resulting multiple sequence alignment matrix and in the cluster of locally similar sequences. Windows with different distance thresholds around the base being classified are constructed so that the information available from its neighbours in the alignment matrix is taken into account (see the sketch after this entry)

Ref: [29]
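For illustration, a minimal sketch of how per-column base frequencies within a window around the classified base might be derived from an MSA matrix. The function names, the string-based MSA representation, and the zero-padding outside the alignment are assumptions for this example, not taken from [29].

```python
# Illustrative sketch (not the authors' code): base-frequency features from an
# MSA column window centred on the base being classified.
from collections import Counter

BASES = "ACGT-"

def column_frequencies(msa, col):
    """Relative frequency of each base in one MSA column (msa: equal-length strings)."""
    counts = Counter(row[col] for row in msa if row[col] != " ")
    total = sum(counts.values()) or 1
    return [counts.get(b, 0) / total for b in BASES]

def window_features(msa, target_col, radius=2):
    """Concatenate column frequencies for all columns within `radius` of the target base."""
    n_cols = len(msa[0])
    feats = []
    for col in range(target_col - radius, target_col + radius + 1):
        if 0 <= col < n_cols:
            feats.extend(column_frequencies(msa, col))
        else:
            feats.extend([0.0] * len(BASES))  # pad positions outside the alignment
    return feats

# Toy example: three aligned reads, classify the base at column 2
msa = ["ACGTA", "ACCTA", "ACGTA"]
print(window_features(msa, target_col=2, radius=1))
```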

Machine learning models: RF

Study goals: Used in CARE 2.0 error correction to classify each base in the low-quality multiple sequence alignment

Pre-processing stage: k-mer extraction and hashing; multiple sequence alignment

Trained features (a sketch of how a subset of these could be computed follows this entry):

Local features:
- The relative frequency of the base that is a candidate for correction
- The quality-weighted relative frequency of the base that is a candidate for correction
- The average quality score of the base that is a candidate for correction
- The relative frequency of the consensus base
- The quality-weighted relative frequency of the consensus base
- The average quality score of the consensus base
- The average quality score
- The normalized coverage
- The normalized average quality

Global features:
- The average quality-weighted relative frequency of the consensus base
- The minimum quality-weighted relative frequency of the consensus base
- The normalized maximum coverage
- The normalized minimum coverage

Ref: [28]
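A minimal, hypothetical sketch of how a subset of the local features listed above (relative frequency, quality-weighted relative frequency, and average quality of the candidate and consensus bases) could be computed for one alignment column. This is not CARE 2.0's implementation; the function and field names are illustrative assumptions.

```python
# Illustrative sketch (assumed representation, not CARE 2.0's code): local features
# for one MSA column, given the column's bases and their Phred quality scores.
def local_column_features(bases, quals, candidate, consensus):
    """bases/quals: aligned lists for one column; candidate/consensus: single characters."""
    cov = len(bases) or 1
    qsum = sum(quals) or 1
    def rel_freq(b):     # relative frequency of base b in the column
        return sum(1 for x in bases if x == b) / cov
    def qw_rel_freq(b):  # quality-weighted relative frequency of base b
        return sum(q for x, q in zip(bases, quals) if x == b) / qsum
    def avg_qual(b):     # average quality score of base b
        qs = [q for x, q in zip(bases, quals) if x == b]
        return sum(qs) / len(qs) if qs else 0.0
    return {
        "cand_rel_freq": rel_freq(candidate),
        "cand_qw_rel_freq": qw_rel_freq(candidate),
        "cand_avg_qual": avg_qual(candidate),
        "cons_rel_freq": rel_freq(consensus),
        "cons_qw_rel_freq": qw_rel_freq(consensus),
        "cons_avg_qual": avg_qual(consensus),
        "avg_qual": sum(quals) / cov,
    }

# Toy column with coverage 5: consensus A, candidate correction G
print(local_column_features(list("AAGAA"), [30, 35, 12, 33, 31], candidate="G", consensus="A"))
```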

Machine learning models: RNN

Study goals: Choosing the best k-mer value for short-read error correction tools

Pre-processing stage: Tokenization of the reads into characters (i.e. bases) and words (k-mers), transformed using one-hot encoding (see the sketch after this entry)

Trained features: The probability distribution of the words (k-mers) and characters (bases) of the text (the sequencing reads), considering their relative context

Ref: [30]
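A short sketch, under assumed conventions, of this tokenization step: splitting a read into overlapping k-mer "words" and one-hot encoding them over the 4^k vocabulary. The helper names and the NumPy representation are illustrative, not taken from [30].

```python
# Illustrative sketch (assumed pipeline, not the authors' code): tokenize a read into
# overlapping k-mers ("words") and one-hot encode them as RNN input.
from itertools import product
import numpy as np

def kmer_tokens(read, k):
    """Overlapping k-mer 'words' of a read (characters/bases would be the k=1 case)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def one_hot(tokens, k):
    """One row per token, one column per possible k-mer over the A/C/G/T alphabet."""
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    mat = np.zeros((len(tokens), len(vocab)), dtype=np.float32)
    for row, tok in enumerate(tokens):
        mat[row, vocab[tok]] = 1.0
    return mat

read = "ACGTACGT"
tokens = kmer_tokens(read, k=3)     # ['ACG', 'CGT', 'GTA', ...]
print(one_hot(tokens, k=3).shape)   # (6, 64): 6 tokens, 4^3 possible 3-mers
```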

Machine learning models: Transformer network with self-attention

Study goals: Choosing the best k-mer value for short- and long-read error correction tools

Pre-processing stage: Tokenization of the reads into characters (i.e. bases) and words (k-mers), transformed using sine and cosine positional encoding (see the sketch after this entry)

Trained features: The probability distribution of the words (k-mers) and characters (bases) of the text (the sequencing reads), considering their relative context

Ref: [31]
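For reference, a small sketch of the standard sine/cosine positional encoding (the "Attention Is All You Need" formulation), which is assumed here to correspond to the encoding mentioned in this entry; the function name and dimensions are illustrative, not taken from [31].

```python
# Illustrative sketch of sinusoidal positional encoding for token sequences:
# enc[pos, 2i]   = sin(pos / 10000^(2i / d_model))
# enc[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angle = pos / np.power(10000.0, i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angle)               # even dimensions
    enc[:, 1::2] = np.cos(angle)               # odd dimensions
    return enc

# Encode positions for 6 k-mer tokens embedded in 8 dimensions
print(sinusoidal_positions(seq_len=6, d_model=8).shape)   # (6, 8)
```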