Table 7 Descriptions for machine learning techniques

From: Employing phylogenetic tree shape statistics to resolve the underlying host population structure

Machine learning technique

Description

References

K-nearest neighbour (KNN)

KNN classifies an object based on the closest training examples in the feature space. It is a supervised machine learning technique in which the data are divided into two sets: a training set and a test set. The training set is used to train the machine (learning), while the test set is used to determine the classes of the given objects (actual classification). Given an unknown sample \(k_0\) to be classified and a training data set, the distances between \(k_0\) and all samples in the training set are computed. The k neighbours at the shortest distance (closest) to \(k_0\) are identified, and \(k_0\) is inferred to belong to the class from which its k closest neighbours come. Distance metrics that can be used in KNN classification include Euclidean, squared Euclidean, city-block and Chebyshev.

[50]
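The voting procedure described above can be sketched in a few lines of pure Python; this is an illustrative toy implementation with made-up data, not code from the cited reference, and it uses only the Euclidean metric (the other metrics named above could be swapped in).

```python
from collections import Counter
import math

def knn_classify(k0, training, k=3):
    """Classify point k0 by majority vote among its k nearest
    training samples, using Euclidean distance."""
    # Distance from k0 to every labelled training sample.
    dists = sorted((math.dist(k0, x), label) for x, label in training)
    # Majority vote among the k closest neighbours.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy training set: two well-separated 2-D classes.
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify((0.5, 0.5), train, k=3))  # → A
print(knn_classify((5.5, 5.5), train, k=3))  # → B
```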

Support vector machine (SVM)

SVM is both a supervised learning and a binary classification method. It finds the best separating hyperplane between two classes of training samples in the feature space. Suppose we have n sample points in the training set, where each sample point \(\mathbf{x}_i\) has k attributes and belongs to one of two classes. Denoting the classes by 1 and \(-1\), the sample points are written \((\mathbf{x}_i, y_i)\), where \(i=1,...,n\), \(y_i \in \{ -1,1\}\) and \(\mathbf{x}_i \in {\mathbb {R}}^k\). When the data are separable and \(k=2\), a line separating the two classes is easily drawn. When \(k>2\) and the data are still separable, a hyperplane separates the two classes. When the data are not linearly separable, they are transformed using kernel functions. Commonly used kernels include the radial basis function kernel, the linear kernel, the polynomial kernel and the sigmoidal kernel.

[51, 52]
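For the linearly separable case, a linear SVM can be trained with a simple sub-gradient method on the hinge loss (Pegasos-style). The sketch below is a minimal pure-Python illustration on invented data, not the method of the cited references; a constant 1.0 feature is folded in to act as the bias term.

```python
import random

def train_linear_svm(samples, lam=0.01, epochs=200, seed=0):
    """Sub-gradient training of a linear SVM on the hinge loss.
    samples: list of (x, y) with x a feature tuple and y in {-1, +1}.
    Returns the weight vector w of the separating hyperplane."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(samples, len(samples)):
            t += 1
            eta = 1.0 / (lam * t)  # decreasing step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1:
                # Inside the margin: hinge-loss term contributes.
                w = [(1 - eta * lam) * wi + eta * y * xi
                     for wi, xi in zip(w, x)]
            else:
                # Correctly classified with margin: shrink only.
                w = [(1 - eta * lam) * wi for wi in w]
    return w

def predict(w, x):
    """Sign of the decision function w . x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Linearly separable toy data; leading 1.0 is the bias feature.
data = [((1.0, 2.0, 2.5), 1), ((1.0, 2.5, 2.0), 1),
        ((1.0, -2.0, -2.5), -1), ((1.0, -2.5, -2.0), -1)]
w = train_linear_svm(data)
print(predict(w, (1.0, 2.2, 2.2)))    # expect +1
print(predict(w, (1.0, -2.2, -2.2)))  # expect -1
```

Kernelised variants replace the inner product with a kernel function, as noted in the description above.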

Decision tree (DT)

The DT procedure divides a data set into subdivisions based on a set of tests defined at each branch, or node. From the given data, a tree is constructed, composed of a root, internal nodes (known as splits) and a set of leaves, which are the terminal nodes. Data are classified according to the decision framework defined by the tree: each observation is assigned the class label of the leaf node into which it falls. The learning algorithm defines the splits at each internal node of the decision tree from the training data set. For an accurate decision tree, the training data should be of high quality so that the relations between the features and the classes can be learned easily.

[53]
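A minimal recursive tree learner makes the split/leaf structure concrete; this is a toy sketch with a Gini-impurity split criterion and invented data, not the algorithm of the cited reference.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def build_tree(rows, depth=0, max_depth=3):
    """Recursively grow a tree: internal nodes are (feature,
    threshold, left, right) tuples; leaves are class labels."""
    labels = [y for _, y in rows]
    if len(set(labels)) == 1 or depth == max_depth:
        return max(set(labels), key=labels.count)  # majority-class leaf
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return max(set(labels), key=labels.count)
    _, f, t = best
    return (f, t,
            build_tree([(x, y) for x, y in rows if x[f] <= t],
                       depth + 1, max_depth),
            build_tree([(x, y) for x, y in rows if x[f] > t],
                       depth + 1, max_depth))

def classify(tree, x):
    """Walk from the root to the leaf into which x falls."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

train = [((1.0, 1.0), "A"), ((1.5, 0.5), "A"),
         ((4.0, 4.0), "B"), ((4.5, 3.5), "B")]
tree = build_tree(train)
print(classify(tree, (1.2, 0.8)))  # → A
print(classify(tree, (4.2, 3.8)))  # → B
```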