SMOTE for high-dimensional class-imbalanced data

BMC Bioinformatics

Table 1 Summary of the theoretical properties of SMOTE for high-dimensional data

Property	Consequence of using SMOTE on high-dimensional data
E(SMOTE) = E(X)	Little impact on classifiers that depend on mean values (DLDA);
var(SMOTE) = (2/3) var(X)	Minority class variability is underestimated; negative impact on classifiers that use class-specific variances (DQDA); inflated significance of statistical tests that compare classes (t-test);
d(SMOTE, TEST) < d(X, TEST) (d: Euclidean distance)	Test samples are classified mostly in the minority class for classifiers based on Euclidean distance (k-NN); variable selection is helpful in reducing this problem;
cor(SMOTE, X) ≥ 0; cor(SMOTE^s, SMOTE^t) ≥ 0	Training set samples are no longer independent; independence of samples is assumed by most classifiers (DLDA, PLR, ...) and variable selection methods (t-test, Mann-Whitney, ...)
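
The first two rows of Table 1 can be checked numerically with a small simulation. The sketch below is illustrative only and is not the authors' code. It looks at a single variable and assumes that the "neighbour" used for interpolation is an unrelated minority sample. This is a simplification meant to mimic the high-dimensional setting, whereas real SMOTE picks one of the k nearest neighbours in the full feature space. The function name `smote_like_samples` and all sample sizes are hypothetical choices made for this example.

```python
import random
import statistics


def smote_like_samples(minority, n_new, rng):
    """Generate synthetic values by SMOTE-style linear interpolation.

    ASSUMPTION: the interpolation partner is a randomly chosen *other*
    minority value, not a true k-nearest neighbour. This approximates the
    case where neighbours found in a high-dimensional space are effectively
    unrelated with respect to any single variable.
    """
    synthetic = []
    for _ in range(n_new):
        i, j = rng.sample(range(len(minority)), 2)  # two distinct indices
        u = rng.random()                            # u ~ Uniform(0, 1)
        synthetic.append(minority[i] + u * (minority[j] - minority[i]))
    return synthetic


rng = random.Random(0)

# One variable of the minority class, drawn from a standard normal.
minority = [rng.gauss(0.0, 1.0) for _ in range(2000)]
synthetic = smote_like_samples(minority, 20000, rng)

# Row 1 of Table 1: the mean is (approximately) preserved.
mean_shift = statistics.mean(synthetic) - statistics.mean(minority)

# Row 2 of Table 1: the variance shrinks towards 2/3 of the original.
var_ratio = statistics.pvariance(synthetic) / statistics.pvariance(minority)

print(f"mean shift:     {mean_shift:+.4f}")
print(f"variance ratio: {var_ratio:.4f}  (theory: {2 / 3:.4f})")
```

Under this independence assumption, a synthetic value is (1 - u)·x + u·x' with u uniform on (0, 1), so its variance is [E((1 - u)^2) + E(u^2)]·var(X) = (1/3 + 1/3)·var(X), which is where the factor 2/3 comes from. With true nearest neighbours in a low-dimensional space, x and x' would be similar and the ratio would not be expected to match this value.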

ISSN: 1471-2105