Table 1 Summary of the theoretical properties of SMOTE for high-dimensional data

From: SMOTE for high-dimensional class-imbalanced data

Property

Consequence of using SMOTE on high-dimensional data

E(SMOTE) = E(X)

Little impact on classifiers that depend on mean values (DLDA).

var(SMOTE) = (2/3) var(X)

Minority class variability is underestimated; negative impact on classifiers that use class-specific variances (DQDA); inflated significance of statistical tests used to compare classes (t-test).

d(SMOTE, TEST) < d(X, TEST), where d is the Euclidean distance

Classifiers based on Euclidean distance (k-NN) assign most test samples to the minority class; variable selection helps reduce this problem.

cor(SMOTE, X) ≥ 0; cor(SMOTE_s, SMOTE_t) ≥ 0

Training set samples are no longer independent; independence of samples is assumed by most classifiers (DLDA, PLR, ...) and variable selection methods (t-test, Mann-Whitney, ...).
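
The properties in the table can be checked empirically with a small simulation. The sketch below is illustrative only and is not the authors' code: it assumes independent standard normal variables, a simplified SMOTE that interpolates between a minority sample and one of its k = 5 nearest neighbours with a uniform weight, and arbitrary sizes (60 minority samples, 1000 variables, 2000 synthetic samples). The names smote, mean_dist and row_corr are hypothetical helpers introduced for this example.

```python
import numpy as np


def smote(X, n_new, k=5, rng=None):
    """Simplified SMOTE: interpolate between a sample and one of its k nearest neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of minority samples.
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)            # a sample is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]      # indices of the k nearest neighbours
    base = rng.integers(0, n, size=n_new)   # original sample behind each synthetic one
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    u = rng.random((n_new, 1))              # one uniform(0, 1) weight per synthetic sample
    return X[base] + u * (X[neigh] - X[base]), base


def mean_dist(A, B):
    """Average Euclidean distance between the rows of A and the rows of B."""
    d2 = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.sqrt(np.maximum(d2, 0.0)).mean()


def row_corr(A, B):
    """Pearson correlation between matching rows of A and B, computed across variables."""
    A = A - A.mean(axis=1, keepdims=True)
    B = B - B.mean(axis=1, keepdims=True)
    return (A * B).sum(axis=1) / np.sqrt((A ** 2).sum(axis=1) * (B ** 2).sum(axis=1))


rng = np.random.default_rng(0)
n, p = 60, 1000                             # assumed sizes: few samples, many variables
X = rng.standard_normal((n, p))             # minority class training samples
T = rng.standard_normal((200, p))           # test samples from the same distribution
S, base = smote(X, n_new=2000, rng=rng)     # synthetic minority samples

mean_gap = abs(S.mean() - X.mean())                       # row 1: should be close to 0
var_ratio = S.var(axis=0).mean() / X.var(axis=0).mean()   # row 2: should be near 2/3
d_smote, d_orig = mean_dist(T, S), mean_dist(T, X)        # row 3: d_smote below d_orig
mean_cor = row_corr(S, X[base]).mean()                    # row 4: positive correlation

print(f"mean gap: {mean_gap:.4f}")
print(f"variance ratio: {var_ratio:.3f}")
print(f"distance to test samples: SMOTE {d_smote:.2f}, original {d_orig:.2f}")
print(f"mean cor(SMOTE, X): {mean_cor:.3f}")
```

Under these assumptions the variance ratio is only approximately 2/3, because the nearest neighbours of a sample are not an exactly random draw from the minority class and the number of minority samples is finite. The correlation is computed between each synthetic sample and the original sample it was generated from, which is one of the two dependencies listed in the last row.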