A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data

Table 2. Average cross-validated prediction accuracy (%).

		No selection			Univariate selection			Multivariate selection (Gini importance)			Multivariate selection (PLS/PC)
		PLS	PC	RF	PLS	PC	RF	PLS	PC	RF	PLS	PC	RF
MIR BSE	orig	66.8	62.9	74.9	80.7	80.7	76.7	84.1	83.2	77.4	68	63.5	75.5
		-	-	-	***	***	*	***	***	**	**
	binned	72.7	73.4	75.3	80.4	80.7	76.6	86.8	85.8	77.3	85	82.1	75.6
		-	-	-	***	***	**	***	***	**	***	***
MIR wine	French	69.5	69.3	79.3	83.7	83.5	82.2	82.4	81	81.2	66.9	70.0	79.8
		-	-	-	***	**		***	**	*
	grape	77	71.4	90.2	98.1	98.7	90.3	98.4	98.4	94.2	91.7	88.5	90.4
		-	-	-	***	***		***	***	**	***	***
NMR tumor	all	88.8	89	89	89.3	89.3	90.5	90.0	89.6	89.6	89.3	89.2	89.1
		-	-	-	*		***	**		*
	center	71.6	72.3	73.1	73.9	72.7	73.9	72.6	72.0	74.3	71.8	72.7	73.3
		-	-	-	**			*
NMR candida	1	94.9	94.6	90.3	95.1	94.9	90.6	95.6	95.3	90.3	95.3	95.2	90.7
		-	-	-
	2	95.6	95.2	93.2	95.8	95.7	93.7	95.6	95.5	93.5	96.0	95.9	94.1
		-	-	-							*
	3	93.7	93.8	89.7	93.7	93.8	89.9	94.2	93.8	89.9	94.0	94.0	90.2
		-	-	-				*		*	*
	4	86.9	87.3	83.9	87.8	87.3	84.0	88.2	87.6	84.3	87.7	87.6	84.1
		-	-	-				*
	5	92.7	92.6	89.2	92.7	92.6	89.9	92.5	92.5	90.3	92.8	92.6	90.0
		-	-	-

The best classification results on each data set are underlined. Approaches that do not differ significantly from the optimal result (at a 0.05 significance level) are set in bold type (see methods section). Significant differences in the performance of a method compared with the same classifier without feature selection are marked with asterisks (* p-value < 0.05, ** p-value < 0.01, *** p-value < 0.001). The MIR data sets in this table benefit significantly from feature selection, whereas the NMR data sets do so only to a minor extent. Overall, feature selection by means of Gini importance in conjunction with a PLS classifier was successful in all cases and superior to the "native" classifier of Gini importance, the random forest, in all but one case.
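The asterisk notation in the legend can be written as a small lookup. The following is a minimal sketch, not code from the study: the function name `significance_stars` is hypothetical, and only the three thresholds (0.05, 0.01, 0.001) are taken from the legend.

```python
def significance_stars(p_value: float) -> str:
    """Map a p-value to the asterisk notation of the table legend.

    Hypothetical helper for illustration; only the thresholds come
    from the legend (* < 0.05, ** < 0.01, *** < 0.001).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # not significant at the 0.05 level: cell left empty


# Illustrative p-values only; these are not values reported in the study.
for p in (0.0004, 0.003, 0.02, 0.3):
    print(f"p = {p}: '{significance_stars(p)}'")
```

An empty string corresponds to the blank cells in the significance rows of the table, where a method does not differ significantly from the same classifier without feature selection.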

ISSN: 1471-2105