Figure 2 | BMC Bioinformatics

Figure 2

From: Using iterative cluster merging with improved gap statistics to perform online phenotype discovery in the context of high-throughput RNAi screens

Gap statistic curves for datasets with different sample numbers. Each curve represents an experiment on one real dataset, with twenty reference datasets defined from that real dataset. For each data point, the X value is the number of clusters defined on both the reference datasets and the real dataset, and the Y value is the gap statistic for that cluster number, defined as the average difference in within-cluster dispersion between the clustering results on the reference datasets and on the real dataset. The error bars around the data points show the variation across the reference datasets. The estimated cluster number is the X value of the first data point whose Y value is higher than the bottom of the error bar of its immediate right neighbor. In all experiments the "real" dataset consists of two clusters, and different reference datasets are used. Left: a uniform reference distribution is used; the sample numbers of the two clusters are equal or differ 2-fold, 3-fold and 5-fold (bottom to top), and the gap statistic works. Middle: a uniform reference distribution is used; the sample numbers differ 7-fold, 9-fold and 10-fold (bottom to top), and the gap statistic fails. Right: the two clusters differ 10-fold in sample number, a Gaussian distribution is used as the reference distribution for the cluster with the larger sample number, and the cluster number is estimated accurately.
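The selection rule described in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `gap_and_error` and `estimate_cluster_number` are hypothetical, the use of log dispersions and the `sqrt(1 + 1/B)` error inflation are assumptions borrowed from the standard gap statistic, and the within-cluster dispersions are assumed to have been computed beforehand by whatever clustering method is in use.

```python
import math
import statistics


def gap_and_error(real_dispersion, reference_dispersions):
    """Gap value and error-bar width for a single cluster number k.

    real_dispersion: within-cluster dispersion on the real dataset.
    reference_dispersions: the same quantity on each reference dataset
    (twenty of them in the figure).
    """
    log_real = math.log(real_dispersion)
    # Assumption: differences are taken between log dispersions,
    # as in the standard gap statistic.
    diffs = [math.log(w) - log_real for w in reference_dispersions]
    gap = statistics.fmean(diffs)
    # Assumption: the error bar is the population standard deviation across
    # references, inflated by sqrt(1 + 1/B) for B reference datasets.
    b = len(diffs)
    error = statistics.pstdev(diffs) * math.sqrt(1.0 + 1.0 / b)
    return gap, error


def estimate_cluster_number(gaps, errors, first_k=1):
    """Return the first k whose gap is higher than the bottom of the
    error bar of its right neighbor (k + 1), or None if no k qualifies.

    gaps[i] and errors[i] belong to cluster number first_k + i.
    The caption states a strict comparison, so `>` is used here; the
    standard formulation of the gap statistic uses `>=`.
    """
    for i in range(len(gaps) - 1):
        if gaps[i] > gaps[i + 1] - errors[i + 1]:
            return first_k + i
    return None


# Illustrative values only (not taken from the figure).
example_gaps = [0.10, 0.55, 0.50, 0.52]
example_errors = [0.02, 0.02, 0.02, 0.02]
print(estimate_cluster_number(example_gaps, example_errors))
```

With these illustrative values the curve rises from one to two clusters and then flattens, so the rule stops at the second data point. A curve that keeps rising across the whole tested range returns `None`, which is one way a failure like the middle panel could show up in practice.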