Figure 3 | BMC Bioinformatics

Figure 3

From: Microarray data mining: A novel optimization-based approach to uncover biologically coherent structures

Schematic showing the multi-stage clustering process for dataset III. The full set of genes, including dubious ORFs and control samples, is clustered by EP_GOS_Clust to yield 49 clusters. Clusters with correlation ≥ 0.5 are retained and split into two groups. Those with ≥ 60% of their member genes annotated as unknown biological function are set aside as group B. The second group is subjected to iterative clustering as described in Methods, with a threshold p-value of 10⁻⁴, yielding 21 clusters (group A). The remaining genes from the initial clustering process are first filtered to remove those with little correlation to any other gene or with limited expression. Genes passing the filter are subjected to EP_GOS_Clust, and the resulting clusters exhibiting expression correlation ≥ 0.5 are examined. Clusters that also have at least 30% of their genes annotated to a common function with a p-value less than 10⁻³ are retained as group C. Those with ≥ 50% of their member genes annotated as unknown biological function are set aside as group D. The remaining genes are once again clustered by EP_GOS_Clust, yielding one cluster with ≥ 40% of its member genes annotated as unknown biological function (group F) and several clusters with the indicated correlation, precision and coherence. The remaining 3,760 genes are then stringently filtered. Since the genes have already been subjected to clustering, we can assume that the most useful information has already been sieved out. The remaining 3,562 genes are probably all irrelevant, but we would still like to identify the genes that have significant levels of expression. We therefore count the genes that have at least a given proportion of their feature points falling within the data mean ± 0.5 × (standard deviation), and find that as this pre-determined proportion is decreased, the number of genes increases almost linearly until the 77% mark, after which it starts to grow exponentially. We take this to signify an increasing bulk of spurious genes and set the cut-off at 77% to extract 206 genes for further clustering. This yields the final group of clusters (group E).
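
The final filtering step can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: that "data mean" and "standard deviation" are computed over the whole expression matrix rather than per gene, that a gene is counted when its fraction of in-band feature points is at least the chosen proportion, and that the function names `fraction_within_band` and `count_genes_by_cutoff` are hypothetical.

```python
import numpy as np


def fraction_within_band(expr, half_width=0.5):
    """Per-gene fraction of feature points inside mean ± half_width * SD.

    expr: 2-D array, rows = genes, columns = feature points.
    Assumption: the mean and SD are taken over the entire matrix.
    """
    expr = np.asarray(expr, dtype=float)
    mu = expr.mean()
    sd = expr.std()
    inside = np.abs(expr - mu) <= half_width * sd
    return inside.mean(axis=1)


def count_genes_by_cutoff(expr, cutoffs, half_width=0.5):
    """Number of genes whose in-band fraction is at least each cutoff.

    Scanning cutoffs from high to low gives the curve described in the
    caption; the 77% cut-off there was chosen by inspecting that curve.
    """
    frac = fraction_within_band(expr, half_width)
    return {c: int((frac >= c).sum()) for c in cutoffs}


if __name__ == "__main__":
    # Toy data only: three genes, four feature points each.
    toy = np.array([
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 10.0],
        [10.0, -10.0, 10.0, -10.0],
    ])
    for cutoff, n in count_genes_by_cutoff(toy, [1.0, 0.75, 0.0]).items():
        print(f"proportion >= {cutoff:.2f}: {n} genes")
```

Plotting the returned counts against a fine grid of cutoffs is one way to look for the point at which growth changes from roughly linear to much faster.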
