Modeling and mining term association for improving biomedical information retrieval performance

BMC Bioinformatics

Table 5 Performance of GSP algorithm

	(k₁, b)	Geno 2007			Geno 2006		Geno 2005	Geno 2004	HARD 2004
		document	passage	passage2	document	passage	document	document	document	passage
GSP	(0.4,2.0)	0.1066	0.0338	0.0149	0.1892	0.0242	0.1867	0.2723	0.2358	0.2639
		(-1.87%)	(-98.75%)	(-58.28%)	(-7.09%)	(-25.95%)	(-4.96%)	(-7.74%)	(-3.72%)	(-0.15%)
	(0.5,1.3)	0.149	0.0843	0.0456	0.2855	0.0466	0.2423	0.3165	0.2562	0.3001
		(-6.18%)	(-86.59%)	(-36.85%)	(-8.17%)	(-26.31%)	(-6.88%)	(-7.01%)	(-8.57%)	(-0.54%)
	(1.0,1.0)	0.1839	0.0898	0.0357	0.2757	0.0402	0.2385	0.3166	0.2501	0.2842
		(-3.32%)	(-0.60%)	(-9.21%)	(-5.46%)	(-19.40%)	(-6.36%)	(-7.55%)	(-0.83%)	(-4.56%)
	(1.2,0.75)	0.1905	0.0714	0.0658	0.3174	0.0404	0.2655	0.3293	0.2589	0.2776
		(-5.35%)	(-10.11%)	(-13.79%)	(-6.11%)	(-11.65%)	(-7.62%)	(-8.11%)	(-1.07%)	(-0.65%)
	(2.0,0.4)	0.1931	0.0657	0.0667	0.3203	0.0403	0.2588	0.3206	0.2567	0.2916
		(-4.62%)	(-3.79%)	(-4.02%)	(-7.85%)	(-11.40%)	(-6.89%)	(-7.96%)	(-8.65%)	(-0.73%)
	Best	0.1931	0.0898	0.0667	0.3203	0.0466	0.2655	0.3293	0.2589	0.3001
Baselines	Best	0.2108	0.0963	0.0641	0.3529	0.0718	0.2874	0.3584	0.281	0.2985
TA	Best	0.2724	0.1611	0.0762	0.3549	0.101	0.3085	0.3606	0.2845	0.3031

The GSP algorithm is adopted as a comparison to the proposed approach: (1) after mapping the GSP algorithm to our research problem, the candidate 1-sequences are all the keywords, and the candidate k-sequences are generated from the frequent (k - 1)-sequences; (2) the counts of the candidates are modeled as a non-parametric distribution, and the lower bound of its 95% confidence interval is used as the minimum support value for the GSP algorithm; (3) only the paragraph index under five parameter settings of (k₁, b) is considered; (4) the best results of the GSP algorithm are compared with the best results of the baselines and of the proposed term association approach; (5) "TA" stands for term association; (6) the values in parentheses are the relative rates of improvement over the original baselines (negative values indicate a decrease).
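
Notes (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It rests on several assumptions that the notes do not confirm: each paragraph is an ordered list of keywords, a candidate is supported by a paragraph when its terms occur there in order with gaps allowed, the lower bound of the 95% confidence interval is read as the 2.5th empirical percentile of the candidate counts, and the threshold is recomputed at every level. All function names, the `max_len` cutoff, and the toy data are hypothetical.

```python
import math

def is_subsequence(candidate, paragraph):
    """True if the terms of `candidate` occur in `paragraph` in order (gaps allowed)."""
    it = iter(paragraph)
    return all(term in it for term in candidate)

def support(candidate, paragraphs):
    """Number of paragraphs that contain `candidate` as a subsequence."""
    return sum(is_subsequence(candidate, p) for p in paragraphs)

def lower_percentile(counts, q=0.025):
    """Nearest-rank empirical percentile (assumed reading of the 95% CI lower bound)."""
    ordered = sorted(counts)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def join(frequent_prev):
    """GSP join and prune: build k-sequences from frequent (k-1)-sequences."""
    prev = set(frequent_prev)
    candidates = set()
    for s1 in prev:
        for s2 in prev:
            # Join when s1 without its first term equals s2 without its last term.
            if s1[1:] == s2[:-1]:
                cand = s1 + s2[-1:]
                # Prune: every (k-1)-subsequence of the candidate must be frequent.
                if all(cand[:i] + cand[i + 1:] in prev for i in range(len(cand))):
                    candidates.add(cand)
    return candidates

def gsp(paragraphs, keywords, q=0.025, max_len=3):
    """Level-wise mining; returns {sequence: support} for all frequent sequences."""
    candidates = {(k,) for k in keywords}  # the 1-sequences are all the keywords
    frequent = {}
    length = 1
    while candidates and length <= max_len:
        counts = {c: support(c, paragraphs) for c in candidates}
        # Assumed threshold: percentile of this level's counts, never below 1.
        min_sup = max(1, lower_percentile(counts.values(), q))
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        candidates = join(level)
        length += 1
    return frequent

# Hypothetical toy paragraph index.
paragraphs = [
    ["gene", "protein", "disease"],
    ["gene", "disease"],
    ["protein", "gene", "disease"],
]
result = gsp(paragraphs, ["gene", "protein", "disease"])
```

On a toy index this small, the percentile collapses to the smallest observed count, so the threshold only removes sequences that never occur. On a realistic paragraph index the same threshold would discard the rarest candidates at each level.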

ISSN: 1471-2105