Figure 4 | BMC Bioinformatics

Figure 4

From: Building a protein name dictionary from full text: a machine learning term extraction approach

Steps involved in constructing the catalog of protein references. Terms are shown enclosed in rectangular boxes. Terms may occur in the context of a sentence (on a horizontal line, left) or in an article (right).

Step 1: Articles are split into sentences, and sentences are split into tokens. Tokens roughly correspond to words (see text for details). High-frequency tokens that are not eliminated by the exclusion lists (see Figure 1) are grouped into n-grams. In the figure, APE1/ref-1 is an n-gram that consists of two tokens, APE1 and ref-1; it can be recognized if the two tokens frequently co-occur in sequence in a full-length article. Once the terms are recognized, each occurrence of a term in the sentences of the article is identified.

Step 2: Machine learning features are calculated from the context of the term in the article (see text for details), and the support vector machine (SVM) model classifies each context of the term, yielding one score per context. In our experimental setup, scores with a small absolute value suggest that the context provides little evidence that the term refers to a protein, while scores with a larger absolute value indicate more support.

Step 3: We calculate the combined score (Sc) as the sum of the scores over all occurrences of a given term in a given article. The final catalog consists of a table with one row per term and article. Each row has three columns: PubMedID, term, and Sc.
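The aggregation in Step 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name build_catalog, the input layout (one tuple per scored occurrence), and the sample identifiers and scores are all assumptions made for illustration. It only shows how per-occurrence scores could be summed into Sc, giving one row per article and term.

```python
from collections import defaultdict


def build_catalog(scored_occurrences):
    """Sum per-occurrence scores into one combined score (Sc) per (article, term).

    scored_occurrences: iterable of (pubmed_id, term, score) tuples, one per
    occurrence of a term in an article. This input layout is an assumption.
    Returns a list of (pubmed_id, term, sc) rows sorted by article and term.
    """
    totals = defaultdict(float)
    for pubmed_id, term, score in scored_occurrences:
        totals[(pubmed_id, term)] += score
    return [(pubmed_id, term, sc) for (pubmed_id, term), sc in sorted(totals.items())]


# Made-up identifiers and scores, for illustration only.
occurrences = [
    ("PMID_A", "APE1/ref-1", 1.5),
    ("PMID_A", "APE1/ref-1", 0.5),
    ("PMID_A", "buffer", 0.25),
    ("PMID_B", "APE1/ref-1", 2.0),
]

catalog = build_catalog(occurrences)
for row in catalog:
    print(row)
```

Keying the running totals on the (PubMedID, term) pair mirrors the catalog layout described in the caption, where the same term appearing in two articles yields two separate rows.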
