Figure 4 | BMC Bioinformatics

Figure 4

From: Moara: a Java library for extracting and normalizing gene and protein mentions

Editing procedures for the generation of mention and synonym variations. Two examples of the editing procedures are shown in detail. Non-repeated variations returned by the system are shown in green and repeated variations in orange. Only the procedures that change the examples are shown. In general, the mentions (or synonyms) are first split at parentheses and then into parts that are meaningful on their own. These parts are tokenized at numbers, Greek letters and any other symbols (e.g. hyphens), and the tokens are then ordered alphabetically. Gradual filtering follows, starting with stopwords and continuing with the BioThesaurus terms. The BioThesaurus terms are filtered according to their frequency in the lexicon, starting with the most frequent terms (frequency higher than 10,000) and proceeding to the least frequent ones (frequency of at least one).
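
The steps described in the caption can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not Moara's actual implementation or API: the class `MentionVariationSketch`, its methods, the stopword list, the term frequencies and the intermediate thresholds (1,000, 100, 10) are all hypothetical stand-ins. Only the overall order of operations and the two end points of the frequency filter (higher than 10,000, at least one) come from the caption. The splitting of parts "that are meaningful on their own" is not modelled, and only a few spelled-out Greek letter names are recognized.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MentionVariationSketch {
    // Hypothetical stopwords; the real list is not given in the caption.
    static final Set<String> STOPWORDS = Set.of("of", "the", "for", "and");

    // Hypothetical lexicon frequencies standing in for BioThesaurus counts.
    static final Map<String, Integer> TERM_FREQUENCY =
            Map.of("protein", 50_000, "kinase", 5_000, "receptor", 800);

    // A term is dropped once its frequency is higher than the current threshold.
    // 10,000 and 0 ("at least one") follow the caption; the steps between are assumed.
    static final int[] THRESHOLDS = {10_000, 1_000, 100, 10, 0};

    // Simplification: only these spelled-out Greek letter names are split off.
    static final Pattern GREEK = Pattern.compile("alpha|beta|gamma|delta|kappa");
    static final Pattern TOKEN = Pattern.compile("\\d+|[a-z]+");

    // Tokenize one part at numbers, Greek letters and symbols, then sort the tokens.
    static List<String> tokenize(String part) {
        String spaced = GREEK.matcher(part.toLowerCase(Locale.ROOT)).replaceAll(" $0 ");
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(spaced);
        while (m.find()) {
            tokens.add(m.group());
        }
        Collections.sort(tokens);
        return tokens;
    }

    // Return the non-repeated variations in the order in which they are produced.
    static Set<String> variations(String mention) {
        Set<String> result = new LinkedHashSet<>();
        for (String part : mention.split("[()]")) {      // split at parentheses
            List<String> tokens = tokenize(part);
            addVariation(result, tokens);
            tokens.removeIf(STOPWORDS::contains);        // first filter: stopwords
            addVariation(result, tokens);
            for (int threshold : THRESHOLDS) {           // then frequent terms, gradually
                tokens.removeIf(t -> TERM_FREQUENCY.getOrDefault(t, 0) > threshold);
                addVariation(result, tokens);
            }
        }
        return result;
    }

    // The set silently discards repeated variations; empty token lists are skipped.
    static void addVariation(Set<String> result, List<String> tokens) {
        if (!tokens.isEmpty()) {
            result.add(String.join(" ", tokens));
        }
    }
}
```

With the toy data above, `MentionVariationSketch.variations("TGF-beta receptor 1 (protein kinase)")` yields the variations `1 beta receptor tgf`, `1 beta tgf`, `kinase protein` and `kinase`, in that order. A `LinkedHashSet` is used so that repeated variations (the orange ones in the figure) are dropped while the order of generation is preserved.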
