  • Methodology article
  • Open access

A relation based measure of semantic similarity for Gene Ontology annotations

Abstract

Background

Various measures of semantic similarity of terms in bio-ontologies such as the Gene Ontology (GO) have been used to compare gene products. Such measures of similarity have been used to annotate uncharacterized gene products and group gene products into functional groups. There are various ways to measure semantic similarity, either using the topological structure of the ontology, the instances (gene products) associated with terms or a mixture of both. We focus on an instance level definition of semantic similarity while using the information contained in the ontology, both in the graphical structure of the ontology and the semantics of relations between terms, to provide constraints on our instance level description.

Semantic similarity of terms is extended to annotations by various approaches, either through aggregation operations such as min, max and average or through an extrapolative method. These approaches introduce assumptions about how semantic similarity of terms relates to the semantic similarity of annotations that do not necessarily reflect how terms relate to each other.

Results

We exploit the semantics of relations in the GO to construct an algorithm called SSA that provides the basis of a framework that naturally extends instance-based methods of semantic similarity of terms, such as Resnik's measure, to describing annotations and not just terms. Our measure attempts to correctly interpret how terms combine via their relationships in the ontological hierarchy. SSA uses these relationships to identify the most specific common ancestors between terms. We outline the set of cases in which terms can combine and associate partial order constraints with each case that order the specificity of terms. These cases form the basis for the SSA algorithm. The set of associated constraints also provides a set of principles that any improvement on our method should seek to satisfy.

Conclusion

We derive a measure of semantic similarity between annotations that exploits all available information without introducing assumptions about the nature of the ontology or data. We preserve the principles underlying instance-based methods of semantic similarity of terms at the annotation level. As a result our measure better describes the information contained in annotations associated with gene products and is therefore better suited to characterizing and classifying gene products through their annotations.

Background

Although the semantic similarity between two GO terms has been extensively investigated [1–4], how to define similarity between two gene products based on GO annotations for a specific application remains unclear [5]. To date annotation similarity has been computed by four general approaches: the set-based approach; the graph-based approach; the vector-based approach; and the term-based approach. In the set-based approach an annotation is viewed as a 'bag of words'. Two annotations are similar if there is a large overlap between their sets of terms. A graph-based approach views similarity as a graph-matching procedure. Vector-based methods embed annotations in a vector space where each possible term in the ontology forms a dimension. Term-based approaches compute similarity between individual terms and then combine these similarities to produce a measure of annotation similarity.

None of the above approaches considers the semantics of relationships between terms. How terms are related can significantly alter how an annotation, which is a set of terms, is interpreted. In the GO there are two main types of relations: is_a and part_of. The is_a relation represents a taxonomic relationship between terms that can be modeled using the improper subset relation, which is a partial ordering of terms. The part_of relation represents a partonomic relationship between terms that can also be modeled in terms of a partial order. Though the partial orders represented by taxonomies and partonomies are well understood, little attention has been given to how these two partial orderings combine. Using the various cases identified by combining taxonomies and partonomies we construct an algorithm called SSA (Semantic Similarity of Annotations) that identifies the terms that can be associated with an annotation and the terms that relate to both annotations. Instances associated with these terms are then used to construct a Resnik-like measure of annotation similarity, thus extending the underlying intuitions behind this term-based measure to the annotation level.

A measure of term or annotation similarity should be based on a set of principles that form the basis for what is considered similar. The nature of similarity has been the focus of intense research in the areas of aesthetics [6, 7] and psychology [8]. In mathematics properties such as identity, symmetry and the triangle inequality have been used to form the basis of measures of similarity of mathematical objects. Principles of term and annotation similarity have been suggested by various authors. This work intends to build on these principles and introduce additional principles that a measure of similarity should seek to satisfy.

Similarity between objects is normally expressed as a number that ranges along an interval of the real numbers. However the main purpose of similarity is usually to determine whether two or more objects are similar to a reference object. For this reason a measure of similarity can be viewed as a partial order on a set of objects; the actual numbers play only a secondary role. For example, we may say that an object X is more similar to Z than another object Y. Formally this is expressed as sim(X, Z) > sim(Y, Z).

In the study of ontological similarity Lin [9] develops the principles of commonality and difference when constructing a measure of term similarity. The greater the commonality between objects the greater the similarity. Likewise, the greater the difference between objects the greater the dissimilarity. The source of both the commonality and difference between terms depends on the method chosen to measure the descriptiveness of terms. Different sources of descriptiveness may result in different orderings of similarity between terms or annotations.

Popescu et al. [10] recognize that an important property of term similarity is that two different terms should have a non-zero similarity value if the terms are related. They also recognize that an important property of annotation similarity is that the descriptiveness of an annotation should be greater than or equal to the descriptiveness of each of its constituent terms. In this paper this property is called the monotonicity property.

In defining a measure of similarity, a set of relevant properties along which objects can be compared is identified. In ontological similarity, whether of terms or annotations, there are two main sources of similarity: the conceptual or structural level; and the instance level. At the structural level we may consider such properties as graph distance, graph similarity, relation types, common ancestors, etc. At the instance level we consider the set of instances associated with a term or annotation. Our measure of ontological similarity combines aspects from both levels. Here we survey how various measures of annotation similarity combine these properties to form the basis for a measure of descriptiveness of a term or annotation.

Set-Based Approaches

Set based methods for measuring the similarity of annotations are based on the Tversky ratio model of similarity [8, 11] which is a general model of distance between sets of terms. It is represented by the formula

$$\frac{f(G_1 \cap G_2)}{f(G_1 \cap G_2) + \alpha * f(G_1 - G_2) + \beta * f(G_2 - G_1)}$$

where $G_1$ and $G_2$ are sets of terms or annotations from the same ontology and $f$ is an additive function on sets (usually set cardinality). For α = β = 1 we get the Jaccard similarity between sets:

$$S_{Jaccard} = \frac{f(G_1 \cap G_2)}{f(G_1 \cup G_2)}$$

and for α = β = 1/2 we get the Dice similarity between sets [11]:

$$S_{Dice} = \frac{2 * f(G_1 \cap G_2)}{f(G_1) + f(G_2)}$$

In this situation the source of descriptiveness of an annotation is its set of terms. Each term and its set of associated instances is considered independent of other terms. The commonality and difference between annotations are modeled as the set intersection and set difference of their sets of terms respectively. Set-based approaches return a similarity of zero if the two annotations share no common terms, ignoring the fact that their terms may be closely related. Because of the atomic nature of terms in the set-based approach the monotonicity property does not apply.
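The ratio model and its two special cases can be illustrated with a short Python sketch. It assumes that $f$ is set cardinality and that an empty denominator yields a similarity of zero; the term names and the function names `tversky`, `jaccard` and `dice` are hypothetical and chosen only for illustration.

```python
def tversky(g1, g2, alpha, beta, f=len):
    """Tversky ratio model for two sets of terms; f is an additive set function."""
    g1, g2 = set(g1), set(g2)
    common = f(g1 & g2)
    denom = common + alpha * f(g1 - g2) + beta * f(g2 - g1)
    # Assumption: two empty annotations are given a similarity of 0.
    return common / denom if denom else 0.0


def jaccard(g1, g2):
    """Special case alpha = beta = 1."""
    return tversky(g1, g2, 1.0, 1.0)


def dice(g1, g2):
    """Special case alpha = beta = 1/2."""
    return tversky(g1, g2, 0.5, 0.5)


# Hypothetical annotations (term names invented for illustration).
a = {"T1", "T2", "T3"}
b = {"T2", "T3", "T4", "T5"}
print(jaccard(a, b))  # 2 shared terms out of 5 distinct terms
print(dice(a, b))     # 2 * 2 / (3 + 4)
```

Annotations with no shared terms score zero under both functions, however closely related their terms are in the ontology, which is the limitation noted above.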

Vector-Based Approaches

Vector-based methods embed ontological terms in a vector space by associating each term with a dimension. Usually a vector is binary, consisting of 0's and 1's, where 0 (resp. 1) denotes the absence (resp. presence) of a term (along a particular dimension) in an annotation. This has the advantage that standard clustering techniques on vector spaces such as k-means can be applied to group similar terms. What is required is a means of measuring the size of vectors. This can be achieved by embedding terms in a metric space (usually Euclidean). The most common method of measuring similarity between vectors of terms is the cosine similarity

$$s_v(G_1, G_2) = \frac{v_1 \cdot v_2}{|v_1| |v_2|}$$

where $v_i$ represents a vector of terms constructed from an annotation (group of terms) $G_i$, |·| corresponds to the size of the vector and · corresponds to the dot product between two vectors. The source of descriptiveness, commonality and difference is the same as the situation for set-based approaches.
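As a minimal sketch of the vector-based approach, the following Python code builds binary vectors over a fixed vocabulary and computes their cosine similarity. The vocabulary, the term names and the helper names are hypothetical, and returning zero for an all-zero vector is a convention assumed here.

```python
import math


def to_binary_vector(annotation, vocabulary):
    """One dimension per ontology term: 1 if the term is present, else 0."""
    return [1 if term in annotation else 0 for term in vocabulary]


def cosine_similarity(v1, v2):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(y * y for y in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for an empty annotation
    return dot / (norm1 * norm2)


# Hypothetical ontology vocabulary and annotations.
vocabulary = ["T1", "T2", "T3", "T4", "T5"]
v1 = to_binary_vector({"T1", "T2", "T3"}, vocabulary)
v2 = to_binary_vector({"T2", "T3", "T4", "T5"}, vocabulary)
print(cosine_similarity(v1, v2))  # 2 / (sqrt(3) * 2)
```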

Graph-Based Approaches

An ontology is a directed, acyclic graph (DAG) whose edges correspond to relationships between terms. Thus it is natural to compare terms using methods for graph matching and graph similarity. We may consider the similarity between annotations in terms of the sub-graph that connects terms within each annotation. Annotation similarity is then measured in terms of similarity between two graphs. Graph matching has only a weak correlation with similarity between terms. It is also computationally expensive to compute, graph matching being an NP-complete problem on general graphs [12].

The descriptiveness of an annotation is modeled by the set of nodes and edges associated with a subgraph. Commonality between annotations is based on the set intersection while difference is modeled by the set difference where each set consists of the nodes and edges associated with each subgraph. Alternatively, the set of edges may be ignored and only common terms of both graphs are considered [13–15].

Improving Similarity Measures by Weighting Terms

Set, vector and graph-based methods for measuring similarity between annotations can be improved by introducing a weighting function into the similarity measure. For example, the weighted Jaccard similarity can be formulated as:

$$S_{WeightedJaccard}(G_1, G_2) = \frac{\sum_{T_i \in G_1 \cap G_2} m(T_i)}{\sum_{T_j \in G_1 \cup G_2} m(T_j)}$$

where, as before, $G_1$ and $G_2$ are annotations or sets of terms describing data (e.g. a gene product), $T_x$ is the $x$-th term from a set of terms and $m(T_x)$ denotes the weight of $T_x$. This weighting function can be used to represent various properties of a term or annotation such as a measure of vagueness, uncertainty, sense of preference or a combination of the above. The vector-based approach may be extended so that values along a particular dimension can lie on the interval [0, 1] or [0, ∞). The graph-based approach can be extended by weighting the edges between terms in the graph.
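A small Python sketch of the weighted Jaccard formula follows. The weights are invented for illustration and stand in for whatever weighting function $m$ is chosen; the function name is hypothetical.

```python
def weighted_jaccard(g1, g2, weight):
    """Sum of weights over shared terms divided by sum of weights over all terms."""
    g1, g2 = set(g1), set(g2)
    union_weight = sum(weight[t] for t in g1 | g2)
    if union_weight == 0:
        return 0.0  # assumed convention when no term carries weight
    return sum(weight[t] for t in g1 & g2) / union_weight


# Hypothetical term weights m(T).
weights = {"T1": 0.5, "T2": 1.0, "T3": 2.0, "T4": 0.5}
print(weighted_jaccard({"T1", "T2", "T3"}, {"T2", "T3", "T4"}, weights))  # 3.0 / 4.0
```

With all weights equal to 1 the function reduces to the unweighted Jaccard formula.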

Assigning a weight to each term in an annotation allows for the possibility of introducing the monotonicity property into a similarity measure. Using the monotonicity property, the weight associated with an annotation should be greater than or equal to the weight associated with any of its constituent terms. Weights can form an additional basis on which to measure the descriptiveness of a term or annotation.

Instance-Based Weights

One approach to assigning weight to an ontological term is to measure how informative a term is in describing data. A method of measuring information is to analyze a term's use in a corpus against the general use of ontological terms in the same corpus. Information is measured using the surprisal function:

$$IC_{Corpus}(T_i) = -\log(p(T_i)) \tag{1}$$

where $p(T_i)$ corresponds to the probability of a term $T_i$ or its taxonomic descendants occurring in a corpus. For example, consider the case where there are 30 distinct instances in a corpus and 5, 3 and 2 of these instances are annotated by the terms $T_i$, $T_j$ and $T_k$ respectively. If $T_j$ and $T_k$ are sub-types or children of $T_i$ and do not have child terms themselves then $IC_{Corpus}(T_i) = -\log\left(\frac{5 + 3 + 2}{30}\right) \approx 1.099$.
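The worked example can be reproduced with a few lines of Python. This is a sketch under two assumptions: each instance is counted under exactly one term, and the natural logarithm is used (which is what gives 1.099). The data structures and function names are hypothetical.

```python
import math


def descendants_and_self(term, children):
    """Return the term together with all of its taxonomic descendants."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(children.get(t, []))
    return seen


def ic_corpus(term, direct_counts, children, total_instances):
    """Surprisal of a term, counting instances of the term and its descendants."""
    covered = descendants_and_self(term, children)
    count = sum(direct_counts.get(t, 0) for t in covered)
    # Assumption: count > 0; a term with no instances has undefined surprisal.
    return -math.log(count / total_instances)


# The example from the text: 30 instances, 5/3/2 annotated by Ti/Tj/Tk.
children = {"Ti": ["Tj", "Tk"]}
direct_counts = {"Ti": 5, "Tj": 3, "Tk": 2}
print(round(ic_corpus("Ti", direct_counts, children, 30), 3))  # 1.099
```

The leaf term $T_j$ is more informative than its parent because it covers fewer instances (3 of 30 rather than 10 of 30).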

Other Weighting Approaches

Other measures of information that do not rely on corpus data can also be used. One measure [16] relies on the assumption that how the ontology is constructed is semantically meaningful:

$$IC_{Ont}(T_i) = 1 - \frac{\log(desc(T_i) + 1)}{\log(numTerms)}$$

where $desc(T_i)$ returns the number of descendants of term $T_i$ and $numTerms$ refers to the total number of terms in the ontology.
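The structure-only measure can be sketched in Python as follows. The four-term ontology is a toy example invented for illustration, and the function names are hypothetical.

```python
import math


def count_descendants(term, children):
    """Number of distinct descendants of a term (the term itself is excluded)."""
    seen, stack = set(), list(children.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(children.get(t, []))
    return len(seen)


def ic_ont(term, children, num_terms):
    """Information content derived from ontology structure alone."""
    return 1.0 - math.log(count_descendants(term, children) + 1) / math.log(num_terms)


# Toy ontology: root -> {A, B}, A -> {C}; four terms in total.
children = {"root": ["A", "B"], "A": ["C"]}
num_terms = 4
for term in ["root", "A", "C"]:
    print(term, round(ic_ont(term, children, num_terms), 3))
```

In this toy ontology the root, which has every other term as a descendant, scores 0, the leaf C scores 1, and A lies in between at 0.5.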

Term-Based Approaches

In term-based approaches the similarity between pairs of terms, one from each annotation, is computed. These similarities are then combined in order to characterize the similarity between annotations as a whole. There are several ways to combine similarities of pairs of terms, such as the min, max or average operations. Term-based approaches depend on a function $s(T_i, T_j)$ where $T_i$ and $T_j$ are terms from two annotations $G_1$ and $G_2$ respectively. $s(T_i, T_j)$ provides a measure of distance/similarity between these two terms. Once distances have been measured between all possible pairs of terms they are aggregated using an operation such as max or the average of all distances. For example:

$$S_{avg}(G_1, G_2) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} s(T_i, T_j)}{m * n}$$

More sophisticated term based approaches combine multiple measures of term similarity and aggregate similarity values using more complex functions, for example [17].
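The simple aggregation operations (average, max, min) can be sketched in Python as below. The term similarity function is passed in as a parameter; the lookup table `toy_scores` is invented purely for illustration and does not come from any real ontology.

```python
from itertools import product


def aggregate_similarity(g1, g2, term_sim, how="avg"):
    """Combine the pairwise term similarities of two annotations."""
    scores = [term_sim(ti, tj) for ti, tj in product(g1, g2)]
    if not scores:
        return 0.0  # assumed convention for empty annotations
    if how == "avg":
        return sum(scores) / len(scores)
    if how == "max":
        return max(scores)
    if how == "min":
        return min(scores)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical pairwise term similarities.
toy_scores = {("T1", "T3"): 0.2, ("T1", "T4"): 0.8,
              ("T2", "T3"): 0.5, ("T2", "T4"): 0.1}


def toy_sim(ti, tj):
    return toy_scores[(ti, tj)]


g1, g2 = ["T1", "T2"], ["T3", "T4"]
print(aggregate_similarity(g1, g2, toy_sim, "avg"))  # (0.2 + 0.8 + 0.5 + 0.1) / 4
print(aggregate_similarity(g1, g2, toy_sim, "max"))  # 0.8
```

The same annotations receive quite different scores depending on the operation chosen, which illustrates how the aggregation step introduces its own assumptions.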

Graphical Measures of Term Similarity

The simplest approach to measuring similarity between ontological terms using the graph structure is to measure the shortest path distance between terms in the graph [18, 19]. Referring to figure 1, in terms of graph distance, we may consider the terms 'muscle cell proliferation' and 'fibroblast cell proliferation' (graph distance of 2) as being more similar than the former term with 'fibroblast regulation' (graph distance of 3). However the graph distance has only a weak correlation with similarity of terms. The semantic similarity between 'positive fibroblast regulation' and 'negative fibroblast regulation' is far greater than the similarity between 'muscle cell proliferation' and 'fibroblast cell proliferation' even though both examples have a graph distance of two. A simple graph distance-based measure of similarity does not model in a consistent way any notion of commonality or difference between terms.

Figure 1. An example of an ontology of GO terms. Nodes in the graph correspond to ontological terms and edges correspond to relations between terms. Terms lower in the diagram are descendants of the terms higher in the diagram to which they are connected by an edge.

A more refined use of graph distance as a basis for a measure of term similarity is found in the Wu-Palmer measure of similarity [20]. It uses the idea that the distance from the root to the lowest common taxonomic ancestor (LCTA) measures the commonality between two terms while the sum of the distance between the LCTA and each term measures the difference between two terms. Combining these aspects results in the formula:

$$s_{Wu-Palmer}(T_1, T_2) = \frac{2 * dist(T_{lcta}, T_{root})}{dist(T_1, T_{lcta}) + dist(T_2, T_{lcta}) + 2 * dist(T_{lcta}, T_{root})}$$

where $T_1$ and $T_2$ are the two terms being compared and $T_{lcta}$ is the term that corresponds to the lowest common taxonomic ancestor of $T_1$ and $T_2$. $T_{root}$ denotes the root node of the ontology (assuming that the ontology has only one root). $dist(T_i, T_j)$ denotes the graph distance between terms $T_i$ and $T_j$. The $2 * dist(T_{lcta}, T_{root})$ component of the denominator serves to normalize the measure.
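A minimal Python sketch of the Wu-Palmer measure is given below. To keep it short it assumes a tree-shaped taxonomy in which every term has a single is_a parent, so that the lowest common taxonomic ancestor is unique; a general DAG would need a more careful search. The toy taxonomy and all names are hypothetical.

```python
def path_to_root(term, parent):
    """List of terms from the given term up to the root (inclusive)."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path


def wu_palmer(t1, t2, parent):
    """Wu-Palmer similarity on a single-parent taxonomy."""
    p1, p2 = path_to_root(t1, parent), path_to_root(t2, parent)
    ancestors2 = set(p2)
    lcta = next(t for t in p1 if t in ancestors2)  # lowest common ancestor
    d1, d2 = p1.index(lcta), p2.index(lcta)        # dist(T1, lcta), dist(T2, lcta)
    d_root = len(path_to_root(lcta, parent)) - 1   # dist(lcta, root)
    denom = d1 + d2 + 2 * d_root
    # Assumption: comparing the root with itself is given similarity 1.
    return 2 * d_root / denom if denom else 1.0


# Toy taxonomy (child -> parent): root -> {A, B}, A -> {C, D}.
parent = {"A": "root", "B": "root", "C": "A", "D": "A"}
print(wu_palmer("C", "D", parent))  # lcta is A: 2*1 / (1 + 1 + 2*1) = 0.5
print(wu_palmer("C", "B", parent))  # lcta is the root, so the similarity is 0.0
```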

Instance-Based Measures of Term Similarity

Similarity may be measured using an instance-based measure of semantic similarity as computed by either Resnik (eqn. 2) or Lin (eqn. 3). Resnik [21, 22] exploits the informativeness of the lowest common ancestor between terms as a measure of semantic similarity:

$$s_{Resnik}(T_i, T_j) = IC_{Corpus}(T_{lcta}) \tag{2}$$

where $T_{lcta}$ denotes the lowest common taxonomic ancestor between ontological terms $T_i$ and $T_j$. This measure only accounts for the commonality between terms.

Another method of measuring similarity derived by Lin [9] is:

$$s_{Lin}(T_i, T_j) = \frac{2 * IC_{Corpus}(T_{lcta})}{IC_{Corpus}(T_i) + IC_{Corpus}(T_j)} \tag{3}$$

which has the advantage that it maps onto values on the interval [0, 1], unlike Resnik's measure, which maps onto the interval [0, ∞). Lin's measure also accounts for both the commonality and difference between terms. However, Resnik's measure does have the desirable property that terms close to the root of the ontology have a low similarity. This is not the case for Lin's measure.
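The contrast between the two measures can be seen in a short Python sketch. It assumes precomputed information content values and, for brevity, a single-parent taxonomy; the IC values, the toy taxonomy and the function names are hypothetical.

```python
def lowest_common_ancestor(t1, t2, parent):
    """Lowest common taxonomic ancestor on a single-parent taxonomy."""
    ancestors1 = [t1]
    while ancestors1[-1] in parent:
        ancestors1.append(parent[ancestors1[-1]])
    t = t2
    while t not in ancestors1:
        t = parent[t]
    return t


def resnik(t1, t2, ic, parent):
    """Information content of the lowest common ancestor."""
    return ic[lowest_common_ancestor(t1, t2, parent)]


def lin(t1, t2, ic, parent):
    """Shared information content normalized by the terms' own content."""
    denom = ic[t1] + ic[t2]
    if denom == 0:
        return 0.0  # assumed convention when both terms carry no information
    return 2 * ic[lowest_common_ancestor(t1, t2, parent)] / denom


# Toy taxonomy (child -> parent) with invented information content values.
parent = {"A": "root", "C": "A", "D": "A"}
ic = {"root": 0.0, "A": 1.0, "C": 2.0, "D": 3.0}
print(resnik("C", "D", ic, parent))  # IC of A: 1.0
print(lin("C", "D", ic, parent))     # 2 * 1.0 / (2.0 + 3.0) = 0.4
print(resnik("A", "A", ic, parent), lin("A", "A", ic, parent))  # 1.0 1.0
```

The last line shows the behaviour discussed above: a term close to the root compared with itself has a low Resnik similarity (its own information content) but a Lin similarity of 1.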

The only structural property that both Resnik and Lin exploit is the lowest common taxonomic ancestor. To overcome this weakness Jiang and Conrath [23] integrate graph distance based measures of similarity into information based approaches. They construct a generalized weighting measure between a child and its immediate parent that accounts for the number of out edges and depth of terms along the shortest path between the compared terms in the ontology. While they acknowledge that other relation types might be relevant to measuring similarity their measure is based solely on the taxonomic or is_a relations in the ontology.

New Approaches to Annotation Similarity

Beyond the set, vector, graph and term-based approaches to measuring similarity of annotations, there exist other methods that introduce the additional properties discussed above, such as monotonicity and taking into account the semantics of ontological relations.

Similarity Based on Fuzzy Measures

The monotonicity property leads naturally to the use of fuzzy measures as a basis for measuring the descriptiveness of an annotation. Using the information content measure of terms described in eqn. 1 as the basis for measuring similarity a fuzzy measure is constructed. A fuzzy measure is a weighting on sets of terms such that the weight associated with a set of terms is greater than or equal to the weight associated with any of its subsets.

Popescu et al. [10] use fuzzy measures to induce a weighting $m$ for an annotation from its constituent terms. This weight is extrapolated from the weights of individual terms by using the formula for constructing a Sugeno λ-fuzzy measure. For sets of terms $G_a$, $G_b$ and $G_c$ where $G_c = G_a \cup G_b$ and $G_a \cap G_b = \emptyset$, a λ-fuzzy measure for $G_c$ is

$$m_\lambda(G_c) = m_\lambda(G_a) + m_\lambda(G_b) + \lambda * m_\lambda(G_a) * m_\lambda(G_b)$$

where λ is a value that ensures that $m(G_c) \geq m(G_a)$ and $m(G_c) \geq m(G_b)$. Given that the weights (fuzzy measure densities) $m$ for individual terms $T_i$ in an annotation are known then λ can be determined by solving the following equation:

$$1 + \lambda = \prod_{T_i} (1 + \lambda m(T_i))$$
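One way to solve this equation numerically is by bisection, as in the Python sketch below. It is an illustration rather than the procedure used in [10]: it assumes that the densities have been scaled to lie strictly between 0 and 1 and that the non-trivial root with λ > -1 is wanted. The function names and the example densities are hypothetical.

```python
import math


def solve_lambda(densities):
    """Find lambda > -1 with 1 + lambda = prod(1 + lambda * m_i) by bisection."""
    if any(not 0.0 < m < 1.0 for m in densities):
        raise ValueError("densities are assumed to lie strictly between 0 and 1")
    total = sum(densities)
    if len(densities) < 2 or abs(total - 1.0) < 1e-12:
        return 0.0  # the measure is additive

    def f(lam):
        return math.prod(1.0 + lam * m for m in densities) - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0, -1e-9      # root lies in (-1, 0)
    else:
        lo, hi = 1e-9, 1.0        # root is positive; widen until bracketed
        while f(hi) <= 0:
            hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def lambda_measure(densities, lam):
    """Measure of a set of terms, built up with the Sugeno combination rule."""
    total = 0.0
    for m in densities:
        total = total + m + lam * total * m
    return total


lam = solve_lambda([0.2, 0.3])
print(lam)                              # close to 0.5 / 0.06
print(lambda_measure([0.2, 0.3], lam))  # close to 1.0 for the full set
```

For two densities the equation reduces to a quadratic, so the result can be checked by hand: with 0.2 and 0.3 the non-zero root is 0.5 / 0.06, and the measure of the full set is 1.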

In [10] the weight for each term is based on the $IC_{Corpus}$ measure (eqn. 1). Two annotations, represented by sets of terms $G_1$ and $G_2$ from the same ontology, are compared using the similarity function:

$$S_{FMS}(G_1, G_2) = \frac{m_{G_1}(G_1 \cap G_2) + m_{G_2}(G_1 \cap G_2)}{2}$$

where $m_{G_1}$ and $m_{G_2}$ are the λ-fuzzy measure functions that characterize $G_1$ and $G_2$ respectively. The relatedness of terms is accounted for by augmenting each annotation with the lowest common ancestors for each pair of terms from each annotation. This ensures a non-zero similarity between annotations containing related terms.

However, an ontology models other aspects of relatedness that should be taken into account. Relations between terms in an annotation can be used to identify redundant terms whose relevance to the descriptiveness of an annotation is already accounted for by other terms. For example, if two terms in an annotation are taxonomically related the existence of the parent term is implied by the existence of the child term.

If redundancy of terms is not taken into account it may lead to too many or too few instances being associated with the term. This is especially true when a term is part_of another term. The instances associated with the annotation consist of the parts and not what the instances are part of.

Exploiting Semantics of Ontological Relations

Wang et al. [14] account for the different contributions that terms related by is_a and part_of relations make to the descriptiveness of a term. The semantic contribution that ancestor terms make to a child term is calculated by:

$$SV(T_i) = \sum_{T_j \in T_{anc,i}} s_{T_i}(T_j)$$

where $T_{anc,i}$ denotes the ancestors of term $T_i$ and $s_{T_i}$ is calculated as

$$\begin{cases} s_{T_i}(T_i) = 1 \\ s_{T_i}(T_j) = \max\{ w_e * s_{T_i}(T_k) \mid T_k \in \text{childrenOf}(T_j) \} & \text{if } T_j \neq T_i \end{cases}$$

where $w_e \in [0, 1]$ is a number that corresponds to the semantic contribution factor for edge e. childrenOf($T_x$) is a function that returns the immediate children of $T_x$ that are ancestor terms of $T_i$. In this paper $w_{is\_a} = 0.8$ and $w_{part\_of} = 0.6$. The similarity of two terms is computed by the formula

$$s(T_i, T_j) = \frac{\sum_{T_k \in T_{anc,i} \cap T_{anc,j}} \left( s_{T_i}(T_k) + s_{T_j}(T_k) \right)}{SV(T_i) + SV(T_j)}$$

A term-based approach is taken to measuring the similarity between annotations G1 and G2. The similarities of the most similar pairs of terms from each annotation are averaged over to calculate the similarity between annotations:

$$S_{Wang}(G_1, G_2) = \frac{\sum_{T_i \in G_1} s(T_i, G_2) + \sum_{T_j \in G_2} s(T_j, G_1)}{|G_1| + |G_2|}$$

where $s(T_x, G_y) = \max_{T_y \in G_y}\left(s(T_x, T_y)\right)$ and $|G_y|$ denotes the number of terms in annotation $G_y$.
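The definitions above can be sketched in Python as follows. This is a minimal illustration, not Wang et al.'s implementation: the three-term graph is a hypothetical fragment modelled on the mitochondrial example used later in this article, the `parents` mapping and all function names are invented for the sketch, and it assumes that a term counts as a member of its own ancestor set (so that the value 1 assigned to the term itself enters SV).

```python
# Edge weights stated in the text: is_a = 0.8, part_of = 0.6.
W = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy graph: child -> list of (parent, relation type).
parents = {
    "chromosome": [("nucleoid", "part_of"), ("part", "is_a")],
    "nucleoid": [("part", "is_a")],
}

def s_values(term, parents):
    """Return {t: s_term(t)} for the term and every ancestor t.

    Propagating the best weighted value upwards gives the same result as
    the recursive max over children in the definition.
    """
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent, rel in parents.get(child, []):
            candidate = W[rel] * s[child]
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s

def term_sim(a, b, parents):
    """s(T_i, T_j): shared contributions divided by SV(T_i) + SV(T_j)."""
    sa, sb = s_values(a, parents), s_values(b, parents)
    common = sa.keys() & sb.keys()
    shared = sum(sa[t] + sb[t] for t in common)
    return shared / (sum(sa.values()) + sum(sb.values()))

def wang_annotation_sim(g1, g2, parents):
    """S_Wang(G1, G2): average of best-match term similarities."""
    def best(t, g):
        return max(term_sim(t, u, parents) for u in g)
    total = sum(best(t, g2) for t in g1) + sum(best(t, g1) for t in g2)
    return total / (len(g1) + len(g2))

print(term_sim("chromosome", "nucleoid", parents))
```

Because every comparison is made term by term, the sketch also makes visible why adding a term to an annotation changes the result only through the best-match averages.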

Wang et al. make the observation that the instance based measures of term similarity will produce varying results based on the corpus chosen. They keep a fixed value for the contribution each relation type makes to the descriptiveness of a term. This does not account for the varying influence of terms on each other throughout the ontology even if the graph distance is the same. Exploiting the corpus statistics, if used appropriately, may account for this drawback. As with all term-based methods, where terms from each annotation are compared in a pairwise fashion, it is difficult to see how the monotonicity property is ensured when measuring the similarities between two annotations.

Methods

The Gene Ontology relates terms using is_a and part_of relations. We develop a measure of informativeness that provides a description of an annotation that takes into consideration the relations between terms. We use the informativeness measure of a term (eqn. 1) as the basis for providing a description of an annotation. We define an algorithm called SSA that combines the instances of terms while taking into account how these sets of instances are related by how their associated terms are related in the ontology. This results in a set of instances that can be said to be associated with an annotation and not just a term. We can then extend the concept of instance based semantic similarity of terms, such as Resnik's measure, to annotations.

Interpreting Annotations from Taxonomies

A taxonomy induces a partial ordering on a set of terms by the improper subset relation ⊆. If T i is_a T k and T j is_a T k then the sets of instances associated with T i and T j are both subsets of the set of instances associated with T k . Assuming that we know of all possible instances that can be associated with a term, whatever properties the instances of both T i and T j share can be associated with any of the instances that can be associated with T k . This forms the basis for measuring the commonality between terms used in instance-based measures of similarity between terms.

The difference between terms T i and T j is modeled by the difference between the set of instances associated with each term. If we have two or more terms from a taxonomy in an annotation then it is reasonable to argue that the set of instances associated with an annotation should be the intersection of the set of instances associated with each term. The informativeness of the annotation is then based on the set of instances resulting from this intersection.
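As a minimal sketch of this interpretation, the instance set of an annotation drawn from a taxonomy can be computed as a set intersection. The term names and gene identifiers below are hypothetical, and the function assumes a non-empty list of terms.

```python
# Hypothetical instance sets: T_i and T_j are both is_a children of T_k.
instances = {
    "T_k": {"g1", "g2", "g3", "g4"},
    "T_i": {"g1", "g2"},
    "T_j": {"g2", "g3"},
}

def annotation_instances(terms, instances):
    """Instances associated with an annotation: the intersection of the
    instance sets of its terms (terms must be non-empty)."""
    return set.intersection(*(instances[t] for t in terms))

shared = annotation_instances(["T_i", "T_j"], instances)
print(shared)
```

Adding the ancestor `T_k` to the annotation leaves the result unchanged, since its instance set contains those of both children.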

Interpreting Annotations from Partonomies

The part_of relation between terms denotes the concept that one term is 'part of ' another. It provides an alternative notion of relatedness between terms. An ontology consisting only of part_of relations is known as a partonomy. An example of a simple partonomy is wheel part_of car. It would not make sense to say that a wheel is_a car. The study of partness is complicated by the fact that there are many kinds of part_of relations. Yet the study of partness, known as mereology [24], has shown that there are also common aspects to all types of part_of relations, namely that part_of relations form a partial ordering on the sets of instances associated with each term.

According to the GO Consortium's usage guidelines since 2004 [25], the part_of relation should be interpreted as 'necessarily part of', where T i part_of T j means that all instances of T i are part of one or more instances of T j . The converse is not necessarily true. For example, all nuclei are part of cells but not all cells contain a nucleus. Bittner [26] models such a part_of relation using an improper partial order, i.e. for a term T i with descendant term T j : T j part_of T i ⇒ T j ≤ T i

Annotations consisting of terms such that one term is part_of another should view the child term as being relevant to the annotation while the parent term provides redundant, contextual information. For example, consider an annotation consisting of two terms T i and T j from a partonomy. If T j part_of T i then the annotation should be interpreted as the set of instances of T j . All we can say is that the number of instances of T i associated with the annotation can be no more than the number of instances of T j . In general, an annotation consisting of terms belonging to a partonomy consists of terms that provide the set of instances that can be associated with the annotation while other terms provide the context in which these instances are embedded.

Partial Order Constraints for GO Annotations

Figure 2 shows a subset of the GO consisting of both part_of and is_a relations. According to the taxonomic is_a relations both 'mitochondrial chromosome' and 'mitochondrial nucleoid' are 'mitochondrial part's. A measure of descriptiveness of a term should at least say that both 'mitochondrial chromosome' (a) and 'mitochondrial nucleoid' (b) are more descriptive than 'mitochondrial part' (c), i.e. a, b ≤ c. Likewise, the part_of relation in figure 2 indicates that a part_of b. Here we can see how the part_of relation provides additional indirect information about descriptiveness not represented by the taxonomic relations. If an annotation consists of the terms 'mitochondrial chromosome' and 'mitochondrial nucleoid' then the annotation should be interpreted as the set of instances of 'mitochondrial chromosome'. If the terms 'mitochondrial part' and 'chromosome' are added to the annotation then the same set of instances should be associated with the annotation. All additional terms are already implied by the existence of 'mitochondrial chromosome' in the annotation. If we had either treated the part_of relation as an is_a relation or ignored it then the annotation would have been interpreted as the set of instances that are both 'mitochondrial chromosome' and 'mitochondrial nucleoid'. With this interpretation we would have possibly returned an empty set of instances since chromosomes are not nucleoids.

Figure 2

A Subset of GO Terms and Relations. An example of where the part_of relation plays an important role in interpreting annotations. If an annotation contains the term 'mitochondrial chromosome' then all other terms shown in the graph are redundant. The diagram also shows various cases that describe how terms relate to each other.

The GO consists of many examples similar to the one described above. In general, the GO can be viewed as a taxonomy interspersed with part_of relations. Two terms are said to be directly related if there exists a series of relations on a single path between them. Terms that are not directly related along a path in the graph are indirectly related via a common ancestor. For example there may be other terms that are part_of 'mitochondrial nucleoid' in which case the term 'mitochondrial chromosome' is only related to the other parts by an indirect path of part_of relations. Though not shown, the terms 'mitochondrial nucleoid' and 'chromosome' are only indirectly related via a common ancestor through a number of is_a relations. When interpreting an annotation it is necessary to account for such situations.

In general, as described in table 1, there are nine cases to handle when trying to account for how terms are related. Terms or their taxonomic descendants may be directly related to each other in the ontology via a single path. Alternatively they may be indirectly related to each other via a common ancestor in which case we consider the two paths from the common ancestor to each term. A path may be homogeneous in that it consists of relations of only one type i.e. all relations are either only is_a or only part_of. Such paths are denoted by IS and PART respectively. A path that is inhomogeneous, consisting of both is_a and part_of relations, is denoted by MIXED.

Table 1 Partial Order Constraints

Directly Related Cases

There are three cases to handle when there exists a single path between terms in the ontology: IS, PART and MIXED paths. The first case is the generalized case of taxonomic relations where T i IS T j . For two terms T i and T j , where T j is the parent term and T i is a descendant, and a set of n intermediate terms {Tn} such that:

$$T_i \rightarrow T^n_1 \rightarrow T^n_2 \ldots \rightarrow T^n_{n-1} \rightarrow T^n_n \rightarrow T_j$$

it can be inferred that T i ≤ T j . Where terms are related by a PART path, a similar argument gives the ordering of the two terms.

For the MIXED case there exists a mixture of is_a and part_of relations. The nature of the MIXED relationship is ultimately determined by the part_of relations. For example, if T i MIXED T j then this can be interpreted as T i part_of T j . There may be several is_a relations traversed along a MIXED path from T j to T i before a part_of relation is encountered. This means that T i can only be part_of a subset of the instances of T j . This subset is identified by the set of instances associated with the term (labeled T k in table 1) which is the parent term of the first part_of relation encountered along a MIXED path from T j to T i . This results in the partial order: T i ≤ T k ≤ T j

where T i is the descendant of T j , T j is the parent and T k denotes the first term before a part_of relation is encountered while traversing the MIXED path in the ontology from T j to T i . This form of reasoning can be further extended along the rest of the MIXED path to produce a more detailed partial order. However, if the ultimate goal is only to determine the partial order between T i and T j then extending the reasoning in this way is unnecessary.
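The three direct path types, and the choice of T k on a MIXED path, can be sketched as below. The edge-list representation (bottom-up triples of child, relation, parent) and both function names are assumptions made for illustration; the example path uses the relations described for figure 2.

```python
def classify_path(edges):
    """Classify a single path as IS, PART or MIXED.

    `edges` lists (child, relation, parent) triples from the descendant
    up to the ancestor.
    """
    relations = {rel for _, rel, _ in edges}
    if relations == {"is_a"}:
        return "IS"
    if relations == {"part_of"}:
        return "PART"
    if relations == {"is_a", "part_of"}:
        return "MIXED"
    raise ValueError("path is empty or uses an unknown relation type")

def context_term(edges):
    """Return T_k: the parent of the first part_of edge met when walking
    down from the ancestor, i.e. the last part_of edge in bottom-up order."""
    for _, rel, parent in reversed(edges):
        if rel == "part_of":
            return parent
    return None  # a pure IS path has no T_k

path = [
    ("mitochondrial chromosome", "part_of", "mitochondrial nucleoid"),
    ("mitochondrial nucleoid", "is_a", "mitochondrial part"),
]
print(classify_path(path), context_term(path))
```

Here the path from 'mitochondrial chromosome' to 'mitochondrial part' is MIXED, and 'mitochondrial nucleoid' plays the role of T k .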

Indirectly Related Homogeneous Cases

There are three cases to handle where both the paths to the common ancestor between terms are homogeneous: IS – IS, PART – PART and IS – PART (or PART – IS). In the first case, where T i IS T lca and T j IS T lca , since both terms T i and T j are taxonomic descendants of a lowest common ancestor T lca , it should be expected that the numbers of instances associated with T i and T j are no greater than the number of instances associated with T lca . This results in the partial order T i , T j ≤ T lca

An annotation consisting of two such related terms can be interpreted as the set of instances that are associated with both T i and T j . A similar form of reasoning can be applied to the PART – PART case. The partial order for the final case IS – PART (or PART – IS) can be derived in a similar fashion to the inhomogeneous direct MIXED case. If T i IS T lca and T j PART T lca then it can be inferred that T j PART T i . If an annotation consists of two such terms then it should be interpreted as the set of instances of T j . As a partial order constraint this can be modeled as T j ≤ T i ≤ T lca

Indirectly Related Inhomogeneous Cases

Indirectly related inhomogeneous cases occur when terms are related by a common ancestor in the ontology and one or both of the paths connecting the common ancestor with each term consists of an inhomogeneous set of relation types. There are three such cases to account for: IS – MIXED (or MIXED – IS), PART – MIXED (or MIXED – PART) and MIXED – MIXED.

The partial order for the first case IS – MIXED (or MIXED – IS) can be handled by considering each path separately. The partial order for the T i IS T lca path is T i ≤ T lca . The partial order for the MIXED path is T j ≤ T k ≤ T lca which is derived in the same way as the directly related MIXED case. Combining the two partial orders results in (T j ≤ T k ), T i ≤ T lca

If an annotation consists of two such terms then it should be interpreted as the set of instances of T j that are part of instances that are of type T i and T k .

The PART – MIXED (or MIXED – PART) case requires slightly more reasoning to construct its associated partial order. If T i PART T lca and T j MIXED T lca then it can be inferred that both T i and T j are part of T lca . Because T j is only part of a subset of the instances associated with T lca , namely the instances associated with T k , T i can only be part of the set of instances associated with T k also. This results in the partial order T j , T i ≤ T k ≤ T lca

An annotation consisting of two such related terms should be interpreted as the set of instances of T i and T j that are part of the same instances of T k .

The final case MIXED – MIXED occurs when paths from both terms to the common ancestor consist of a mixture of relation types. The partial order for such a case can be constructed by looking at each path separately. If T i MIXED T lca then the partial ordering is T i ≤ T k ≤ T lca . Similarly for T j MIXED T lca we get T j ≤ T m ≤ T lca . Combining the two partial orders results in (T i ≤ T k ), (T j ≤ T m ) ≤ T lca

If an annotation consists of two such terms then it should be interpreted as the set of instances of T i and T j that are part of the same instances of T k and T m .

The SSA Algorithm

The SSA algorithm is based on the nine cases of term relatedness described above. The SSA algorithm derives the set of instances that can be associated with an annotation from the set of instances associated with that annotation's constituent terms. There are two aspects to the algorithm: identifying which terms are the contextual, redundant instances and which terms' instances can be associated with the annotation. For example, a contextual instance may be 'mitochondrial nucleoid' that provides the context for the set of instances of 'chromosome'. Throughout we denote the set of contextual terms by exclTerms and the set of terms whose instances can be associated with the annotation as inclTerms. numInst(T i ) denotes the number of instances associated with T i .

The above partial order constraints were constructed under the ideal assumptions made by the partial orderings in taxonomies and partonomies. In reality there only ever exists an incomplete set of instances associated with terms and some adjustment of the number of instances is required if the partial order constraints are to be satisfied. Terms that are taxonomically related are guaranteed to satisfy the taxonomic constraints. However, terms that are partonomically related may not satisfy their associated partial order constraints. In these cases some adjustment of the number of instances associated with a term is necessary. For example, if T i PART T j and there are no instances associated with T j in the corpus while there are a number of instances associated with T i then in order to satisfy the PART constraint the number of instances of T j is set equal to the number of instances associated with T i .

The algorithm consists of the following steps:

  • For each distinct ordered pair (T i , T j ) of terms in annotations G1 and G2 respectively

  • Identify the case that corresponds to how T i is related to T j

* Terms are assigned to inclTerms or exclTerms depending on case

* The number of instances associated with a term may be adjusted if the case allows

  • Remove any terms from inclTerms also found in exclTerms

  • Return the sets inclTerms and exclTerms

where an ordered pair of terms (T i , T j ) means that (T i , T j ) ≠ (T j , T i ). In the following sections we identify how each case assigns terms to inclTerms and exclTerms and adjusts the number of instances associated with each term used to compare annotations.
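The outer loop of these steps can be sketched as follows. This is only a skeleton under stated assumptions: the per-case logic is delegated to a hypothetical `handle_pair` callback, pairs of identical terms are skipped (the text does not say how they are treated), and the toy handler knows only the ordering of three terms from figure 2.

```python
from itertools import product

def ssa_skeleton(g1, g2, handle_pair):
    """Collect inclTerms and exclTerms over all ordered pairs of terms."""
    incl_terms, excl_terms = set(), set()
    for t_i, t_j in product(g1, g2):
        if t_i == t_j:
            continue  # assumption: a term is not compared with itself
        new_incl, new_excl = handle_pair(t_i, t_j)
        incl_terms |= new_incl
        excl_terms |= new_excl
    # A term excluded by any comparison cannot remain included.
    incl_terms -= excl_terms
    return incl_terms, excl_terms

# Hypothetical stand-in for the case analysis: (x, y) means x <= y.
below = {
    ("chromosome", "nucleoid"),
    ("nucleoid", "part"),
    ("chromosome", "part"),
}

def toy_handler(t_i, t_j):
    if (t_i, t_j) in below:
        return {t_i}, {t_j}
    if (t_j, t_i) in below:
        return {t_j}, {t_i}
    return set(), set()

G = ["chromosome", "nucleoid", "part"]
print(ssa_skeleton(G, G, toy_handler))
```

With this toy ordering only 'chromosome' survives in the included set, which matches the interpretation given earlier for an annotation containing 'mitochondrial chromosome' together with its context terms.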

Direct Cases

The IS constraint where one term in an annotation is a special case of another term can be implemented as follows:

1 if (T i IS T j )
    inclTerms ← inclTerms ∪ T i
    exclTerms ← exclTerms ∪ T j

In this situation the term T j is viewed as being the common taxonomic ancestor of both terms.

The PART constraint where one term is a part of another term can be implemented as:

2 if (T i PART T j )
    inclTerms ← inclTerms ∪ T i
    exclTerms ← exclTerms ∪ T j
    if (numInst(T j ) < numInst(T i ))
        numInst(T j ) = numInst(T i )

In this situation the term T j is viewed as providing the context that instances of T i are part of.

The case is similar for T i MIXED T j . In these cases we are relating terms that belong to two different lines of taxonomic inheritance where terms have a possibly incomplete set of associated instances. In order to ensure that the partial order constraint associated with this case is implemented correctly if T j has fewer instances associated with it than T i then we adjust the number of instances associated with T j to be equal to the number of instances associated with T i .

The MIXED constraint where T i is a part of another term T j via an intermediate term T k can be implemented similarly to the PART case:

3 if (T i MIXED T j )
    inclTerms ← inclTerms ∪ T i
    exclTerms ← exclTerms ∪ T j
    exclTerms ← exclTerms ∪ T k
    if (numInst(T k ) < numInst(T i ))
        numInst(T k ) = numInst(T i )
    if (numInst(T j ) < numInst(T i ))
        numInst(T j ) = numInst(T i )

In this situation the term T k is viewed as providing the context that instances of T i are part of.
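The three direct cases can be written as one small Python function. This is a sketch of the pseudocode above rather than the authors' code: the function name, the use of a mutable `num_inst` dictionary, and the wheel and car counts are all illustrative assumptions.

```python
def apply_direct_case(case, t_i, t_j, num_inst, incl_terms, excl_terms, t_k=None):
    """Apply direct case IS, PART or MIXED for a descendant t_i of t_j."""
    if case not in ("IS", "PART", "MIXED"):
        raise ValueError(f"unknown direct case: {case}")
    if case == "MIXED" and t_k is None:
        raise ValueError("the MIXED case needs the context term t_k")
    incl_terms.add(t_i)
    excl_terms.add(t_j)
    if case == "IS":
        return  # taxonomic counts already satisfy the constraint
    if case == "MIXED":
        excl_terms.add(t_k)
        # T_k must have at least as many instances as T_i.
        num_inst[t_k] = max(num_inst[t_k], num_inst[t_i])
    # For PART and MIXED the ancestor must also cover T_i.
    num_inst[t_j] = max(num_inst[t_j], num_inst[t_i])

# Hypothetical counts: the corpus has wheels annotated but no cars.
num_inst = {"wheel": 5, "car": 0}
incl_terms, excl_terms = set(), set()
apply_direct_case("PART", "wheel", "car", num_inst, incl_terms, excl_terms)
print(num_inst, incl_terms, excl_terms)
```

Using `max` is equivalent to the guarded assignments in the pseudocode: a count is raised only when it is smaller than the count it must dominate.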

Indirect Homogeneous Cases

In the indirect homogeneous cases compared terms T i and T j are indirectly related via a common ancestor T lca along homogeneous paths. The first such case is where T i IS T lca and T j IS T lca . In this situation the number of instances associated with T lca provides a measure of similarity between T i and T j :

4 if (T i IS T lca & T j IS T lca )
    numInst(T i ), numInst(T j ) ← min(numInst(T i ), numInst(T j ))
    inclTerms ← inclTerms ∪ T j ∪ T i
    exclTerms ← exclTerms ∪ T lca

In the case where T i PART T lca and T j PART T lca , T lca provides the context in which instances of T i and T j are embedded.

5 if (T i PART T lca & T j PART T lca )
    numInst(T i ), numInst(T j ) ← min(numInst(T i ), numInst(T j ))
    inclTerms ← inclTerms ∪ T j ∪ T i
    exclTerms ← exclTerms ∪ T lca
    if (numInst(T lca ) < numInst(T i ))
        numInst(T lca ) = numInst(T i )

Since terms from two different lines of taxonomic inheritance are being compared and the set of instances associated with each term is incomplete an adjustment of the number of instances associated with each term is necessary.

The final homogeneous indirect case occurs when T i PART T lca and T j IS T lca . This is equivalent to T i PART T j since if T i is a part of T lca and T j is a kind of T lca then T i is a part of T j .

6 else if (T i PART T lca & T j IS T lca )
    inclTerms ← inclTerms ∪ T i
    exclTerms ← exclTerms ∪ T j
    exclTerms ← exclTerms ∪ T lca
    if (numInst(T j ) < numInst(T i ))
        numInst(T j ) = numInst(T i )
    if (numInst(T lca ) < numInst(T i ))
        numInst(T lca ) = numInst(T i )

As with other cases the number of instances associated with each term are adjusted to ensure that the partial order constraint associated with the case is satisfied.
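Cases 4 to 6 can be sketched together in Python. The function name, argument order and example counts are assumptions made for illustration; the sketch expects the caller to order the pair so that the PART path comes first in the mixed-type case, as in the pseudocode for case 6.

```python
def apply_indirect_homogeneous(path_i, path_j, t_i, t_j, t_lca,
                               num_inst, incl_terms, excl_terms):
    """Apply cases 4 (IS-IS), 5 (PART-PART) or 6 (PART-IS)."""
    kinds = ("IS", "PART")
    if path_i not in kinds or path_j not in kinds:
        raise ValueError("both paths must be homogeneous (IS or PART)")
    if (path_i, path_j) == ("IS", "PART"):
        raise ValueError("swap the terms so that the PART path comes first")
    excl_terms.add(t_lca)
    if path_i == path_j:
        # Cases 4 and 5: both terms take the smaller of the two counts.
        shared = min(num_inst[t_i], num_inst[t_j])
        num_inst[t_i] = num_inst[t_j] = shared
        incl_terms.update((t_i, t_j))
        if path_i == "PART":
            num_inst[t_lca] = max(num_inst[t_lca], shared)
    else:
        # Case 6: T_i PART T_lca and T_j IS T_lca, so T_i is part of T_j.
        incl_terms.add(t_i)
        excl_terms.add(t_j)
        num_inst[t_j] = max(num_inst[t_j], num_inst[t_i])
        num_inst[t_lca] = max(num_inst[t_lca], num_inst[t_i])

# Hypothetical counts for an IS-IS pair under a common ancestor.
counts = {"a": 3, "b": 7, "lca": 10}
incl_terms, excl_terms = set(), set()
apply_indirect_homogeneous("IS", "IS", "a", "b", "lca",
                           counts, incl_terms, excl_terms)
print(counts, incl_terms, excl_terms)
```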

Indirect Inhomogeneous Cases

In these cases one or both paths from T lca to terms T i and T j contain inhomogeneous types of relations. Throughout this section the term T k is a term in the ontology such that T m MIXED T k and T k IS T n if T n is an ancestor of T m in the ontology.

The first such case occurs where, for two indirectly related terms being compared, T i and T j , there exists a MIXED path from T i to T lca via T k and an IS path from T j to T lca .

7 if (T i MIXED T lca & T j IS T lca )
    inclTerms ← inclTerms ∪ T i
    exclTerms ← exclTerms ∪ T lca
    if (numInst(T k ) < numInst(T i ))
        numInst(T k ) = numInst(T i )
    if (numInst(T lca ) < numInst(T k ))
        numInst(T lca ) = numInst(T k )

Since the relationship between T i and T j cannot be refined further than their relationship via T lca only T lca is assigned to exclTerms.

The second case occurs when T i MIXED T lca via T k and T j PART T lca . Since T j is part of T lca and T i is part of T k which is a kind of T lca then T j is a part of T k .

8 if (T i MIXED T lca & T j PART T lca )
    inclTerms ← inclTerms ∪ T i
    inclTerms ← inclTerms ∪ T j
    exclTerms ← exclTerms ∪ T k
    exclTerms ← exclTerms ∪ T lca
    if (numInst(T k ) < numInst(T i ))
        numInst(T k ) = numInst(T i )
    if (numInst(T k ) < numInst(T j ))
        numInst(T k ) = numInst(T j )
    if (numInst(T lca ) < numInst(T k ))
        numInst(T lca ) = numInst(T k )

The final case occurs when both terms T i and T j are MIXED related to T lca via T k and T m respectively. What is common between both terms T i and T j is that they are both part of T lca . The number of instances associated with each term is adjusted to satisfy the partial order constraints associated with this case.

9 if (T i MIXED T lca & T j MIXED T lca )
    inclTerms ← inclTerms ∪ T i
    inclTerms ← inclTerms ∪ T j
    exclTerms ← exclTerms ∪ T lca
    if (numInst(T k ) < numInst(T i ))
        numInst(T k ) = numInst(T i )
    if (numInst(T m ) < numInst(T j ))
        numInst(T m ) = numInst(T j )
    if (numInst(T lca ) < numInst(T k ))
        numInst(T lca ) = numInst(T k )
    if (numInst(T lca ) < numInst(T m ))
        numInst(T lca ) = numInst(T m )
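Cases 7 to 9 share most of their structure and can be sketched with one function parameterised by the type of the second path. The helper `raise_to`, the function name and the example counts are hypothetical; in all three cases T i is assumed to be the term on the MIXED path that runs through T k .

```python
def raise_to(num_inst, upper, lower):
    """Ensure the count of `upper` is at least the count of `lower`."""
    num_inst[upper] = max(num_inst[upper], num_inst[lower])

def apply_indirect_inhomogeneous(path_j, t_i, t_j, t_lca, t_k,
                                 num_inst, incl_terms, excl_terms, t_m=None):
    """Apply case 7 (path_j IS), 8 (PART) or 9 (MIXED via t_m)."""
    if path_j not in ("IS", "PART", "MIXED"):
        raise ValueError(f"unknown path type: {path_j}")
    if path_j == "MIXED" and t_m is None:
        raise ValueError("case 9 needs the second context term t_m")
    incl_terms.add(t_i)
    excl_terms.add(t_lca)
    raise_to(num_inst, t_k, t_i)
    if path_j == "PART":
        # Case 8: T_j is also part of T_k, which becomes context.
        incl_terms.add(t_j)
        excl_terms.add(t_k)
        raise_to(num_inst, t_k, t_j)
    elif path_j == "MIXED":
        # Case 9: T_j has its own context term T_m below T_lca.
        incl_terms.add(t_j)
        raise_to(num_inst, t_m, t_j)
        raise_to(num_inst, t_lca, t_m)
    raise_to(num_inst, t_lca, t_k)

# Hypothetical counts for case 8.
counts = {"i": 5, "j": 8, "k": 2, "lca": 1}
incl_terms, excl_terms = set(), set()
apply_indirect_inhomogeneous("PART", "i", "j", "lca", "k",
                             counts, incl_terms, excl_terms)
print(counts, incl_terms, excl_terms)
```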

After all terms have been compared with each other it is necessary to remove any terms from inclTerms that are found in exclTerms. This can occur when one comparison assigns a term to inclTerms while another comparison identifies the term as belonging to the excluded set. After all terms are compared each term in inclTerms should have the same number of instances associated with it. The number of instances that are associated with an annotation G is equal to the minimum number of instances that can be associated with any of the terms in the inclTerms set of G.

Finding the Nearest Common Annotation

Just as in semantic similarity of terms, where there is a common ancestor between two terms, there exists a nearest common annotation between two annotations. The concept of a nearest common annotation allows the extension of information based semantic similarity measures of terms, such as Resnik's and Lin's measures, to information based measures of semantic similarity of annotations.

We define the nearest common annotation (NCA) between two annotations G1 and G2 to be the annotation containing terms related to both annotations. The NCA should have the minimum possible number of instances associated with it such that either G1 or G2 can be derived from it. The set of terms exclTerms which results from applying SSA to two annotations G1 and G2 will return the set of terms associated with the NCA.

Measuring Similarity

By introducing the notion of nearest common annotation we can naturally extend Resnik's measure to measuring similarity of annotation. The LCA between two terms is replaced with the NCA of two annotations G1 and G2. Likewise, instead of applying IC Corpus (eqn. 1) to instances associated with a term we apply IC Corpus to instances of an annotation. Thus the extension of Resnik's measure from terms to annotations G1 and G2, SSA Resnik , becomes:

$$\begin{aligned} \text{exclTerms} &\leftarrow SSA(G_1, G_2) \\ SSA_{Resnik}(G_1, G_2) &= -\log\left( \frac{\min_{T_i \in \text{exclTerms}} \text{numInst}(T_i)}{\text{maxNumInst}} \right) \end{aligned}$$

where maxNumInst is the number of distinct instances in the corpus.
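The formula can be evaluated with a few lines of Python. The function name and the small dictionary are illustrative assumptions; the two counts, 19 and 5554, are the ones reported in the worked example of this article, and the natural logarithm is used as in that example.

```python
import math

def ssa_resnik(excl_terms, num_inst, max_num_inst):
    """-log of the smallest instance count among the NCA terms,
    taken relative to the number of instances in the corpus."""
    smallest = min(num_inst[t] for t in excl_terms)
    return -math.log(smallest / max_num_inst)

# Illustrative counts: only the most specific NCA term sets the minimum.
num_inst = {"all": 5554, "GO:0043094": 19}
value = ssa_resnik({"all", "GO:0043094"}, num_inst, 5554)
print(round(value, 3))
```

The upper bound of the measure for a corpus follows from the same function with a minimum count of 1.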

Lin's measure may be extended as follows:

$$\begin{aligned} \text{inclTerms}_1 &\leftarrow SSA(G_1, G_1) \\ \text{inclTerms}_2 &\leftarrow SSA(G_2, G_2) \\ icG_1 &\leftarrow -\log\left( \frac{\min_{T_i \in \text{inclTerms}_1} \text{numInst}(T_i)}{\text{maxNumInst}} \right) \\ icG_2 &\leftarrow -\log\left( \frac{\min_{T_j \in \text{inclTerms}_2} \text{numInst}(T_j)}{\text{maxNumInst}} \right) \\ SSA_{Lin}(G_1, G_2) &= \frac{2 * SSA_{Resnik}(G_1, G_2)}{icG_1 + icG_2} \end{aligned}$$

In this case the SSA algorithm is used to find the non redundant terms that can be associated with an annotation.
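A minimal Python sketch of the Lin extension follows. It assumes the three term sets (the NCA terms and the included terms of each annotation) have already been produced by SSA; the function names and all counts are hypothetical, and the sketch does not guard against both annotations having zero information content.

```python
import math

def info_content(terms, num_inst, max_num_inst):
    """-log of the smallest instance count among the given terms."""
    return -math.log(min(num_inst[t] for t in terms) / max_num_inst)

def ssa_lin(excl_terms, incl_terms_1, incl_terms_2, num_inst, max_num_inst):
    """Information shared via the NCA, normalised by the information
    content of the two annotations."""
    shared = info_content(excl_terms, num_inst, max_num_inst)
    ic_g1 = info_content(incl_terms_1, num_inst, max_num_inst)
    ic_g2 = info_content(incl_terms_2, num_inst, max_num_inst)
    return 2 * shared / (ic_g1 + ic_g2)

# Hypothetical counts in a corpus of 1000 instances.
num_inst = {"nca": 100, "a": 10, "b": 10}
score = ssa_lin({"nca"}, {"a"}, {"b"}, num_inst, 1000)
print(score)
```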

Example

We illustrate the measure by comparing the annotations of two gene products that return a high similarity under SSA Resnik . Two gene products, AAH1 and FUR1, whose annotations (listed in table 2) were taken from the SGD database [27], were compared, producing a similarity value of 5.678. The number of instances associated with each term was obtained from the GOA [28] S. cerevisiae table of GO assignments.

Table 2 Example Annotations and Their Descriptions

FUR1's annotation consists of six terms: {GO:0004845, GO:0005622, GO:0008655, GO:0009116, GO:0016740, GO:0016757}. Each term's description is found in table 2. Likewise, AAH1's annotation consists of twelve terms: {GO:0000034, GO:0004000, GO:0005634, GO:0005737, GO:0006146, GO:0009117, GO:0009168, GO:0016787, GO:0019239, GO:0042254, GO:0043101, GO:0043103}. The NCA is constructed by applying the SSA algorithm to identify the set of contextual terms common to both annotations. Terms such as the root term 'all' are immediately added to exclTerms. The term 'cellular component' (GO:0005575) is added to exclTerms since another term 'cell part' is is_a related to it. The term 'nucleobase metabolic process' (GO:0009112) is a more specific type of 'nucleobase, nucleoside and nucleotide metabolic process' (GO:0055086) and the terms are added to inclTerms and exclTerms respectively. Similar assignments occur for 'nucleobase metabolic process' (GO:0009112)/'cellular metabolic process' (GO:0044237), 'nucleobase metabolic process' (GO:0009112)/'cellular process' (GO:0009987) as well as other terms.

The SSA algorithm returns nine contextual terms, {'all' (all), 'cellular process' (GO:0009987), 'cellular metabolic process' (GO:0044237), 'nucleobase metabolic process' (GO:0009112), 'nucleobase, nucleoside, nucleotide and nucleic acid metabolic process' (GO:0006139), 'nucleobase, nucleoside and nucleotide metabolic process' (GO:0055086), 'cell part' (GO:0044464), 'intracellular' (GO:0005622), 'catalytic activity' (GO:0003824), 'metabolic compound salvage' (GO:0043094)}. The resulting annotation contains terms from all three ontologies in the GO. There are 19 instances associated with the annotation. The number of instances is determined by the most specific term: 'metabolic compound salvage' (GO:0043094). The total number of instances in the corpus is 5554, so $SSA_{Resnik} = -\log\left(\frac{19}{5554}\right) \approx 5.678$. Since the highest value that SSA Resnik could return for the chosen corpus is ~8.622, obtained by taking the negative natural log of $\frac{1}{5554}$, 5.678 corresponds to a high degree of similarity.

Results

To validate our approach, the discriminatory power of our method to identify clusters of related gene products was compared against Wang's measure of annotation similarity, which also exploits the differences between types of relations. The average similarity of gene products found in the same biochemical pathway in the SGD database was compared against the average similarity of the same gene products with gene products found in other pathways. A large difference between these two values indicates the effectiveness of a similarity measure in discovering new pathways in a set of gene products. Average similarity of annotations inside and outside pathways was measured under four conditions: all terms; cellular component terms only; biological process terms only; and molecular function terms only.
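The comparison described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the gene identifiers and annotations are invented, and a Jaccard overlap stands in as a placeholder for SSA Resnik, Max Resnik or Wang's measure. Any function taking two gene products and returning a number could be passed as `sim`.

```python
from itertools import combinations
from statistics import mean

# Invented toy annotations (gene product -> set of term labels).
ANNOTATIONS = {
    "g1": {"a", "b"},
    "g2": {"a", "b"},
    "g3": {"a", "c"},
    "g4": {"d"},
    "g5": {"c", "d"},
}


def jaccard(gene_a, gene_b):
    """Placeholder similarity: overlap of the two annotation sets."""
    terms_a, terms_b = ANNOTATIONS[gene_a], ANNOTATIONS[gene_b]
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def inside_outside_average(pathway, others, sim):
    """Average similarity within a pathway and between the pathway
    and gene products outside it. Needs at least two pathway members
    and at least one outside gene product."""
    inside = [sim(a, b) for a, b in combinations(pathway, 2)]
    outside = [sim(a, b) for a in pathway for b in others]
    return mean(inside), mean(outside)


inside_avg, outside_avg = inside_outside_average(
    ["g1", "g2", "g3"], ["g4", "g5"], jaccard
)
print(inside_avg, outside_avg, inside_avg - outside_avg)
```

A larger gap between the two averages corresponds to the greater discriminatory power discussed in the text.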

A better test would be to take the average similarity of a set of gene products found in the same pathway and compare it with the average, or the maximum, of the average similarities of all other similarly sized sets of gene products. This is intractable, however: the computational complexity of such a test is O(n!), since there are $\binom{N}{n}$ ways of creating a set of size n from a set of N elements.
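To give a sense of scale, the number of candidate sets can be computed directly. The sketch below uses Python's `math.comb`; the helper name `n_candidate_sets` is hypothetical, and taking N = 5554 (the corpus size quoted earlier) with a pathway of 10 gene products is an assumption made purely for illustration.

```python
import math


def n_candidate_sets(N, n):
    """Number of distinct sets of size n drawn from N elements."""
    return math.comb(N, n)


# Small case that can be checked by hand.
print(n_candidate_sets(20, 5))  # 15504

# Illustrative large case: assumed N = 5554 gene products, pathway size 10.
large = n_candidate_sets(5554, 10)
print(len(str(large)), "decimal digits")
```

Even for a modest pathway size the count is far beyond what exhaustive comparison could cover, which is why the simpler inside/outside comparison is used.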

Figure 3 shows the results of a comparison of SSA Resnik with Wang's method and Max Resnik on measuring the average annotation similarity, using all terms, of gene products inside and outside a pathway [data for figures 3 to 21 are found in Additional file 1]. The first 35 pathways are insufficiently annotated to produce meaningful results. Similarity values for SSA Resnik and Max Resnik were normalized to allow for direct comparison between similarity values. All measures behave similarly: the similarity values returned by Wang's method tend to increase as values returned by SSA Resnik increase. All measures tend to settle to an average similarity value when genes inside and outside a pathway are compared. Wang's method returns a higher value on average, with values ranging between 0.5 and 0.6 as internal gene similarity increases. SSA Resnik and Max Resnik return values between 0.3 and 0.4 for the average similarity of genes inside a pathway with genes outside a pathway as the similarity of genes within a pathway increases. If pathways are identified by the difference between the average similarity of gene products inside and outside a cluster, then SSA Resnik and Max Resnik have greater discriminatory power. SSA Resnik and Max Resnik behave identically for most pathways when all terms are considered.

Figure 3

Normalized SSA Resnik vs Wang's Method vs Normalized Max Resnik. Values shown correspond to the average annotation similarity values between gene products with other gene products in the same pathway (taken from the SGD biochemical pathways database) and between gene products in a pathway with other gene products not found in the pathway.

Figure 4

Average Pathway Similarity Values of Annotations Consisting only of Cellular Component Terms Using SSA Resnik. Average of SSA Resnik similarity values of gene products inside and outside a pathway.

Figure 5

Average Pathway Similarity Values of Annotations Consisting only of Cellular Component Terms Using Max Resnik. Average of Max Resnik similarity values of gene products inside and outside a pathway.

Figure 6

Average Pathway Similarity Values of Annotations Consisting only of Cellular Component Terms Using Wang's Method. Average of Wang's measure of similarity of gene products inside and outside a pathway.

Figure 7

Average Pathway Similarity Values of Annotations Consisting only of Biological Process Terms Using SSA Resnik. Average of SSA Resnik similarity values of gene products inside and outside a pathway.

Figure 8

Average Pathway Similarity Values of Annotations Consisting only of Biological Process Terms Using Max Resnik. Average of Max Resnik similarity values of gene products inside and outside a pathway.

Figure 9

Average Pathway Similarity Values of Annotations Consisting only of Biological Process Terms Using Wang's Method. Average of Wang's measure of similarity of gene products inside and outside a pathway.

Figure 10

Average Pathway Similarity Values of Annotations Consisting only of Molecular Function Terms Using SSA Resnik. Average of SSA Resnik similarity values of gene products inside and outside a pathway.

Figure 11

Average Pathway Similarity Values of Annotations Consisting only of Molecular Function Terms Using Max Resnik. Average of Max Resnik similarity values of gene products inside and outside a pathway.

Figure 12

Average Pathway Similarity Values of Annotations Consisting only of Molecular Function Terms Using Wang's Method. Average of Wang's measure of similarity of gene products inside and outside a pathway.

Figure 13

Standard Deviation of Pathway Similarity Values of Annotations Consisting only of Cellular Component Terms Using SSA Resnik. Standard deviation of SSA Resnik similarity values of gene products inside and outside a pathway.

Figure 14

Standard Deviation of Pathway Similarity Values of Annotations Consisting only of Cellular Component Terms Using Max Resnik. Standard deviation of Max Resnik similarity values of gene products inside and outside a pathway.

Figure 15

Standard Deviation of Pathway Similarity Values of Annotations Consisting only of Cellular Component Terms Using Wang's Method. Standard deviation of values of Wang's measure of similarity of gene products inside and outside a pathway.

Figure 16

Standard Deviation of Pathway Similarity Values of Annotations Consisting only of Biological Process Terms Using SSA Resnik. Standard deviation of SSA Resnik similarity values of gene products inside and outside a pathway.

Figure 17

Standard Deviation of Pathway Similarity Values of Annotations Consisting only of Biological Process Terms Using Max Resnik. Standard deviation of Max Resnik similarity values of gene products inside and outside a pathway.

Figure 18

Standard Deviation of Pathway Similarity Values of Annotations Consisting only of Biological Process Terms Using Wang's Method. Standard deviation of values of Wang's measure of similarity of gene products inside and outside a pathway.

Figure 19

Standard Deviation of Pathway Similarity Values of Annotations Consisting only of Molecular Function Terms Using SSA Resnik. Standard deviation of SSA Resnik similarity values of gene products inside and outside a pathway.

Figure 20

Standard Deviation of Pathway Similarity Values of Annotations Consisting only of Molecular Function Terms Using Max Resnik. Standard deviation of Max Resnik similarity values of gene products inside and outside a pathway.

Figure 21

Standard Deviation of Pathway Similarity Values of Annotations Consisting only of Molecular Function Terms Using Wang's Method. Standard deviation of values of Wang's measure of similarity of gene products inside and outside a pathway.

As shown in figures 4, 5 and 6, when only terms from the cellular component sub-ontology are used, the difference between SSA Resnik and Max Resnik becomes clear. Max Resnik returns a very high average similarity value between annotations inside and outside a pathway. This may be an artifact of the low number of instances associated with cellular component terms. However, when SSA is applied, the average similarity values between annotations inside and outside pathways remain consistently low. SSA Resnik returns a comparatively high average similarity value for annotations inside pathways for approximately half the cases to which it can reasonably be applied. Wang's method behaves similarly to Max Resnik in this situation.
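For readers comparing the two Resnik variants, the following Python sketch shows one common way a max-aggregated Resnik score is computed. Max Resnik is not defined in this section, so the sketch assumes the usual convention: the similarity of two terms is the information content of their most informative common ancestor, and the similarity of two annotations is the maximum over all term pairs. The toy ontology, the information content values and the function names are invented for illustration.

```python
# Invented toy ontology: term -> set of all its proper ancestors.
ANCESTORS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A", "root"},
    "A2": {"A", "root"},
}

# Invented information content values (higher means more specific).
IC = {"root": 0.0, "A": 1.2, "B": 0.9, "A1": 3.0, "A2": 2.5}


def resnik(term_a, term_b, ancestors, ic):
    """Information content of the most informative common ancestor,
    where a term counts as an ancestor of itself."""
    common = (ancestors[term_a] | {term_a}) & (ancestors[term_b] | {term_b})
    return max((ic[t] for t in common), default=0.0)


def max_resnik(annotation_a, annotation_b, ancestors, ic):
    """Assumed max aggregation: best-scoring pair of terms across
    the two (non-empty) annotations."""
    return max(
        resnik(a, b, ancestors, ic)
        for a in annotation_a
        for b in annotation_b
    )


print(max_resnik({"A1", "B"}, {"A2"}, ANCESTORS, IC))  # 1.2
print(max_resnik({"A1"}, {"A1", "B"}, ANCESTORS, IC))  # 3.0
```

Because a single well-matched pair of terms determines the score, max aggregation can report high similarity between annotations that otherwise share little, which is consistent with the high outside-pathway averages reported here for Max Resnik.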

As shown in figures 7, 8 and 9, if only biological process terms are used, further dissimilarity between Max Resnik and SSA Resnik can be observed. The average similarity of annotations inside a pathway with annotations outside a pathway is much higher for Max Resnik than for SSA Resnik. Wang's method and SSA Resnik behave similarly. For all methods, similarity values of annotations inside a pathway remain consistently higher than when the same annotations are compared with annotations outside the pathway.

The source of the similarity between SSA Resnik and Max Resnik can be identified when only molecular function terms are used, as shown in figures 10 and 11. In this case both methods behave exactly the same, since there are no part_of relations to exploit when comparing terms. Wang's method, shown in figure 12, returns a consistently high average similarity value for annotations inside a pathway compared with annotations outside a pathway.

Further discriminatory power can be achieved by considering the standard deviation of similarity values inside and outside a pathway. A set of gene products paired with other gene products in the same pathway tends to have a high standard deviation of similarity values over all pairs, mainly due to the small number of pairs being compared. Conversely, pairing gene products inside a pathway with those found outside the pathway should produce a set of similarity values with a lower standard deviation, since annotations are expected to be dissimilar and values come from a larger set.
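A minimal Python sketch of this comparison follows. The toy annotations and the Jaccard placeholder similarity are invented, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from itertools import combinations
from statistics import pstdev

# Invented toy annotations (gene product -> set of term labels).
ANNOTATIONS = {
    "g1": {"a", "b"},
    "g2": {"a", "b"},
    "g3": {"a", "c"},
    "g4": {"d"},
    "g5": {"c", "d"},
}


def jaccard(gene_a, gene_b):
    """Placeholder similarity; any annotation similarity could be used."""
    terms_a, terms_b = ANNOTATIONS[gene_a], ANNOTATIONS[gene_b]
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def inside_outside_stdev(pathway, others, sim):
    """Population standard deviation of similarity values for pairs
    inside a pathway and for pathway/non-pathway pairs."""
    inside = [sim(a, b) for a, b in combinations(pathway, 2)]
    outside = [sim(a, b) for a in pathway for b in others]
    return pstdev(inside), pstdev(outside)


sd_inside, sd_outside = inside_outside_stdev(
    ["g1", "g2", "g3"], ["g4", "g5"], jaccard
)
print(sd_inside, sd_outside)
```

In this toy case the three inside pairs vary more than the six mostly dissimilar outside pairs, matching the pattern described in the paragraph above.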

Figures 13, 14 and 15 show the standard deviation of similarity values of annotations consisting of cellular component terms. Max Resnik returns a low internal standard deviation while reporting a consistently high standard deviation of similarity values when annotations inside a pathway are compared with annotations outside a pathway. The standard deviation of annotation similarity values between different pathways returned by both SSA Resnik and Wang's method is consistently low. For annotations consisting only of biological process terms, the standard deviations of all methods behave similarly as the average similarity of annotations within pathways increases, as shown in figures 16, 17 and 18. The same is also true of annotations consisting of molecular function terms, as shown in figures 19, 20 and 21.

Discussion and conclusion

The SSA algorithm provides the basis of a framework for extending instance based measures of term similarity to annotations. The algorithm's construction is based on the set of cases for how terms are related to each other when the ontology consists only of is_a and part_of relations. Because the set of instances associated with a term is incomplete, it is necessary to adjust the number of instances associated with a term in order to satisfy the partial order constraints of each case fully. As the number of annotations of gene products increases and ontological terms are applied more consistently, it may be possible to satisfy the constraints without such adjustment. Alternatively, the partial order constraints can be used to develop a similarity method which is less dependent on the set of instances associated with terms.

When terms from all three sub-ontologies (CC, BP and MF) are used, Max Resnik and SSA Resnik return equivalent annotation similarities for proteins found in the SGD database. This is due to the high degree of specificity of molecular function terms, which are not related partonomically and therefore cause the two measures to return the same values. When only cellular component and biological process terms are used, the experimental evidence indicates that SSA Resnik becomes a better identifier of proteins belonging to pathways. SSA Resnik may identify new gene products that belong to pathways but have a different molecular function from those proteins already identified as belonging to the pathway. Molecular function terms play only a small role in identifying new pathway proteins, since proteins tend to have different molecular functions inside pathways.

By finding the set of instances that can be associated with an annotation it is possible to preserve, at the annotation level, the properties of instance based methods used to measure the similarity of terms. For two given annotations, the nearest common annotation (NCA) is a minimal set of terms such that either annotation could be derived from it. The SSA algorithm provides a method for finding the set of terms associated with the NCA.

By combining the SSA algorithm with Resnik's measure and the concept of nearest common annotation, we have developed a measure that provides good discriminatory power to identify possible pathways and other functional groups from gene product annotations. More generally, the set of cases and their associated constraints further extend the set of principles on which a reasonable measure of annotation similarity should be built.

References

  1. Lord P, Stevens R, Brass A, Goble CA: Semantic Similarity Measures as Tools for Exploring the Gene Ontology. Pacific Symposium on Biocomputing 2003, 8: 601–612.

  2. Lord P, Stevens R, Brass A, Goble C: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19(10):1275–1283.

  3. Sevilla J, Segura V, Podhorski A, Guruceaga JE Mato, Martinez-Cruz L, Corrales F, Rubio A: Correlation between gene expression and GO semantic similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2005, 2(4):330–338.

  4. Couto FM, Silva MJ, Coutinho PM: Measuring semantic similarity between Gene Ontology terms, Data and Knowledge Engineering. Business Process Management – Where business processes and web services meet 2007, 61: 137–152.

  5. Lei Z, Dai Y: Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction. BMC Bioinformatics 2006, 7: 491.

  6. Goodman N: Seven strictures on similarity. In Problems and Projects. Edited by: Goodman N. New York: Bobbs-Merrill; 1972:437–447.

  7. Arrell D: What Goodman Should Have Said about Representation. The Journal of Aesthetics and Art Criticism Autumn 1987, 46: 41–49.

  8. Tversky A: Features of Similarity. Psychological Rev 1977, 84: 327–352.

  9. Lin D: An Information-Theoretic Definition of Similarity. In Fifteenth International Conference on Machine Learning (ICML'98). Madison, WI: Morgan-Kaufmann; 1998.

  10. Popescu M, Keller J, Mitchell J: Fuzzy Measures on the Gene Ontology for Gene Product Similarity. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2006, 3(3):263–274.

  11. Cross V: Tversky's Parameterized Similarity Ratio Model: A Basis for Semantic Relatedness. Annual Meeting of the North American Fuzzy Information Processing Society (NAFIPS 2006), 3–6 June 2006:541–546.

  12. Torsello A, Hidovic D, Pelillo M: Four Metrics for Efficiently Comparing Attributed Trees. Proc of 17th International Conference on Pattern Recognition 2004, 2: 467–470.

  13. Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 2006, 22(8):967–973.

  14. Wang JZZ, Du Z, Payattakool R, Yu PSS, Chen CFF: A New Method to Measure the Semantic Similarity of GO Terms. Bioinformatics 2007.

  15. Pesquita C, Faria D, Bastos H, Falcao A, Couto F: Evaluating GO-based Semantic Similarity Measures. BioOntologies SIG at ISMB/ECCB – 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) 2007.

  16. Veale N, Seco JHT: An Intrinsic Information Content Metric for Semantic Similarity in WordNet. ECAI 2004 2004, 1089–1090.

  17. Schlicker A, Albrecht M: FunSimMat: a comprehensive functional similarity database. Nucl Acids Res 2007. gkm806+

  18. Rada R, Mili H, Bicknell E, Bletner M: Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man, and Cybernetics 1989, 19: 17–30.

  19. Lee JH, Kim MH, Lee YJ: Information Retrieval Based on Conceptual Distance in IS-A Hierarchies. Journal of Documentation 1993, 49: 188–207.

  20. Wu Z, Palmer M: Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics. New Mexico State University, Las Cruces, New Mexico; 1994:133–138.

  21. Resnik P: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of IJCAI-95 1995.

  22. Resnik P: Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research 1999, 11: 95–130.

  23. Jiang J, Conrath D: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Proc Int'l Conf Research in Computational Linguistics, ROCLING X 1997.

  24. Simons P: Parts: a study in ontology. Oxford: Clarendon Press; 1987.

  25. Gene Ontology Consortium: GO Editorial Style Guide. 2004. [http://www.geneontology.org/GO.usage.html]

  26. Bittner T: Axioms for parthood and containment relations in bio-ontologies. Unknown 2004.

  27. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database. Nucleic Acids Res 1998, 26: 73–79.

  28. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Research 2004, (32 Database):D262-D266.

Acknowledgements

This work has been supported by Microsoft Research Cambridge and the Irish Research Council for Science, Engineering and Technology.

Author information

Corresponding author

Correspondence to Brendan Sheehan.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

BS proposed, designed and implemented the algorithm and table of constraints. BS wrote the manuscript. AQ and BG supervised and approved the production of this paper. SD contributed helpful suggestions for the final manuscript.

Electronic supplementary material

12859_2008_2453_MOESM1_ESM.xls

Additional file 1: Averages and Standard Deviations of Similarity Values. Averages and standard deviations of similarity values of Max Resnik, SSA Resnik and Wang's method for each pathway in SGD. (XLS 474 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Sheehan, B., Quigley, A., Gaudin, B. et al. A relation based measure of semantic similarity for Gene Ontology annotations. BMC Bioinformatics 9, 468 (2008). https://0-doi-org.brum.beds.ac.uk/10.1186/1471-2105-9-468

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/1471-2105-9-468

Keywords