**4. Semantic similarity**

A formal representation of knowledge gives computers the ability to manipulate concepts that are otherwise difficult to represent, in a way that preserves their "semantics". Ontologies provide the appropriate support for this kind of automatic manipulation of information. In this context, semantic similarity is a technique that assigns a numeric value to a pair of concepts based on the similarity of their meaning, as extracted from the ontology.

For example, there is no immediately obvious way to compare two roles. However, considering the illustration in **Figure 3**, it is possible to intuitively understand that, because both "hallucinogen" and "antifungal drug" are examples of "drugs", they are more similar than "hallucinogen" and "fossil fuel". Semantic similarity makes use of the meaning of the concepts, implicitly represented in the ontologies through the relations between the concepts. Ontologies function as a proxy for that meaning and enable its manipulation and, ultimately, its comparison.

Several formulas and ideas to compute semantic similarity have been proposed, implemented and tested in the past. A full exposition of such measures and algorithms is beyond the scope of this chapter, and the reader is encouraged to expand on this topic by reading works such as [22–25]. Instead, the following is an abridged account of how ontology-based semantic similarity has been computed. In this discussion, consider the ontology in **Figure 3**.

**Figure 3.** *A second toy example of an ontology representing chemical roles, also based on ChEBI.*

Measures of similarity based on ontologies can roughly be divided into edge-based and node-based. An example of an edge-based measure is counting how many relations must be traversed to connect the two concepts being compared. Rada et al. [26] define distance as the number of edges in the smallest path between two nodes composed only of "is-a" relations. In this case, the distance between "hallucinogen" and "antimicrobial agent" would be three ("hallucinogen"→"drug"→"antifungal drug"→"antimicrobial agent"). While this type of approach is intuitive, it assumes that all nodes and all edges are equally important in terms of their semantics (e.g., that all edges weigh the same), which is generally not true in life science ontologies. For instance, the "is-a" relation between "hallucinogen" and "drug" does not necessarily convey the same *amount of information* as the inverse "is-a" relation between "drug" and "antifungal drug".
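As a concrete sketch, the edge-counting distance can be implemented as a breadth-first search over the "is-a" hierarchy. The hierarchy below is a hypothetical approximation of the toy ontology in **Figure 3**; since the figure is not reproduced here, its exact structure is an assumption:

```python
from collections import deque

# Hypothetical "is-a" links approximating the toy ontology of Figure 3;
# the exact structure is an assumption made for illustration purposes.
IS_A = {
    "hallucinogen": ["drug"],
    "antifungal drug": ["drug", "antimicrobial agent"],
    "antiviral agent": ["antimicrobial agent"],
    "drug": ["application"],
    "fossil fuel": ["application"],
    "antimicrobial agent": ["biological role"],
    "application": ["role"],
    "biological role": ["role"],
}

def edge_distance(c1, c2):
    """Rada et al. [26]: number of edges in the smallest path between
    two concepts, traversing "is-a" links in either direction."""
    # Build an undirected adjacency list from the "is-a" links.
    adjacency = {}
    for child, parents in IS_A.items():
        for parent in parents:
            adjacency.setdefault(child, set()).add(parent)
            adjacency.setdefault(parent, set()).add(child)
    # Breadth-first search guarantees the shortest path is found first.
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        concept, distance = queue.popleft()
        if concept == c2:
            return distance
        for neighbour in adjacency[concept]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # the concepts are not connected

print(edge_distance("hallucinogen", "antimicrobial agent"))  # 3
```

Treating the edges as undirected reflects the fact that the path in the example above goes up the hierarchy once and down it twice.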

One way to solve this is to introduce node-based methods, which weigh nodes based on their *information content* (IC) [27]. The IC of a node is a numeric value that reflects how informative its presence is, and it is calculated from the node's frequency of use, since concepts that appear more frequently are generally less informative. The first formula proposed to measure IC was

$$\text{IC}(c) = -\log f(c) \tag{1}$$

where *f*(*c*) is the relative frequency with which the concept *c* and all its descendants appear in a corpus (in the example ontology, we can use the fraction of chemical concepts in ChEBI annotated as performing each of those roles). The intuition behind this idea is the following: consider a document (e.g., a scientific article) that uses the sentence "rodents have fur". The term "rodent" is used in such a way that every other concept that can be categorised under it also possesses the declared property. In fact, whenever a concept is used (in text, in logical axioms, etc.), it must be interpreted as including the set of all concepts recursively categorised under it.
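Equation (1) can be sketched as follows; the annotation counts below are invented for illustration and stand in for how often each role (descendants included) annotates ChEBI chemicals:

```python
import math

# Invented annotation counts, already propagated upwards: a concept's
# count includes the uses of all of its descendants.
COUNTS = {
    "role": 1000,  # the root is implied by every annotation
    "application": 600,
    "biological role": 500,
    "drug": 400,
    "antimicrobial agent": 300,
    "antifungal drug": 150,
    "antiviral agent": 120,
    "fossil fuel": 80,
    "hallucinogen": 40,
}
TOTAL = COUNTS["role"]

def ic(concept):
    """Equation (1): IC(c) = -log f(c), where f(c) is the relative
    frequency of the concept (descendants included) in the corpus."""
    return -math.log(COUNTS[concept] / TOTAL)

print(ic("drug"))          # ≈ 0.916
print(ic("hallucinogen"))  # ≈ 3.219 (rarer, hence more informative)
```

Note that the root has an IC of zero: a concept that appears in every annotation carries no information.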

The similarity between two concepts can be computed as the IC of the *most informative common ancestor* (usually abbreviated as MICA) between them

$$\text{sim}_{\text{Resnik}}(c_1, c_2) = \text{IC}(\text{MICA}(c_1, c_2)). \tag{2}$$
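A minimal sketch of Equation (2); both the hierarchy and the annotation counts below are invented approximations of the toy ontology of **Figure 3**:

```python
import math

# Hypothetical "is-a" hierarchy and invented annotation counts,
# approximating the toy ontology of Figure 3.
IS_A = {
    "hallucinogen": ["drug"],
    "antifungal drug": ["drug", "antimicrobial agent"],
    "antiviral agent": ["antimicrobial agent"],
    "drug": ["application"],
    "fossil fuel": ["application"],
    "antimicrobial agent": ["biological role"],
    "application": ["role"],
    "biological role": ["role"],
}
COUNTS = {
    "role": 1000, "application": 600, "biological role": 500,
    "drug": 400, "antimicrobial agent": 300, "antifungal drug": 150,
    "antiviral agent": 120, "fossil fuel": 80, "hallucinogen": 40,
}

def ancestors(concept):
    """The concept itself plus everything reachable via "is-a" links."""
    result = {concept}
    for parent in IS_A.get(concept, ()):
        result |= ancestors(parent)
    return result

def ic(concept):
    return -math.log(COUNTS[concept] / COUNTS["role"])

def sim_resnik(c1, c2):
    """Equation (2): the IC of the most informative common ancestor."""
    return max(ic(a) for a in ancestors(c1) & ancestors(c2))

print(sim_resnik("hallucinogen", "antifungal drug"))  # IC("drug") ≈ 0.916
print(sim_resnik("hallucinogen", "fossil fuel"))      # IC("application") ≈ 0.511
```

As expected from the intuition described earlier, "hallucinogen" and "antifungal drug" (sharing "drug" as an ancestor) score higher than "hallucinogen" and "fossil fuel".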


*Semantic Similarity in Cheminformatics DOI: http://dx.doi.org/10.5772/intechopen.89032*

This idea has been iterated upon with some additions and adaptations.

• The IC measure can be normalised so that it ranges from 0.0 to 1.0 (originally, the measure is unbounded above).

• The IC measure has been computed from multiple sources, such as (*i*) text corpora (as in the original), (*ii*) the frequency of usage of the ontology concepts in external sources [28], or (*iii*) the ontology itself, where frequency can be computed from the number of descendants (direct or indirect) of a concept [29], the number of leaf descendants of a concept [30], or other topological properties of the graph representation of the ontology [31].

• The semantic similarity measure itself can be normalised. Notice that the original measure gives the same similarity to the pair "application"/"biological role" (both generic concepts) and the pair "fossil fuel"/"antiviral agent", which goes against the intuition that the first pair should be more similar. Lin [32] uses this idea to define

$$\text{sim}_{\text{Lin}}(c_1, c_2) = \frac{2 \cdot \text{IC}(\text{MICA}(c_1, c_2))}{\text{IC}(c_1) + \text{IC}(c_2)}; \tag{3}$$

• The notion of shared information content (originally measured as the IC of the MICA of the two concepts) has also been tuned to take into account the fact that concepts can have multiple parents [33], which is necessary in many life science fields, since it is in the nature of biomedical ontologies that some concepts are categorised under multiple parents (see https://github.com/lasige-BioTM/DiShIn for an example of software that computes this type of measure), and the fact that ontologies can contain disjointness axioms, which encode the fact that two concepts cannot share any descendants [34]; this is also important because life science ontologies, and especially chemistry ones, make use of such axioms [35].

• The way to measure shared information content has also been completely reimplemented to use, instead of the IC of the most informative common ancestor, a metric based on the set of all ancestors of the concepts [36].
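Equation (3), the Lin measure, can be sketched as follows; the "is-a" hierarchy and the annotation counts are again invented approximations of the toy ontology of **Figure 3**:

```python
import math

# Hypothetical "is-a" hierarchy and invented annotation counts.
IS_A = {
    "hallucinogen": ["drug"],
    "antifungal drug": ["drug", "antimicrobial agent"],
    "antiviral agent": ["antimicrobial agent"],
    "drug": ["application"],
    "fossil fuel": ["application"],
    "antimicrobial agent": ["biological role"],
    "application": ["role"],
    "biological role": ["role"],
}
COUNTS = {
    "role": 1000, "application": 600, "biological role": 500,
    "drug": 400, "antimicrobial agent": 300, "antifungal drug": 150,
    "antiviral agent": 120, "fossil fuel": 80, "hallucinogen": 40,
}

def ancestors(concept):
    result = {concept}
    for parent in IS_A.get(concept, ()):
        result |= ancestors(parent)
    return result

def ic(concept):
    return -math.log(COUNTS[concept] / COUNTS["role"])

def sim_lin(c1, c2):
    """Equation (3): twice the IC of the MICA, normalised by the sum
    of the concepts' own ICs; the result lies between 0.0 and 1.0."""
    mica_ic = max(ic(a) for a in ancestors(c1) & ancestors(c2))
    denominator = ic(c1) + ic(c2)
    # Guard against comparing the root with itself (both ICs are zero).
    return 2 * mica_ic / denominator if denominator else 0.0

print(sim_lin("hallucinogen", "antifungal drug"))  # ≈ 0.36
print(sim_lin("drug", "drug"))                     # 1.0 (identical concepts)
```

The zero-denominator guard is one possible convention; returning 1.0 for two identical root concepts would be an equally defensible choice.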

*Cheminformatics and Its Applications*



These measures are able to compare one concept with another. It is also possible to compare sets of concepts. For this, one takes the matrix of pairwise similarities between concepts in the first set and concepts in the second set and mathematically manipulates it to produce a single number, taking, for example, the average, the maximum, or the "best match average", an approach that averages the highest values in each row and column [22]. There are other approaches that convert a set of concepts into the set of all their ancestors and take the intersection of those sets as a measure of similarity (two examples are simUI and simGIC [22]).
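Two of these set-level strategies can be sketched directly: the "best match average" over a pairwise similarity matrix, and simUI, which takes the Jaccard index of the two sets of ancestors [22]. The matrix and ancestor sets below are invented for illustration:

```python
def best_match_average(matrix):
    """Average of the best match in each row and in each column [22]."""
    row_best = [max(row) for row in matrix]
    col_best = [max(column) for column in zip(*matrix)]
    return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))

def sim_ui(ancestors1, ancestors2):
    """simUI: Jaccard index of the two ancestor sets [22]."""
    return len(ancestors1 & ancestors2) / len(ancestors1 | ancestors2)

# Invented pairwise similarities between a set of three concepts (rows)
# and a set of two concepts (columns):
similarities = [
    [0.9, 0.1],
    [0.2, 0.7],
    [0.4, 0.3],
]
print(best_match_average(similarities))  # ≈ 0.72

# Invented ancestor sets, such as would be produced by a small
# "is-a" hierarchy of chemical roles:
print(sim_ui({"hallucinogen", "drug", "application", "role"},
             {"fossil fuel", "application", "role"}))  # 0.4
```

simGIC follows the same set-intersection idea but weighs each ancestor by its IC rather than counting all ancestors equally.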

Finally, there is a difference between measuring the *similarity* and the *relatedness* of concepts. Similarity is a term that is generally applied to the notion that two concepts are "alike" and is usually computed based on "is-a" hierarchies; relatedness is more general: two concepts can be related through their categorisation in a hierarchy or through any number of other non-hierarchical relations. This distinction is important in chemistry, and in ChEBI in particular, since many chemistry concepts are connected via relations such as "has-role", "has-part", "is-enantiomer-of", etc.

Notice that when nothing is known about a chemical compound other than its structure, semantic methods can still be used, because one of the ways ontologies (especially ChEBI) classify molecules is based on their structure. For example, ChEBI has a concept "carboxylic acid" which is an ancestor of all molecules that have one or more carboxylic acid groups (e.g., benzoic acid, all amino acids, all penicillins, etc.). This, however, is not conceptually different from measuring structural similarity, and such a setting would lack the enrichment provided by other types of knowledge (e.g., the knowledge of the chemical and biological roles of the molecule).
