**Ontology Learning Using Word Net Lexical Expansion and Text Mining**

Hiep Luong, Susan Gauch and Qiang Wang

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51141

## **1. Introduction**


In knowledge management systems, ontologies play an important role as a backbone for providing and accessing knowledge sources. They are largely used in the next generation of the Semantic Web, which focuses on supporting better cooperation between humans and machines [2]. Since manual ontology construction is costly, time-consuming, error-prone, and inflexible to change, it is hoped that an automated ontology learning process will result in more effective and more efficient ontology construction and also be able to create ontologies that better match a specific application [20]. Ontology learning has recently become a major research focus whose goal is to facilitate the construction of ontologies by decreasing the amount of effort required to produce an ontology for a new domain. However, most current approaches deal with narrowly-defined specific tasks or a single part of the ontology learning process rather than providing complete support to users. Few studies attempt to automate the entire ontology learning process, from collecting domain-specific literature and filtering out documents irrelevant to the domain, to text mining to build new ontologies or enrich existing ones.

The World Wide Web is a rich source of documents that is useful for ontology learning. However, because there is so much information of varying quality covering a huge range of topics, it is important to develop document discovery mechanisms based on intelligent techniques such as focused crawling [7] to make the collection process easier for a new domain. Even so, due to the huge number of retrieved documents, we still require an automatic mechanism rather than domain experts in order to separate out the documents that are truly relevant to the domain of interest. Text classification techniques can be used to perform this task.

© 2012 Luong et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In order to enrich an ontology's vocabulary, several ontology learning approaches attempt to extract relevant information from WordNet, a semantic network database for the English language developed by Princeton University [23]. WordNet provides a rich knowledge base in which concepts, called synonymy sets or synsets, are linked by semantic relations. However, a main barrier to exploiting the word relationships in WordNet is that most words have multiple senses. Due to this ambiguity, not all senses for a given word can be used as a source of vocabulary. Expanding a concept's vocabulary based on an incorrect word sense would add many unrelated words to that concept and degrade the quality of the overall ontology. Thus, candidate word senses must be filtered very carefully. Most existing approaches have had mixed results with sense disambiguation, so the vocabulary for a specific domain mined from WordNet typically requires further manual filtering to be useful.
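The sense-filtering problem described above can be made concrete with a toy gloss-overlap (simplified Lesk-style) scorer: each candidate sense is ranked by how many words its gloss shares with the ontology concept's context. The mock senses and glosses below are illustrative stand-ins, not real WordNet data or the chapter's actual method.

```python
# Toy illustration of sense filtering (mock data, not real WordNet).
# Each candidate sense of "cell" carries a gloss; we pick the sense whose
# gloss shares the most words with the ontology concept's context --
# a simplified Lesk-style overlap score.

MOCK_SENSES = {  # hypothetical glosses standing in for WordNet synsets
    "cell.biology": "the basic structural unit of all organisms tissue organism",
    "cell.phone": "a hand-held mobile radiotelephone for use in an area",
    "cell.room": "a small room in which a prisoner is locked",
}

def best_sense(context_words, senses=MOCK_SENSES):
    """Return the sense id whose gloss overlaps most with the context."""
    context = set(context_words)
    def overlap(item):
        return len(context & set(item[1].split()))
    return max(senses.items(), key=overlap)[0]

print(best_sense(["anatomy", "tissue", "organism", "morphology"]))
# -> cell.biology
```

A real system would draw glosses and relations from WordNet itself and combine this with the similarity measures surveyed later in the chapter.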


In our work, we employ a general ontology learning framework that extracts new relevant vocabulary words from two main sources, i.e., Web documents and WordNet. This framework can be used for ontologies in any domain; we demonstrate our approach on a biological domain, specifically amphibian anatomy and morphology. In this work, we are exploring two techniques for expanding the vocabulary in an ontology: 1) lexical expansion using WordNet; and 2) lexical expansion using text mining. The lexical expansion from WordNet approach accurately extracts new vocabulary for an ontology for any domain covered by WordNet. We start with a manually-created ontology on amphibian morphology. The words associated with each concept in the ontology, the concept-words, are mapped onto WordNet, and we employ a similarity computation method to identify the most relevant sense from the multiple senses returned by WordNet for a given concept-word. We then enrich the vocabulary for that original concept in the amphibian ontology by including the correct sense's associated synonyms and hypernyms.
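The final enrichment step — merging the disambiguated sense's synonyms and hypernyms into the concept's vocabulary — can be sketched as follows. The synset below is hypothetical mock data, not output of the chapter's actual pipeline.

```python
# Sketch of the enrichment step: once the correct sense of a concept-word
# has been chosen, its synonyms and hypernyms are merged into the
# concept's vocabulary. The synset below is mock data, not real WordNet.

chosen_sense = {  # hypothetical disambiguated synset for "femur"
    "synonyms": ["femur", "thighbone", "femoris"],
    "hypernyms": ["leg bone", "bone"],
}

def enrich_concept(vocabulary, sense):
    """Merge a sense's synonyms and hypernyms into the concept vocabulary."""
    enriched = set(vocabulary)
    enriched.update(sense["synonyms"])
    enriched.update(sense["hypernyms"])
    return sorted(enriched)

print(enrich_concept(["femur"], chosen_sense))
# -> ['bone', 'femoris', 'femur', 'leg bone', 'thighbone']
```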

Our text mining approach uses a focused crawler to retrieve documents related to the ontology's domain, i.e., amphibian anatomy and morphology, from a combination of general search engines, scholarly search engines, and online digital libraries. We use text classification to identify, from the set of all collected documents, those most likely to be relevant to the ontology's domain. Because it has been shown to be highly accurate, we use an SVM (Support Vector Machine) classifier for this task [6][40]. Finally, we implemented and evaluated several text mining techniques to extract information relevant to ontology enrichment from the surviving documents.

In this chapter, we describe our work on the ontology learning process and present experimental results for each of the two approaches. In section 2, we present a brief survey of current research on ontology learning, focused crawlers, document classification, information extraction, and the use of WordNet for learning new vocabulary and disambiguating word senses. In section 3, we present our ontology learning framework and our two approaches, i.e., lexical expansion using WordNet and text mining. Sections 4 and 5 describe these approaches in more detail and report the results of our evaluation experiments. The final section presents conclusions and discusses our ongoing and future work in this area.

## **2. Related Work**


An ontology is an explicit, formal specification of a shared conceptualization of a domain of interest [11], where formal implies that the ontology should be machine-readable and the domain can be any that is shared by a group or community. Much of current research into ontologies focuses on issues related to ontology construction and updating. In our view, there are two main approaches to ontology building: (i) manual construction of an ontology from scratch, and (ii) semi-automatic construction using tools or software with human intervention. It is hoped that semi-automatic generation of ontologies will substantially decrease the amount of human effort required in the process [12][18][24]. Because of the difficulty of the task, entirely automated approaches to ontology construction are currently not feasible.

Ontology learning has recently been studied to facilitate the semi-automatic construction of ontologies by ontology engineers or domain experts. Ontology learning uses methods from a diverse spectrum of fields such as machine learning, knowledge acquisition, natural language processing, information retrieval, artificial intelligence, reasoning, and database management [29]. Gómez-Pérez et al. [10] present a thorough summary of several ontology learning projects that are concerned with knowledge acquisition from a variety of sources such as text documents, dictionaries, knowledge bases, relational schemas, semi-structured data, etc. Omelayenko [24] discusses the applicability of machine learning algorithms to learning ontologies from Web documents and also surveys current ontology learning and other closely related approaches. Similar to our approach, the authors of [20] introduce an ontology learning framework for the Semantic Web that includes ontology importation, extraction, pruning, refinement, and evaluation, giving ontology engineers a wealth of coordinated tools for ontology modeling.
In addition to a general framework and architecture, they implemented the Text-To-Onto system, which supports ontology learning from free text, from dictionaries, or from legacy ontologies. However, they do not mention any automated support for collecting domain documents from the Web or for automatically identifying the domain-relevant documents needed by the ontology learning process. In another paper [21], Maedche et al. presented a comprehensive approach for bootstrapping an ontology-based information extraction system with the help of machine learning. They also presented an ontology learning framework, an important step in their overall bootstrapping approach, but it is described as a theoretical model and does not deal with the specific techniques used in their learning framework.
Agirre et al. [1] presented an automatic method that enriches very large ontologies, e.g., WordNet, using documents retrieved from the Web. However, in their approach, the query strategy is not entirely satisfactory in retrieving relevant documents, which affects the quality and performance of the topic signatures and clusters. Moreover, they do not apply any filtering techniques to verify that the retrieved documents are truly on-topic. Inspired by the idea of using WordNet to enrich an ontology's domain vocabulary, we presented the lexical expansion from WordNet approach [18], which provides a method for accurately extracting new vocabulary for an ontology for any domain covered by WordNet.

Many ontology learning approaches require a large collection of input documents in order to enrich the existing ontology [20]. Although most employ text documents [4], only a few deal with ontology enrichment from documents collected from the Web rather than a manually created, domain-relevant corpus. To create a corpus from the Web, one can use general purpose crawlers and search engines, but this approach faces problems with scalability due to the rapid growth of the Web. Focused crawlers, on the other hand, overcome this drawback, i.e., they yield good recall as well as good precision, by restricting themselves to a limited domain [7]. Ester et al. [7] introduce a generic framework for focused crawling consisting of two major components: (i) specification of the user interest and measuring the resulting relevance of a given Web page; and (ii) a crawling strategy.
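The two components of that generic framework — a relevance measure derived from the user's interest, and a crawl strategy that follows the most promising links first — can be sketched over a mock in-memory link graph (no network access; the site graph, topic terms, and threshold below are all illustrative assumptions, not the framework of [7] itself):

```python
import heapq

# Minimal focused-crawler sketch over a mock link graph. Pages are visited
# best-first by a keyword-overlap relevance score, and links from pages
# below the relevance threshold are not expanded.

LINKS = {  # hypothetical site graph: url -> (page words, outgoing links)
    "seed": ("amphibian anatomy portal", ["a", "b"]),
    "a": ("frog morphology and anatomy", ["c"]),
    "b": ("cellphone reviews and deals", ["d"]),
    "c": ("salamander skeletal anatomy", []),
    "d": ("latest phone prices", []),
}
TOPIC = {"amphibian", "anatomy", "morphology", "frog", "salamander"}

def relevance(text):
    """Fraction of a page's words that belong to the topic vocabulary."""
    words = set(text.split())
    return len(words & TOPIC) / len(words)

def focused_crawl(start, threshold=0.2):
    frontier = [(-1.0, start)]            # max-heap via negated priority
    seen, relevant = {start}, []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, outlinks = LINKS[url]
        score = relevance(text)
        if score < threshold:
            continue                      # off-topic: drop page, don't expand
        relevant.append(url)
        for nxt in outlinks:
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score, nxt))
    return relevant

print(focused_crawl("seed"))
# -> ['seed', 'a', 'c']
```

Note how the off-topic page "b" is fetched but filtered, so its outlink "d" is never crawled — the restriction to a limited domain that gives focused crawlers their precision.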


Rather than working with domain-relevant documents from which vocabulary can be extracted, some ontology construction techniques exploit specific online vocabulary resources. WordNet is an online semantic dictionary that partitions the lexicon into nouns, verbs, adjectives, and adverbs [23]. Some current researchers extract words from WordNet's lexical database to enrich ontology vocabularies [8][16][31]. In [27], Reiter et al. describe an approach that combines the Foundational Model of Anatomy with WordNet by using an algorithm for domain-specific word sense disambiguation. In another approach similar to ours, Speretta et al. exploit semantics by applying existing WordNet-based algorithms [31]. They calculate the relatedness between two words by applying a similarity algorithm, and evaluated the effect of adding a variable number of the highest-ranked candidate words to each concept. A Perl package called WordNet::Similarity [25] is a widely-used tool for measuring semantic similarity that contains implementations of eight semantic similarity algorithms. In our work, we evaluate the WordNet-based similarity algorithms using the JSWL package developed by [26], which implements some of the most common similarity and relatedness measures between words by exploiting the hyponymy relations among synsets.

Semantic similarity word sense disambiguation approaches in WordNet can be divided into two broad categories, based on (i) path length and (ii) information content. Warin et al. [37] describe a method of disambiguating between an ontology and WordNet using five different measures. These measures disambiguate semantic similarity based on information content (e.g., Lin [16], Jiang-Conrath [13], and Resnik [28]) or path length (e.g., Leacock-Chodorow [15] and Wu-Palmer [38]). They present a new method that disambiguates the words in their ontology, the Common Procurement Vocabulary. Semantic similarity can also be calculated using edge-counting techniques. Yang and Powers [39] present a new path-weighting model to measure semantic similarity in WordNet. They compared their model to a benchmark set by human similarity judgments and found that their geometric model simulates human judgments well. Varelas et al. [35] propose the Semantic Similarity Retrieval Model (SSRM), a general document similarity and information retrieval method suitable for retrieval in conventional document collections and on the Web. This approach is based on the term-based Vector Space Model, computing TF-IDF weights for the term representations of documents. These representations are then augmented with semantically similar terms (discovered from WordNet by applying a semantic query in the neighborhood of each term), and weights are recomputed for all new and pre-existing terms.
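The term-based Vector Space Model that SSRM starts from assigns each term a TF-IDF weight before any WordNet-based expansion. A minimal sketch, with an illustrative three-document corpus (the documents and the exact tf and idf formulas are assumptions; variants abound):

```python
import math

# TF-IDF sketch of the term-based Vector Space Model (the base that SSRM
# augments with WordNet-derived terms). Corpus is illustrative.

docs = [
    "amphibian anatomy of the frog",
    "frog limb morphology",
    "phone reviews",
]

def tf_idf(corpus):
    """Return one {term: weight} dict per document."""
    n = len(corpus)
    tokenized = [d.split() for d in corpus]
    df = {}                                  # document frequency per term
    for toks in tokenized:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    weights = []
    for toks in tokenized:
        w = {t: (toks.count(t) / len(toks)) * math.log(n / df[t])
             for t in set(toks)}
        weights.append(w)
    return weights

w = tf_idf(docs)
# "frog" occurs in two of the three documents, so its idf is log(3/2)
print(round(w[0]["frog"], 3))
```

SSRM would then add semantically similar terms discovered in WordNet to each vector and recompute these weights.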

Text mining, also known as text data mining or knowledge discovery from textual databases, refers generally to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents [3][12]. Tan [33] presents a good survey of text mining products and applications, aligning them based on the *text refining* and *knowledge distillation* functions they perform as well as the *intermediate form* they adopt. In terms of using text mining for the ontology learning task, Spasic et al. [30] summarize different approaches in which ontologies have been used for text-mining applications in biomedicine. In another work, Velardi et al. [36] present OntoLearn, a set of text-mining techniques to extract relevant concepts and concept instances from existing documents in a tourism domain. The authors devised techniques to (i) identify concepts, (ii) identify concept instances, (iii) organize such concepts into sub-hierarchies, and (iv) detect relatedness links among such concepts.

In order to improve the accuracy of the learned ontologies, the documents retrieved by focused crawlers may need to be automatically filtered using a text classification technique such as Support Vector Machines (SVM), k-Nearest Neighbors (kNN), Linear Least-Squares Fit, TF-IDF, etc. A thorough survey and comparison of such methods and their complexity is presented in [40], and the authors of [6] conclude that SVM is the most accurate for text classification and is also quick to train. SVM [34] is a machine learning model that finds an optimal hyperplane separating two classes and then classifies data points according to the side of the hyperplane on which they fall [5][14]. The k-nearest neighbors (kNN) algorithm is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors. In pattern recognition, kNN is a method for classifying objects based on the closest training examples in the feature space. It is a type of instance-based learning in which the function is only approximated locally and all computation is deferred until classification. This algorithm has also been used successfully in many text categorization applications [41].
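The kNN scheme just described is simple enough to sketch in full: a document is assigned the majority label among its k nearest training documents under cosine similarity over bag-of-words vectors. (SVM training needs an optimization library, so the simpler kNN variant is shown; the tiny training set is illustrative, not the chapter's corpus.)

```python
from collections import Counter
import math

# kNN text categorization sketch: majority vote among the k training
# documents most cosine-similar to the query (bag-of-words vectors).

TRAIN = [  # illustrative labeled documents
    ("frog anatomy and limb morphology", "relevant"),
    ("amphibian skeletal anatomy", "relevant"),
    ("phone reviews and prices", "off-topic"),
    ("best cellphone deals", "off-topic"),
]

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(text, train=TRAIN, k=3):
    query = Counter(text.split())
    scored = sorted(train,
                    key=lambda ex: cosine(query, Counter(ex[0].split())),
                    reverse=True)
    labels = [label for _, label in scored[:k]]
    return Counter(labels).most_common(1)[0][0]

print(knn_classify("salamander limb anatomy"))
# -> relevant
```

All computation happens at classification time — the instance-based, lazy-learning property noted above.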

## **3. Ontology Learning Framework**

#### **3.1. Architecture**

deal with ontology enrichment from documents collected from the Web rather than a man‐ ually created, domain-relevant corpus. To create a corpus from the Web, one can use general purpose crawlers and search engines, but this approach faces problems with scalability due to the rapid growth of the Web. Focused crawlers, on the other hand, overcome this draw‐ back, i.e., they yield good recall as well as good precision, by restricting themselves to a lim‐ ited domain [7]. Ester et al [7] introduce a generic framework for focused crawling consisting of two major components: (i) specification of the user interest and measuring the

Rather than working with domain-relevant documents from which vocabulary can be ex‐ tracted, some ontology construction techniques exploit specific online vocabulary resources. WordNet is an online semantic dictionary, partitioning the lexicon into nouns, verbs, adjec‐ tives, and adverbs [23]. Some current researchers extract words from WordNet's lexical da‐ tabase to enrich ontology vocabularies [8][16][31]. In [27], Reiter et al describe an approach that combines the Foundational Model of Anatomy with WordNet by using an algorithm for domain-specific word sense disambiguation. In another approach similar to ours, Speretta et al exploit semantics by applying existing WordNet-based algorithms [31]. They calculate the relatedness between the two words by applying a similarity algorithm, and evaluated the effect of adding a variable number of the highest-ranked candidate words to each concept. A Perl package called Word-Net::Similarity [25] is a widely-used tool for measuring semantic similarity that contains implementations of eight algorithms for measuring semantic similar‐ ity. In our work, we evaluate the WordNet-based similarity algorithms by using the JSWL package developed by [26] that implements some of the most commons similarity and relat‐ edness measures between words by exploiting the hyponymy relations among synsets.

Semantic similarity word sense disambiguation approaches in WordNet can be divided into two broad categories based on (i) path length and (ii) information content. Warinet al [37] describe a method of disambiguating an ontology and WordNet using five different meas‐ ures. These approaches disambiguate semantic similarity based on the information content, (e.g., Lin [16], Jiang-Conrath [13], and Resnik [28]) and path length method (e.g., Leacock-Chodorow [15], and Wu-Palmer [38]). They present a new method that disambiguatesthe words in their ontology, the Common Procurement Vocabulary. Semantic similarity can also be calculated using edge-counting techniques. Yang and Powers [39] present a new pathweighting model to measure semantic similarity in WordNet. They compare their model to a benchmark set by human similarity judgments and found that their geometric model sim‐ ulates human judgments well. Varelas et al [35] propose the Semantic Similarity Retrieval Model (SSRM), a general document similarity and information retrieval method suitable for retrieval in conventional document collections and the Web. This approach is based on the term-based Vector Space Model by computing TF-IDF weights to term representations of documents. These representations are then augmented by semantically similar terms (which are discovered from WordNet by applying a semantic query in the neighborhood of each

Text mining, also known as text data mining or knowledge discovery from textual databases, refers generally to the process of extracting interesting and non-trivial patterns or knowledge from text documents.



104 Theory and Applications for Advanced Text Mining

In this section, we present the architecture of our ontology learning framework (cf. Figure 1), which incorporates two approaches, i.e., lexical expansion and text mining, to identify new domain-relevant vocabulary for ontology enrichment. These approaches are presented in detail in Sections 4 and 5.

#### *3.1.1. Lexical expansion approach*

This approach starts from a small manually constructed ontology, then tries to mine relevant words from WordNet in order to enrich the vocabulary associated with ontology concepts. It consists of the following main steps:

**1.** Extract all single concept-words from the seed ontology. Filter these words by removing stop-words; locate each remaining word's senses within WordNet.

**2.** Build a reference hypernym tree for each word sense as a reference source for semantic similarity disambiguation.

**3.** If a concept-word has multiple senses, identify the correct word sense using a similarity computation algorithm on the reference hypernym tree.

**4.** Select the most similar sense and add its synonyms and hypernyms to the corresponding concept in the ontology.




**Figure 1.** Architecture of ontology learning framework.

#### *3.1.2. Text mining approach*

The main processes are as follows:

**1.** We begin with an existing small, manually-created amphibian morphology ontology [22]. From this, we automatically generate queries for each concept in the hierarchically-structured ontology.

**2.** We submit these queries to a variety of Web search engines and digital libraries. The program downloads the potentially relevant documents listed on the first page of results (top-ranked 10).

**3.** Next, we apply SVM classification to filter out documents in the search results that match the query well but are less relevant to the domain of our ontology. For example, a document containing the word "cell" in the context of cellphones or telecommunication will be filtered out, while documents containing "cell" in the context of amphibians, embryo structure, or biology will be kept.
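As a rough illustration of this filtering step, the sketch below scores a document by the domain-context words it contains. Note that this keyword scorer is only a hedged stand-in: the authors use an SVM classifier, and the word lists and function name here are invented for illustration.

```python
# Simplified stand-in for the SVM filtering step (illustrative only):
# keep a document when it mentions more domain-context words than
# off-topic context words. The authors' system trains an SVM instead.

DOMAIN_CONTEXT = {"amphibian", "embryo", "embryonic", "biological",
                  "anatomy", "morphology"}
OFF_TOPIC_CONTEXT = {"cellphone", "telecommunication", "battery", "network"}

def is_domain_relevant(text: str) -> bool:
    words = set(text.lower().split())
    domain_hits = len(words & DOMAIN_CONTEXT)
    off_topic_hits = len(words & OFF_TOPIC_CONTEXT)
    return domain_hits > off_topic_hits

print(is_domain_relevant("the cell membrane of an amphibian embryo"))
print(is_domain_relevant("cell coverage depends on the telecommunication network"))
```

The first document is kept and the second discarded, mirroring the "cell" example above.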

After the above process, we have created a collection of documents relevant to amphibian morphology. These are input to an information extraction (IE) system to mine information from documents that can be used to enrich the ontology.

#### **3.2. Domain and Ontology Application**


The need for terminological standardization of anatomy is pressing in amphibian morphological research [22]. A long-term NSF-sponsored project, AmphibAnat, aims to integrate the amphibian anatomical ontology knowledge base with systematic, biodiversity, embryological, and genomic resources. Another important goal of this project is to semi-automatically construct and enrich the amphibian anatomical ontology. An amphibian ontology will facilitate the integration of anatomical data representing all orders of amphibians, thus enhancing knowledge representation of amphibian biology and diversity.

**Figure 2.** A part of the amphibian ontology

Based on information in a manually constructed seed ontology, we use a focused crawler and data-mining software to mine electronic resources for instances of concepts and properties to be added to the existing ontologies [17][19]. We also use the concept-words of this ontology to match the corresponding relevant words and senses in WordNet. The current amphibian ontology created by this project consists of 1986 semantic concepts (with a maximum depth of 9) and 570 properties. Figure 2 presents a part of this ontology, which is available in two main formats: (i) OWL and (ii) OBO (Open Biomedical Ontology).
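For illustration, a concept in the OBO flat-file format is written as a `[Term]` stanza with `is_a` lines encoding the taxonomy. The identifiers below are hypothetical and are not actual AmphibAnat IDs:

```text
[Term]
id: AAO:0000001          ! hypothetical identifier
name: anatomical_structure

[Term]
id: AAO:0000123          ! hypothetical identifier
name: multi-tissue_structure
is_a: AAO:0000001 ! anatomical_structure
```

The `is_a` tag corresponds directly to the "IS A" relation used throughout the ontology.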

## **4. Extracting New Vocabulary**

In this section, we introduce our two main vocabulary-extraction approaches, one based on identifying information from an already-existing vocabulary resource (WordNet) and one based on extracting vocabulary directly from the domain literature. The advantage of using WordNet is that, because it was manually constructed, the words and relationships it contains are highly accurate. However, the drawback is that its vocabulary is broad and thus ambiguous. Also, for specialized domains, appropriate words and/or word senses are likely to be missing. Thus, we contrast this approach with one that works directly on the domain literature. In this case, the appropriate words and word senses are present and ambiguity is less of a factor. However, the relationships between the word senses are implicit in how they are used in the text rather than explicitly represented.



#### **4.1. Lexical Expansion Approach**

This approach attempts to identify the correct WordNet sense for each concept-word and then add that sense's synsets and hypernyms as new vocabulary for the associated concept. We base our concept-word sense disambiguation on comparing the various candidate senses to those in a reference source of senses for the domain of the ontology. To provide this standard reference, we manually disambiguate the WordNet senses for the concept-words in the top two levels of the amphibian ontology. We then create a reference hypernym tree that contains the WordNet hypernyms of these disambiguated WordNet senses. The problem of resolving the lexical ambiguity that occurs whenever a given concept-word has several different meanings is now simplified to comparing the hypernym tree for each candidate word sense with the reference hypernym tree. We then compute a tree similarity metric between each candidate hypernym tree and the reference hypernym tree to identify the most similar hypernym tree and, by association, the word sense most closely related to the ontology domain.

#### *4.1.1. WordNet Synonym and Hypernym*

WordNet has become a broad-coverage thesaurus that is now widely used in natural language processing and information retrieval [32]. Words in WordNet are organized into synonym sets, called *synsets*, each representing a concept by a set of words with similar meanings. For example, frog, toad frog, and batrachian are all words in the same synset. Hypernymy, or the IS-A relation, is the main relation type in WordNet. Simply put, *"Y is a hypernym of X if every X is a (kind of) Y"*. All hypernym levels of a word can be structured in a hierarchy in which the meanings are arranged from the most specific at the lower levels up to the most general at the top; for example, "amphibian" is a hypernym of "frog", "vertebrate, craniate" is a hypernym of "amphibian", and so on.

For a given word, we can build a hypernym tree including all hypernyms in their hierarchy returned by WordNet. When a word has multiple senses, we get a set of hypernym trees, one per sense. In order to find the hypernyms of a given word in WordNet, we must provide the word and the syntactic class in which we are interested. In our approach, since ontologies generally represent information about objects, we restrict the word senses considered to only the noun senses for a candidate word.
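As a toy illustration, a hypernym chain can be built by following IS-A links upward until the most general concept is reached. The miniature IS-A map below is hand-made for the example, not pulled from WordNet:

```python
# Toy IS-A map: a tiny, hand-made stand-in for WordNet's hypernym links.
IS_A = {
    "frog": "amphibian",
    "amphibian": "vertebrate",
    "vertebrate": "chordate",
    "chordate": "animal",
    "animal": "organism",
    "organism": "living_thing",
    "living_thing": "whole",
    "whole": "object",
    "object": "physical_entity",
    "physical_entity": "entity",
}

def hypernym_chain(word):
    """Follow IS-A links upward, returning the chain ordered from the
    most general concept ('entity') down to the word itself."""
    chain = [word]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return list(reversed(chain))

print(hypernym_chain("frog")[:3])  # ['entity', 'physical_entity', 'object']
```

A word with several senses would simply yield one such chain per sense.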

#### *4.1.2. Reference Hypernym Tree*


Because WordNet organizes nouns into IS-A hypernym hierarchies that provide taxonomic information, it is useful for identifying semantic similarity between words [25]. Our approach constructs a reference hypernym tree (or standard hypernym tree) from a few manually disambiguated words. When a word has multiple senses, we construct the hypernym tree for each sense. Then, we calculate how close each candidate sense's hypernym tree is to this reference source. We argue that the more similar a hypernym tree is to the reference tree, the closer the word sense is to the key concepts in the ontology.


To build the reference hypernym tree, we consider only the top two levels of concepts of the amphibian ontology, since they cover much of the vocabulary appearing at lower levels. In addition, since the concepts are related using the "IS A" relation, the hypernyms of the concepts in the top two levels of the ontology are also hypernyms of the concepts in the lower levels. The main steps of this process are as follows: first, we convert each concept in the top two levels into concept-words that can be submitted to WordNet; for instance, the concept "anatomical\_structure" is divided into the two concept-words "anatomical" and "structure". The top two levels of our ontology contain 15 concepts and 19 concept-words. Then, we use WordNet to collect all the hypernyms for each concept-word. We manually choose the hypernym that best matches the meaning in our domain and add it to the reference hypernym tree. Figure 3 presents a part of our reference hypernym tree, which contains 110 hypernyms covering 12 hierarchical levels. This reference tree is used as the ground truth to evaluate the correct sense extracted from WordNet for each experiment word.

**Figure 3.** A part of the reference hypernym tree.
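The concept-name splitting described above can be sketched as follows. The stopword list is abbreviated and the helper name is illustrative, not the authors' implementation:

```python
import re

STOPWORDS = {"of", "to", "for", "the", "and"}   # abbreviated list

def concept_words(concept_names):
    """Split concept names such as 'anatomical_structure' into unique
    single concept-words, dropping stopwords and bare numbers."""
    words = set()
    for name in concept_names:
        for w in re.split(r"[_\-\s]+", name.lower()):
            if w and w not in STOPWORDS and not w.isdigit():
                words.add(w)
    return sorted(words)

print(concept_words(["anatomical_structure", "multi-tissue_structure"]))
# ['anatomical', 'multi', 'structure', 'tissue']
```

Note how the duplicate concept-word "structure" appears only once, matching the deduplication step described later in Section 4.1.4.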

#### *4.1.3. Similarity Computation Algorithm*

In this section, we present our tree-based similarity measurement technique that is used to compare the hypernym trees corresponding to different senses of a concept-word with the reference hypernym tree. Our similarity measure is based on two factors:

**•** *Matched Tree Depth (D):* Since all hypernym trees are represented from the most general (top level) to the more specific meanings (lower levels), matches between hypernyms in the lower levels of the two trees should be ranked higher. For example, matching on "frog" is more important than matching on "vertebrate". We calculate the depth (D) as the distance from the most general hypernym (i.e., "entity") to the first hypernym that occurs in both the reference hypernym tree and a candidate hypernym tree. A higher matched tree depth indicates that the meaning of the matched word is more specific, and thus more relevant to the domain.

**•** *Number of common elements (CE):* Normally, if two trees contain many of the same entities, they are judged more similar. Once we determine the depth (D), we count the number of common elements between the reference hypernym tree and a candidate tree. If two candidate trees have the same match depth, the tree with more elements in common with the reference hypernym tree is considered the more similar one.



Once the hypernym tree is constructed for each sense, we apply the similarity computation algorithm to calculate the two factors above (i.e., D and CE) for each candidate sense. We then weight and combine the two factors and select the sense with the highest value.

Figure 4 shows an example of how the reference hypernym tree can be used to disambiguate word senses and select the most relevant one. Suppose that the hypernym trees in this example, i.e., HTree1, HTree2 and HTree3, each correspond to one sense of a given concept-word. We compare each hypernym tree to the reference hypernym tree by calculating the matched tree depth and the number of common elements between the two trees. The results show that the first tree, HTree1, is closest to the standard tree, with a depth (D=6) and number of common elements (CE=13) greater than the others.


**Figure 4.** Similarity computation using the Reference (Standard) Hypernym Tree.
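The two-factor comparison can be sketched as follows. This is an illustrative reading of the D/CE scoring, not the authors' code: the reference tree is flattened to a map from hypernym to depth, each candidate sense is a root-to-leaf hypernym path, and the toy data is invented:

```python
# Illustrative sketch of the D/CE scoring (names and data are toy examples).
# The reference hypernym tree is flattened to {hypernym: depth}; each
# candidate sense is its root-to-leaf hypernym path starting at "entity".

def tree_similarity(reference_depths, candidate_path):
    common = [h for h in candidate_path if h in reference_depths]
    ce = len(common)                                            # CE
    d = max((reference_depths[h] for h in common), default=0)   # D
    return d, ce

def best_sense(reference_depths, sense_paths):
    # Rank senses by D + CE (the #Depth+CE method).
    return max(sense_paths,
               key=lambda s: sum(tree_similarity(reference_depths, sense_paths[s])))

REFERENCE = {"entity": 1, "physical_entity": 2, "object": 3, "whole": 4,
             "living_thing": 5, "organism": 6, "animal": 7, "chordate": 8,
             "vertebrate": 9, "amphibian": 10}

SENSES = {  # two hypothetical senses of the concept-word "frog"
    "frog#animal": ["entity", "physical_entity", "object", "whole",
                    "living_thing", "organism", "animal", "chordate",
                    "vertebrate", "amphibian", "frog"],
    "frog#other": ["entity", "abstraction", "attribute", "state", "frog"],
}

print(best_sense(REFERENCE, SENSES))  # frog#animal
```

The animal sense wins because its hypernym path matches the reference tree both deeply (high D) and broadly (high CE).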

#### *4.1.4. Experiments*


#### *a. Dataset and Experiments*

As discussed in Section 3.2, we use the amphibian morphology ontology developed by the AmphibAnat project [22] for our experiments. We used WordNet (3.0) as the lexical source and the MIT Java WordNet Interface (JWI 2.1.5) API library to locate the synsets and hypernyms for a given concept-word.

We begin by processing the names of the concepts in the top two levels of the amphibian ontology and manually disambiguating them to create the reference hypernym tree. We then process the names of the 1971 concepts in levels 3 and below. Since many concepts are named with phrases of two or more words, and WordNet only contains single words, we need to convert these phrases into single words, i.e., the concept-words. After removing duplicate concept-words (e.g., the concepts "anatomical\_structure" and "multi-tissue\_structure" share the concept-word "structure"), stopwords (e.g., of, to, for), and numbers, we end up with a set of 877 unique concept-words. However, because the ontology is very domain specific, many of these concept-words do not appear in WordNet at all. Very specific terms in the amphibian domain (e.g., premaxilla, dorsalis), adjectives (e.g., nervous, embryonic, hermaphroditic), and incomplete words added during the ontology creation process (e.g., sp, aa) do not appear in WordNet. Therefore, we report results using only the 308 concept-words that were contained in WordNet.

We matched these 308 words to WordNet to identify all their corresponding senses, synonyms and hypernyms. However, not all of the senses for a given word are relevant, so we need to disambiguate the semantic similarity of these senses to choose the closest and most correct one for our amphibian domain. We applied and evaluated each of three similarity computation algorithms to determine the closest sense for each word: 1) based on the matched tree depth (#Depth); 2) based on the number of common elements (#CE); and 3) based on the sum of both factors (#Depth+CE). For each method, we obtained a list of word senses selected by the algorithm and compared these senses to the truth list provided by our human expert to evaluate the effectiveness of the algorithm.


#### *b. Evaluation Measures*

In order to evaluate the results of both our approaches, i.e., Lexical Expansion and Text Mining, we needed to know the correct word sense (if any) for each word. An expert in the amphibian domain helped us by independently judging the words and senses identified by each technique and identifying which were correct. These judgments formed the truth lists against which the words extracted by text mining and the senses chosen by lexical expansion were compared.

Information extraction effectiveness is measured in terms of the classic Information Retrieval metrics of Precision, Recall and F-measure. We used these measures for both approaches, but with somewhat different definitions for each. In the lexical expansion approach, since each word in WordNet may have several senses and we have to find the correct one, we define:

**•** *Precision (P):* measures the percentage of the returned words for which the sense identified by our algorithm matched the sense judged correct by the human expert.

$$P = \frac{\#\,\text{words with correct sense identified}}{\#\,\text{words returned}} \tag{1}$$

**•** *Recall (R):* measures the percentage of the correct senses in the truth list that were identified by our algorithm.

$$R = \frac{\#\,\text{words with correct sense identified}}{\#\,\text{truth-list words}} \tag{2}$$

We use the F-measure, which is calculated as follows, for both approaches:

$$F_{\beta} = \frac{(1 + \beta^{2}) \cdot P \cdot R}{\beta^{2} \cdot P + R} \tag{3}$$

Because we want to enhance the ontology with only truly relevant words, we want a metric that is biased towards high precision rather than high recall. We chose to use the F-measure with β = 0.25, which weights precision four times higher than recall.
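For reference, the precision-biased F-measure of Equation (3) can be computed as follows (the function name and sample values are illustrative):

```python
def f_measure(p, r, beta=0.25):
    """F-measure of Equation (3); beta < 1 favors precision
    (beta = 0.25 weights precision four times higher than recall)."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

print(round(f_measure(0.8, 0.4), 3))  # 0.756
```

With β = 0.25, a precise-but-incomplete result (P=0.8, R=0.4) still scores close to its precision, reflecting the bias described above.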

#### *4.1.5. Result Evaluation*

Of the 308 words returned from WordNet, the human expert judged that 252 of the extracted words were correct. For the 56 incorrect words, WordNet contained no senses relevant to the amphibian domain. Since each of the 252 returned words may have more than one correct sense as judged by the human expert, we finally obtained 285 correct senses. In order to see how well each algorithm performed, we applied different thresholds to see how the results varied as we increased our selectivity based on the level of match and/or the number of common elements matched. We varied the depth from 1 to 10, the common elements from 2 to 20, and the sum of the two from 3 to 30. For example, with the #Depth+CE method, the threshold t ranges from 3 to 30; the value t=15 means that we keep only senses whose #Depth+CE value is greater than 15. Figures 5, 6 and 7 show the F-measures achieved by the #Depth, #CE and #Depth+CE methods, respectively, using various threshold values.

**Figure 5.** F-measure of the #Depth only method (β=0.25).

**Figure 6.** F-measure of the #CE only method (β=0.25).

