#### *4.1.5. Result Evaluation*

A truth list was created by our human expert to evaluate the effectiveness of our algorithm: words returned by the algorithm were judged correct when they matched those from the truth-list words, and on this basis the #Depth, #CE, and #Depth+CE approaches to lexical expansion were compared.

#### *b. Evaluation Measures*

Counting a word as correct when the sense identified by the algorithm matches the correct one, we define:

$$P = \frac{\#\_words\_having\_correct\_sense\_identified}{\#\_words\_returned} \tag{1}$$

$$R = \frac{\#\_words\_having\_correct\_sense\_identified}{\#\_truth\_list\_words} \tag{2}$$

Because we want to enhance the ontology with only truly relevant words, we want a metric that is biased towards high precision rather than high recall. We therefore chose the F-measure with a β value that weights precision four times higher than recall (β = 0.25). The F-measure is calculated as follows for both approaches:

$$F_{\beta} = \frac{(1+\beta^{2}) \cdot P \cdot R}{\beta^{2} \cdot P + R} \tag{3}$$

Among the 308 words returned from WordNet, the human expert judged that 252 of the extracted words were correct. For the 56 incorrect words returned, no sense in WordNet identified by our algorithm matched the sense judged correct by the human expert.

**Figure 6.** F-measure of the #CE only method (β=0.25).
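As a quick sanity check on equation (3), the short Python sketch below (ours, for illustration only, not from the chapter) reproduces the F-measure of 0.72 reported in the next paragraph for the best #Depth+CE result (precision 74%, recall 50%):

```python
def f_beta(p: float, r: float, beta: float = 0.25) -> float:
    """F-measure from equation (3); beta=0.25 weights precision
    four times higher than recall."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Best #Depth+CE result reported below: P=74%, R=50% -> F ~= 0.72
print(round(f_beta(0.74, 0.50), 2))  # 0.72
```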

Using depth alone (#Depth) gave poor results, with a maximum F-measure of 0.42 at a threshold of 1; the common-elements approach (#CE) performed better, with a maximum F-measure of 0.67 at a threshold of 10. The best result, however, was achieved using the sum of depth and common elements (#Depth+CE) with a threshold t=15, which produced a precision of 74%, a recall of 50%, and an F-measure of 0.72. In this case, the algorithm found 155 words, of which 115 were correct. Table 1 presents the number of unique synonym and hypernym words we could add to the ontology from WordNet based on our disambiguated senses.

**Figure 7.** F-measure of the #Depth+CE method (β=0.25).


**Table 1.** Number of retrieved synonym and hypernym words.

| Threshold | 3 | 6 | 9 | 12 | 15 | 18 | 21 | 24 | 27 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| # words returned | 308 | 291 | 271 | 225 | 155 | 98 | 62 | 17 | 8 | 4 |
| # words correct | 190 | 178 | 176 | 149 | 115 | 70 | 38 | 10 | 1 | 1 |
| # SYN | 417 | 352 | 346 | 308 | 231 | 140 | 78 | 26 | 1 | 1 |
| # HYN | 164 | 156 | 151 | 123 | 80 | 48 | 42 | 12 | 2 | 2 |

To better understand this approach, we present an example based on the concept-word "cavity." WordNet contains four different senses for this word: 1) pit, cavity; 2) cavity, enclosed space; 3) cavity, caries, dental caries, tooth decay; and 4) cavity, bodily cavity, cavum. Based on the total value of #Depth+CE, our algorithm correctly selects the fourth sense as the most relevant for the amphibian ontology (cf. Table 2). Based on this sense, we then take the synonyms *[cavity, bodily\_cavity, cavum]* and hypernyms *[structure, anatomical\_structure, complex\_body\_part, bodily\_structure, body\_structure]* as words to enrich the vocabulary for the concept "cavity". This process is applied to all concept-words in the seed amphibian ontology, and the newly mined vocabularies are used to enrich the corresponding concepts of the ontology.

**Table 2.** Example of semantic similarity computation for the concept-word "cavity".

| | #Depth | #CE | #Depth+CE |
|---|---|---|---|
| 1st sense | 5 | 7 | 12 |
| 2nd sense | 4 | 7 | 11 |
| 3rd sense | 10 | 7 | 17 |
| **4th sense** | **15** | **7** | **22** |
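As a schematic sketch of the selection rule (not the authors' implementation), the scores in Table 2 drive a simple argmax over the #Depth+CE total:

```python
# Scores taken from Table 2 for the four WordNet senses of "cavity":
# (sense gloss, #Depth, #CE)
senses = [
    ("pit, cavity",                    5, 7),
    ("cavity, enclosed space",         4, 7),
    ("cavity, caries, dental caries", 10, 7),
    ("cavity, bodily cavity, cavum",  15, 7),
]

# Pick the sense with the highest #Depth+CE total, as described above.
best = max(senses, key=lambda s: s[1] + s[2])
print(best[0], "->", best[1] + best[2])  # cavity, bodily cavity, cavum -> 22
```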

#### **4.2. Text Mining Approach**



#### *4.2.1. Searching and Collecting Documents*

In order to collect a corpus of documents from which ontological enrichments can be mined, we use the seed ontology as input to our focused search. For each concept in a selected subset of the ontology, we generate a query that is then submitted to two main sources, i.e., search engines and digital libraries. To aid in query generation strategies, we created an interactive system that enables us to create queries from existing concepts in the ontology and allows us to change parameters such as the website address, the number of returned results, the format of returned documents, etc.

We next automatically submit the ontology-generated queries to multiple search engines and digital libraries related to the domain (e.g., Google, Yahoo, Google Scholar, http://www.amphibanat.org). For each query, we process the top 10 results from each search site using an HTML parser to extract the hyperlinks for collecting documents.
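A simplified sketch of this collection step, using only the Python standard library; the results-page URL and the cut-off at ten links are illustrative assumptions, since the chapter does not specify the parser used:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags on a results page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

def top_links(results_url: str, k: int = 10) -> list[str]:
    """Fetch a search-results page and return its first k hyperlinks."""
    parser = LinkExtractor()
    html = urlopen(results_url).read().decode("utf-8", errors="ignore")
    parser.feed(html)
    return parser.links[:k]
```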

#### *4.2.2. Classifying and Filtering Documents*

Although documents are retrieved selectively through restricted queries and focused search, the results can still contain documents that are less relevant or not relevant at all. Therefore, we need a mechanism to evaluate and verify the relevance of these documents to the predefined domain of the ontology. To remove unwanted documents, we first automatically discard those that are blank or too short, duplicated, or in a format unsuitable for text processing. We then use the LIBSVM classification tool [5] to separate the remaining documents into two categories: (i) relevant and (ii) non-relevant to the domain of the ontology. We also varied different LIBSVM parameters, such as kernel type (-t), degree (-d), and weight (-wi), in order to select the best parameters for our classification. Only those documents that are deemed truly relevant are input to the pattern extraction process.

The SVM classification algorithm must first be trained, based on labeled examples, so that it can accurately predict unknown data (i.e., testing data). The training phase consists of finding a hyperplane that separates the elements belonging to two different classes. Our topic-focused search combined with the SVM classification, as described in [17], is 77.5% accurate. In order to evaluate our text mining approach in the absence of noise, we report in this paper our results based on the labeled relevant training examples. In the future, the topic-focused crawler will be used to feed directly into the text mining process.
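The chapter uses the LIBSVM tool directly; purely as an illustration, the same relevance filter and parameter sweep can be sketched with scikit-learn's SVC (which wraps LIBSVM), where the kernel, degree, and class-weight settings roughly mirror LIBSVM's -t, -d, and -wi options:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def train_relevance_filter(docs: list[str], labels: list[int]):
    """Train a relevant/non-relevant classifier over document texts.
    labels: 1 = relevant to the ontology domain, 0 = non-relevant."""
    pipeline = make_pipeline(TfidfVectorizer(), SVC())
    grid = {
        "svc__kernel": ["linear", "poly", "rbf"],  # analogous to LIBSVM -t
        "svc__degree": [2, 3, 4],                  # analogous to LIBSVM -d
        "svc__class_weight": [None, "balanced"],   # analogous to LIBSVM -wi
    }
    search = GridSearchCV(pipeline, grid, cv=5)
    search.fit(docs, labels)
    return search.best_estimator_
```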

#### *4.2.3. Information Extraction using Text Mining*

After the topic-specific search produces a set of documents related to the amphibian morphology domain, this phase mines important vocabulary from the text of the documents. Specifically, our goal is to extract a set of words that are most related to the domain ontology concept-words. We have implemented and evaluated a vector space approach using two methods to calculate the *tf\*idf* weights. Since most ontology concept-words are nouns, we also explored the effect of restricting our extracted words to nouns only. The weight calculation methods compared were *document-based selection* and *corpus-based selection*. In both approaches, in order to give high weights to words important to the domain, we pre-calculated the *idf* (inverse document frequency) from a collection of 10,000 documents that were randomly downloaded from a broad selection of categories in the ODP<sup>1</sup> collection.
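For concreteness, a minimal sketch of this *idf* pre-computation over the background collection, following equation (6) below; the regex tokenizer is a placeholder assumption:

```python
import math
import re

def precompute_idf(background_docs: list[str]) -> dict[str, float]:
    """idf_i = log(|D| / |{d : t_i in d}|), computed once over a
    background collection (e.g., the 10,000 ODP documents)."""
    n_docs = len(background_docs)
    doc_freq: dict[str, int] = {}
    for doc in background_docs:
        for term in set(re.findall(r"[a-z]+", doc.lower())):  # naive tokenizer
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}
```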

#### *a. Document-based selection (L1)*

This method, *L1*, first calculates weights of words in each document using *tf\*idf* as follows:

$$W(i,j) = rtf_{(i,j)} \cdot idf_{i} \tag{4}$$

$$rtf_{(i,j)} = \frac{tf_{(i,j)}}{N(j)} \tag{5}$$


$$idf_{i} = \log \frac{|D|}{|\{d : t_{i} \in d\}|} \tag{6}$$

where

*W(i,j)* is the weight of term *i* in document *j*

*rtf(i,j)* is the relative term frequency of term *i* in document *j*

*tf(i,j)* is the term frequency of term *i* in document *j*

*N(j)* is the number of words in document *j*

*idf<sub>i</sub>* is the inverse document frequency of term *i*, which is pre-calculated across 10,000 ODP documents

*|D|* is the total number of documents in the corpus

*|{d : t<sub>i</sub> ∈ d}|* is the number of documents in which *t<sub>i</sub>* appears

<sup>1</sup> Open Directory Project http://www.dmoz.org/

A word list sorted by weights is generated for each document, from which the top k words are selected. These per-document word lists are then merged into a single list ranked by weight. We performed some preliminary experiments, not reported here, which varied k from 1 to 110. The results reported here use k = 30, a value that was found to perform best.
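A minimal sketch of the L1 selection under the definitions above; `idf` is the pre-computed table from the background collection, and keeping the highest weight for a word seen in several per-document lists is our assumption, since the chapter only says the lists are merged and ranked:

```python
import re
from collections import Counter

def l1_select(docs: list[str], idf: dict[str, float],
              k: int = 30) -> list[tuple[str, float]]:
    """Document-based selection (L1): weight terms per document with
    rtf*idf (equations (4)-(5)), keep the top k per document, then
    merge into one list ranked by weight."""
    merged: dict[str, float] = {}
    for doc in docs:
        terms = re.findall(r"[a-z]+", doc.lower())  # naive tokenizer (assumption)
        n_j = len(terms) or 1                       # N(j): words in document j
        weights = {t: (c / n_j) * idf.get(t, 0.0)
                   for t, c in Counter(terms).items()}
        for t, w in sorted(weights.items(), key=lambda x: -x[1])[:k]:
            merged[t] = max(merged.get(t, 0.0), w)  # keep best weight seen
    return sorted(merged.items(), key=lambda x: -x[1])
```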

#### *b. Corpus-based selection (L2)*

This method, *L2*, calculates weights of words by using *sum(tf)\*idf*. Thus, the collection-based frequency is used to identify a single word list, rather than selecting a word list for each document based on the within-document frequency and then merging, as is done in method L1. The formula is thus:

$$W(i) = \sum_{j=1}^{n} rtf_{(i,j)} \cdot idf_{i} \tag{7}$$

where

*W(i)* is the weight of term *i* in the corpus

The other factors are calculated as in the L1 method.
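A corresponding sketch of L2, accumulating *rtf\*idf* over the whole collection per equation (7) instead of selecting per document (again an illustration under the same tokenizer assumption, not the authors' code):

```python
import re
from collections import Counter

def l2_select(docs: list[str], idf: dict[str, float]) -> list[tuple[str, float]]:
    """Corpus-based selection (L2): W(i) = sum over documents j of
    rtf(i,j) * idf_i, yielding a single ranked word list."""
    weight: dict[str, float] = {}
    for doc in docs:
        terms = re.findall(r"[a-z]+", doc.lower())
        n_j = len(terms) or 1  # N(j)
        for t, c in Counter(terms).items():
            weight[t] = weight.get(t, 0.0) + (c / n_j) * idf.get(t, 0.0)
    return sorted(weight.items(), key=lambda x: -x[1])
```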


#### *c. Part-of-speech restriction (L1N, L2N)*

For each of the previous approaches, we implemented a version that removed all non-nouns from the final word list. We call these word lists *L1N* and *L2N*, corresponding to the subsets of words in lists *L1* and *L2*, respectively, that are tagged as nouns using the WordNet library [25] via JWI (the MIT Java WordNet Interface).
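The chapter uses JWI, a Java library; as a swapped-in stand-in for illustration, NLTK's WordNet interface in Python applies the same noun-only restriction:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def nouns_only(word_list: list[str]) -> list[str]:
    """Keep only words with at least one noun sense in WordNet,
    mirroring the L1N/L2N restriction (NLTK stand-in for JWI)."""
    return [w for w in word_list if wn.synsets(w, pos=wn.NOUN)]
```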

#### *4.2.4. Experiments*

#### *a. Dataset and Experiments*

The current amphibian ontology is large, and our goal is to develop techniques that minimize manual effort by growing the ontology from a small seed ontology. Thus, rather than using the whole ontology as input to the system, for our experiments we used a subset of only the five top-level concepts from the ontology, whose meanings broadly cover the amphibian domain. Ultimately, we hope to compare the larger ontology we build to the full ontology built by the domain expert.

We found that when we expand a query containing the concept name with keywords describing the ontology domain overall, we get a larger number of relevant results. Based on these explorations, we created an automated module that, given a concept in the ontology, currently generates 3 queries with the expansion added, e.g., "amphibian" "morphology" "pdf". From each of the five concepts, we generate three queries, for a total of 15 automatically generated queries. Each query is then submitted to each of the four search sites, from which the top ten results are requested. This results in a maximum of 600 documents to process. However, because some search sites return fewer than ten results for some queries, because we perform syntactic filtering, and because search engines return duplicate documents, in practice this number was somewhat smaller.
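A sketch of this query-generation module; the chapter gives "amphibian" "morphology" "pdf" as one expansion example but does not list all three variants per concept, so the expansion lists below are hypothetical placeholders:

```python
SEARCH_SITES = ["Google", "Yahoo", "Google Scholar", "www.amphibanat.org"]

# Hypothetical expansions; only the first follows the example in the text.
EXPANSIONS = [
    ['"amphibian"', '"morphology"', '"pdf"'],
    ['"amphibian"', '"morphology"'],
    ['"amphibian"'],
]

def generate_queries(concepts: list[str]) -> list[str]:
    """Five concepts x three expansions = 15 queries, each later sent to
    the four search sites (top ten results each, max 600 documents)."""
    return [f'"{c}" ' + " ".join(exp) for c in concepts for exp in EXPANSIONS]
```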

#### *4.2.5. Results*

For this evaluation we again use precision and recall, defined here as:

$$P = \frac{\#\_correct\_words\_identified}{\#\_words\_identified}$$

$$R = \frac{\#\_correct\_words\_identified}{\#\_truth\_list\_words}$$

Based on these two measures, we also calculate the F-measure as in equation (3) to evaluate the methods' performances.

We evaluated our results by comparing the candidate word lists that were extracted from the relevant documents using our algorithms with the judgments provided by our human domain expert. Since not all words on the word lists are likely to be relevant, we varied how many of the top-weighted words were used. We chose threshold values t from 0.1 to 1.0, corresponding to the percentage of top candidate words that are extracted (e.g., t=0.1 means that the top 10% of words are selected). We carried out 6 different tests corresponding to the four candidate lists, i.e., *L1*, *L2*, *L1N*, *L2N*, and two more cases, *L1+L2* (average of *L1* and *L2*) and *L1N+L2N* (average of *L1N* and *L2N*), as input to our algorithm. These tests are named by their list names: *L1*, *L2*, *L1+L2*, *L1N*, *L2N*, and *L1N+L2N*. Figure 8 presents the F-measures achieved by these tests using various threshold values.

**Figure 8.** F-measure of the tests in Group\_A (β=0.25).

The best result was achieved in the test L1N, using the highest-weighted nouns extracted from individual documents. By analyzing the results, we find that the best performance is achieved with a threshold t=0.6, i.e., the top 60% of the words (277 words total) in the candidate list are used. This threshold produced a precision of 88% and a recall of 58%, meaning that 167 words were added to the ontology, of which 147 were correct.
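To make the threshold sweep concrete, a sketch of the evaluation loop (ours, for illustration): the ranked candidate list and truth list are inputs, and the F-measure is equation (3) with β=0.25:

```python
def sweep_thresholds(ranked_words: list[str], truth_list: list[str],
                     beta: float = 0.25) -> list[tuple[float, float, float, float]]:
    """Evaluate (t, P, R, F) at t = 0.1 ... 1.0, where t is the fraction
    of top-ranked candidate words kept (t=0.6 was best for L1N)."""
    truth = set(truth_list)
    results = []
    for i in range(1, 11):
        t = i / 10
        kept = ranked_words[: int(t * len(ranked_words))]
        correct = sum(1 for w in kept if w in truth)
        p = correct / len(kept) if kept else 0.0
        r = correct / len(truth) if truth else 0.0
        f = (1 + beta**2) * p * r / (beta**2 * p + r) if p and r else 0.0
        results.append((t, p, r, f))
    return results
```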
