122 were judged irrelevant to the amphibian morphology domain; in practice this number was somewhat smaller. We report results for the 6 techniques, i.e., *L1*, *L2*, *L1+L2*, *L1N*, *L2N* and *L1N+L2N*, and validate the best-performing technique using *Group\_B*.

#### *b. Evaluation Measures*

In the text mining approach, we define precision P as the fraction of the candidate words that matched those from the truth-list words, and recall R as the fraction of the truth-list words that matched those from the candidate words:

$$P = \frac{\#\,correct\_words\_identified}{\#\,candidate\_words} \tag{8}$$

$$R = \frac{\#\,correct\_words\_identified}{\#\,truth\text{-}list\_words} \tag{9}$$
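To make these measures concrete, here is a minimal Python sketch of equations (8) and (9) together with the β-weighted F-measure used in Figure 8 (β=0.25). This is our illustration, not the authors' code; the names `evaluate`, `candidate_words`, and `truth_list` are assumptions.

```python
def evaluate(candidate_words: set, truth_list: set, beta: float = 0.25):
    """Precision (Eq. 8), recall (Eq. 9), and beta-weighted F-measure."""
    correct = candidate_words & truth_list          # correct words identified
    p = len(correct) / len(candidate_words)         # Eq. (8)
    r = len(correct) / len(truth_list)              # Eq. (9)
    # Weighted F-measure: beta < 1 emphasizes precision over recall.
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0
    return p, r, f
```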

We evaluated our results by comparing the candidate word lists that were extracted from the relevant documents using our algorithms with the judgments submitted by our human domain expert. Since not all words on the word lists are likely to be relevant, we varied how many of the top weighted words were used. We chose threshold values t from 0.1 to 1.0 corresponding to the percentage of top candidate words that are extracted (e.g., t=0.1 means that the top 10% of words are selected). We carried out 6 different tests corresponding to the four candidate lists, i.e., *L1*, *L2*, *L1N*, *L2N*, and two more cases, *L1+L2* (average of *L1* and *L2*) and *L1N+L2N* (average of *L1N* and *L2N*), as input to our algorithm. These tests are named by their list names *L1*, *L2*, *L1+L2*, *L1N*, *L2N* and *L1N+L2N*. Figure 8 presents the F-measures achieved by these tests using various threshold values.

**Figure 8.** F-measure of the tests in Group\_A (β=0.25).

The best result was achieved in the test L1N, using the highest weighted nouns extracted from individual documents. By analyzing the results, we find that the best performance is achieved with a threshold t=0.6, i.e., the top 60% of the words in the candidate list (277 words in total) are used. This threshold produced a precision of 88% and a recall of 58%, meaning that 167 words were added to the ontology, of which 147 were correct.
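The threshold step itself can be sketched as follows; a minimal illustration assuming a candidate list of (word, weight) pairs, with function and variable names of our own choosing:

```python
def select_top_candidates(weighted_words, t):
    """Keep the top t (0 < t <= 1.0) fraction of a weight-ranked candidate list.

    `weighted_words` is assumed to be a list of (word, weight) pairs.
    """
    ranked = sorted(weighted_words, key=lambda pair: pair[1], reverse=True)
    cutoff = int(len(ranked) * t)          # e.g., t=0.6 keeps the top 60%
    return [word for word, _ in ranked[:cutoff]]
```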

To confirm our results, we validated the best performing algorithm, *L1N* with a threshold of 0.6, using the 30 previously unused relevant documents in *Group\_B*. We applied the document-based selection algorithm using nouns only with a threshold value of 0.6. In this case, the achieved results are P = 77%, R = 58% and F-measure = 0.7. This shows that, although precision is a bit lower, overall the results are reproducible on a different document collection. In this case, 183 words were added to the ontology, of which 141 were correct.




| **Threshold** | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| **# candidate words** | 28 | 55 | 83 | 110 | 139 | 167 | 194 | 222 | 249 | 277 |
| **# words added** | 22 | 50 | 77 | 101 | 124 | 147 | 162 | 188 | 206 | 225 |

**Table 3.** Number of words that can be added.

Table 3 reports in more detail on the number of candidate words and how many correct words can be added to the ontology through the text mining process with the document-based selection and restricting our words to nouns only, i.e., the L1N test with threshold 0.6 on the validation documents, *Group\_B*. We also observe that the top words extracted using this technique are very relevant to the amphibian morphology domain; for example, the top 10 words are: frog, amphibian, yolk, medline, muscle, embryo, abstract, pallium, nerve, membrane.

### **4.3. Discussions**

The experimental results show that the ontology enrichment process can be used with new relevant vocabularies extracted from both of our approaches, text mining and lexical expansion. Overall, the text mining approach achieved a better result than the WordNet approach. Table 4 shows the performance comparison of these two approaches in their best case.


| **Measures** | **Lexical Expansion** | **Text Mining Approach** |
|---|---|---|
| Precision | 0.74 | 0.88 |
| Recall | 0.50 | 0.58 |
| F-Measure | 0.72 | 0.85 |
| # Candidate words | 155 | 167 |
| # Words mined correctly | 115 | 147 |
| # Words mined incorrectly | 40 | 20 |

**Table 4.** Comparison of two approaches in the best case.

In the text mining approach, we got the best results using a vector space approach with the document-based selection and restricting our words to nouns only. Overall, our algorithm produced good accuracy, over 81% for all cases. If we restrict our candidates to only the top-weighted candidates extracted from the documents, the precision is higher but the recall decreases. In the best case, where the F-measure is maximized, the precision is 88% on the test collection. Our algorithm was also validated with another dataset (i.e., the documents in *Group\_B*); the precision in this case decreases to 77%, which is still acceptable and does not significantly affect the number and quality of the relevant words extracted.

The results in the lexical analysis approach also show that our similarity computation method can provide an effective way to identify the correct sense for words in a given ontology. These senses then provide a source of new vocabulary that can be used to enrich the ontology. Our algorithm performed best, with a precision of 74%, using the #Depth+CE method. If we consider words with fewer senses, the accuracy of detecting correct words is higher. This level of accuracy was achieved using a reference hypernym tree with 110 hypernyms, built from only 14 manually disambiguated concept-words from the top two levels of our amphibian morphology ontology. From these words, we are able to identify a large collection of synonyms and hypernyms that are a good potential source for the ontology enrichment process.

## **5. Ontology Enrichment**


In previous sections, we have presented our main approaches to mining new relevant vocabulary to enrich a domain ontology from two main sources: (i) the WordNet database; and (ii) domain-relevant documents collected from the Internet. In this section we describe how we enrich the domain ontology by adding these newly-mined words to the appropriate concepts. Our main task in this phase is finding a method to correctly add a new candidate word to the vocabulary of the most appropriate ontology concept.

### **5.1. Lexical Expansion Approach**

As presented in Section 4.1, our lexical expansion approach identifies the correct WordNet sense for each concept-word and then considers that sense's synsets and hypernyms as new vocabulary for the associated concept. In the previous phase, we used the hypernym trees to identify the correct sense for each concept-word. For each correct sense, we add the synonym words (synsets) of that sense to the ontology concept vocabulary. Because we know which concept the synset is associated with, an advantage of this approach is that it is trivial to attach the new vocabulary to the correct concept.
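A minimal sketch of this expansion step, using NLTK's WordNet interface (the chapter does not specify its WordNet toolkit, and `sense_for_concept` is an assumed input produced by the disambiguation phase; NLTK's `wordnet` corpus must be downloaded first):

```python
from nltk.corpus import wordnet as wn

def expand_concept(sense_for_concept: dict) -> dict:
    """Map each ontology concept to new vocabulary from its chosen sense.

    `sense_for_concept` maps a concept name to a WordNet sense name,
    e.g. {"frog": "frog.n.01"}.
    """
    new_vocab = {}
    for concept, synset_name in sense_for_concept.items():
        sense = wn.synset(synset_name)
        words = set(sense.lemma_names())        # synonyms in the synset
        for hyper in sense.hypernyms():         # direct hypernyms
            words.update(hyper.lemma_names())
        new_vocab[concept] = sorted(w.replace('_', ' ') for w in words)
    return new_vocab

# Example: attach the synonyms/hypernyms of sense frog.n.01 to concept "frog".
# expand_concept({"frog": "frog.n.01"})
```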

From the 285 correct senses mined in the first phase, our lexical expansion mined a total of 352 new synonym words for the 155 concept-words that were returned. Many synonym words were duplicates, so we refined the candidate vocabulary by removing redundant words, leaving 321 new words to add to the ontology.

To evaluate the accuracy of this approach, we presented each of these 155 word-concept pairings to a human domain expert and asked them to validate whether or not the pairing is accurate. In their judgment, 115 words were relevant to the amphibian morphology domain, and their 260 synonym words were considered. We then refined these synonym words by removing duplicates. Thus, we ultimately added 231 words to the appropriate concepts, almost entirely automatically. The precision of adding correct synonym words to the ontology was 71.9%. The only manual step in this process is the filtering out of the incorrect word senses extracted from WordNet.


### **5.2. Text Mining Approach**

#### *5.2.1. Overview*

To reiterate our task, given a list of potential synonyms to be added to an ontology, we want to develop an algorithm to automatically identify the best matching concept for each candidate. Unlike the WordNet approach, this is a much more difficult task for words mined from the literature. The candidate words are domain relevant, but exactly where in the ontology do they belong? We again turn to the domain-relevant corpus of documents to determine the concept in the ontology to which these newly mined candidate words should be attached. The main differences between our approach and that of others are:


**•** The documents are used as the knowledge base of word relationships as well as the source of domain-specific vocabulary. Specifically, for each concept (and each candidate word), chunks of text are extracted around the word occurrences to create concept-word contexts and candidate word contexts. The words around each concept and/or candidate word occurrence thus contain words related to the semantics of the concept/candidate word.

**•** Word phrases can be handled. Instead of handling each word in a word phrase separately and then trying to combine the results, as we did with WordNet, we can process the word phrase as a whole. For each word phrase, we extract text chunks in which all of the words occur. Thus, the contexts are related to the concept-word as a whole, not just the individual component words.

The four main steps of our approach are:

**1.** Extract text chunks surrounding the concept-word(s) from each input document. Combine the text chunks extracted to create context files that represent the concept. Using the same method, create context files for each candidate word (i.e., potential synonym).

**2.** For each candidate, calculate the context-based similarity between the candidate context and each concept context.

**3.** Identify the most similar concepts using kNN (k nearest neighbors).

**4.** Assign the candidate words to the concept(s) with the highest similarity.


#### *5.2.2. Extracting concept-word contexts*

This process begins with a set of domain-related documents that are used as a knowledge base. From each preprocessed text file, we locate all candidate word occurrences and then create windows of plus/minus N words around each occurrence. Overlapping windows are then combined to create non-overlapping text chunks. Because varying numbers of overlapping windows may be combined to create the text chunks, the text chunks vary in size and in the number of word occurrences they contain. To illustrate the text chunk selection process, consider a single document representing the concept name "neural\_canal" with the query keywords "neural" and "canal".

#### *a. Step 1: Identify windows around each query keyword*

In one part of the document under consideration, with a window size of ±10, there are three occurrences of the terms "neural" and "canal" and thus three windows of size 21 with a keyword in the middle. We can see that Windows 1, 2, and 3 all contain keywords and they are overlapping.

Window 1: (middle keyword is "neural")


*morphy pronounced reduction process boulengerella lateristriga maculata hypothesized synapomor‐ phic species neural synapomorphy laterosensory canal system body majority characiforms completely developed posterior*

Window 2: (middle keyword is "canal")

*process boulengerella lateristriga maculata hypothesized synapomorphic species neural synapomor‐ phy laterosensory canal system body majority characiforms completely developed posterior neural lat‐ erosensory canal*

Window 3: (middle keyword is "neural")

*synapomorphy laterosensory canal system body majority characiforms completely developed posterior neural laterosensory canal system body nearly lateral line scales minimally pore*

#### *b. Step 2: Combine overlapping chunks of text*

In this step, we combine all overlapping windows to create the text chunks, as follows.

*morphy pronounced reduction process boulengerella lateristriga maculata hypothesized synapomor‐ phic species neural synapomorphy laterosensory canal system body majority characiforms completely developed posterior neural laterosensory canal system body nearly lateral line scales minimally pore*

#### *c. Step 3: Combining chunks of text across documents*

In this step, we simply combine the text chunks extracted for a given context from a document by appending them together.
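Steps 1-3 can be sketched as follows; a simplified Python illustration under our own assumptions (tokenization and preprocessing are elided, and the function name is ours, not the authors'):

```python
def extract_context(tokens, keywords, n=10):
    """Steps 1-3: windows of +/-n tokens around each keyword occurrence,
    merged where they overlap, then appended into one context chunk."""
    # Step 1: one (start, end) window per keyword occurrence (size 2n+1).
    spans = [(max(0, i - n), min(len(tokens), i + n + 1))
             for i, tok in enumerate(tokens) if tok in keywords]
    # Step 2: combine overlapping windows into non-overlapping chunks.
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Step 3: append the chunks together to form the document's context.
    return " ".join(" ".join(tokens[s:e]) for s, e in merged)
```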

#### *5.2.3. Generating Vectors for the Extracted Contexts*

After having extracted the text chunks surrounding keyword instances (i.e., concepts or synonyms), we have two sets of contexts: (1) **S**, representing candidate synonyms; and (2) **C**, representing concepts. In this step, we transform the contexts into vectors representing the keywords. We adopted the *tf\*idf* approach to index tokens from the context files and assign weight values to these tokens. Thus, all keywords can be represented by a series of features (i.e., weighted tokens) extracted from the relevant chunks. We have generated two types of vectors for each keyword, i.e., *individual vectors* and a *centroid vector*. An *individual vector* summarizes the contexts around a keyword within a single document, while a centroid vector combines the individual vectors to summarize all contexts in which a keyword appears.

#### *a. Individual Vector*

Let's take an example of a concept keyword Ck represented by 5 context files extracted from 5 relevant documents (i.e., N=5). This keyword will have 5 individual vectors, one representing each individual context file. We used the KeyConcept package [9] to index features and calculate their weights; each individual vector has the following format:

$$IND_i\_C_k = (feature_{i1}:weight_{i1},\; feature_{i2}:weight_{i2},\; \dots,\; feature_{im}:weight_{im}) \tag{10}$$

in which $weight_{ij} = rtf_{ij} \times idf_j$, where the relative term frequency $rtf_{ij}$ and the inverse document frequency $idf_j$ are respectively calculated by the following formulas:

$$rtf_{ij} = \frac{\#\,feature_j\_in\_context_i}{\#\,features\_in\_context_i} \tag{11}$$


$$idf_j = \log\left(\frac{\#\,documents\_in\_collection}{\#\,documents\_containing\_feature_j}\right) \tag{12}$$

In our experiments, the document collection we used is very specific to the amphibian morphology domain. To make the idf values fairer and more accurate, we adopted the idf dataset based on the ODP that we had used in our previous paper [19].

#### *b. Centroid Vectors*

The centroid vector of the concept keyword Ck is calculated from the individual vectors over the N relevant documents:

$$CEN\_C_k = \sum_{i=1}^{N} IND_i\_C_k \tag{13}$$
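A compact sketch of equations (10)-(13) follows. This is our own illustration; the authors used the KeyConcept package, whose internals are not shown here, and the `idf` mapping is assumed to be precomputed, e.g. from the ODP-based dataset mentioned above.

```python
import math
from collections import Counter

def idf_value(n_docs_in_collection, n_docs_containing_feature):
    """Eq. (12): inverse document frequency of one feature."""
    return math.log(n_docs_in_collection / n_docs_containing_feature)

def individual_vector(context_tokens, idf):
    """Eq. (10)-(11): one tf*idf vector for a keyword's context in one document.

    `idf` maps feature -> idf value; features absent from it get zero weight.
    """
    counts = Counter(context_tokens)
    total = len(context_tokens)
    return {feat: (n / total) * idf.get(feat, 0.0)   # rtf_ij * idf_j
            for feat, n in counts.items()}

def centroid_vector(individual_vectors):
    """Eq. (13): sum the individual vectors feature-by-feature."""
    centroid = Counter()
    for vec in individual_vectors:
        centroid.update(vec)
    return dict(centroid)
```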

#### *5.2.4. Similarity Computation and Ranking*

In the next sections, we present our methods to calculate the similarity between the two sets, i.e., (1) S, representing candidate synonyms, and (2) C, representing concepts, to create and rank a list of synonym-concept assignments.

Since each context is essentially a collection of text, we use the classic cosine similarity metric from the vector space model to measure the semantic relatedness of two contexts. For each context Sm representing a candidate synonym and Cn representing a concept, the similarity of each pair (Sm, Cn) is the cosine similarity value calculated by the following formula:


$$sim(S_m, C_n) = \frac{\sum_{i=1}^{t} w_{i,m} \times w_{i,n}}{\sqrt{\sum_{i=1}^{t} w_{i,m}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,n}^2}} \tag{14}$$

where


$w_{i,m}$: weight of the feature $i$ in the vector representing $S_m$

$w_{i,n}$: weight of the feature $i$ in the vector representing $C_n$

$t$: total number of features

Note that these similarity values are normalized by the size of the context.
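For concreteness, a sketch of equation (14) over the sparse feature-to-weight dictionaries built above; illustrative code, not taken from the chapter:

```python
import math

def cosine_similarity(vec_s, vec_c):
    """Eq. (14): cosine similarity of two sparse feature->weight vectors."""
    shared = set(vec_s) & set(vec_c)
    dot = sum(vec_s[f] * vec_c[f] for f in shared)
    norm_s = math.sqrt(sum(w * w for w in vec_s.values()))
    norm_c = math.sqrt(sum(w * w for w in vec_c.values()))
    return dot / (norm_s * norm_c) if norm_s and norm_c else 0.0
```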

Since each keyword is represented either by N individual vectors (i.e., N context files) or by a centroid vector over the N context files, we propose four different methods to calculate the similarities between candidate synonym vectors and concept vectors:

**•** *Indi2Indi*: calculates the similarity value between each individual vector of a synonym keyword in Sm and each individual vector of the concept keyword in Cn.

**•** *Cent2Indi*: calculates the similarity value between the centroid vector of a synonym keyword in Sm and each individual vector of the concept keyword in Cn.

**•** *Indi2Cent*: calculates the similarity value between each individual vector of a synonym keyword in Sm and a centroid vector of the concept keyword in Cn.

**•** *Cent2Cent*: calculates the similarity value between the centroid vector of a synonym keyword in Sm and a centroid vector of the concept keyword in Cn.


Finally, each candidate synonym is assigned to the top concepts whose contexts Cn have the highest similarity values to the context for Sm.

Once the similarity between each word pair (Sm, Cn) is calculated, we use the kNN method to rank the similarity computation results. A higher similarity means the candidate synonym is closer to the target concept. For each candidate word, we take the top k items from the similarity-sorted list. Then, we accumulate the weights of pairs having the same concept names. The final ranked matching list is determined by the accumulated weights. We then pick the pairs having the highest accumulated weight between Sm and Cn as the potential concepts to which the candidate synonym is added.
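The ranking step might look like this; a sketch under our reading of the kNN aggregation, with illustrative names (the default k=30 reflects the best case reported below):

```python
from collections import defaultdict

def rank_concepts(similarities, k=30):
    """kNN-style ranking for one candidate synonym.

    `similarities` is a list of (concept_name, similarity) pairs, e.g. one
    entry per individual-vector comparison in the Indi2Indi method. Take the
    top-k pairs, accumulate similarity per concept, and rank concepts by
    accumulated weight.
    """
    top_k = sorted(similarities, key=lambda p: p[1], reverse=True)[:k]
    accumulated = defaultdict(float)
    for concept, sim in top_k:
        accumulated[concept] += sim
    return sorted(accumulated.items(), key=lambda p: p[1], reverse=True)
```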

#### *5.2.5. Experiments*

#### *a. Baseline: Existing WordNet-Based Algorithms*

We evaluated four WordNet-based algorithms that have been reported to produce high accuracy, i.e., Resnik [27], Jiang and Conrath [13], Lin [16], and Pirro and Seco [26], using all 191 test synonyms in our dataset.


| **Algorithm** | | **Top1** | **Top2** | **Top3** | **Top4** | **Top5** | **Top6** | **Top7** | **Top8** | **Top9** | **Top10** |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LIN | # correct pairs identified | 1 | 4 | 4 | 9 | 11 | 12 | 13 | 13 | 15 | 15 |
| | Precision | 0.52% | 2.09% | 2.09% | 4.71% | 5.76% | 6.28% | 6.81% | 6.81% | 7.85% | 7.85% |
| JIANG | # correct pairs identified | 1 | 2 | 3 | 7 | 7 | 9 | 10 | 11 | 11 | 11 |
| | Precision | 0.52% | 1.05% | 1.57% | 3.66% | 3.66% | 4.71% | 5.24% | 5.76% | 5.76% | 5.76% |
| **RESNIK** | # correct pairs identified | 2 | 9 | 9 | 10 | 15 | 16 | 16 | 16 | 17 | 17 |
| | Precision | **1.05%** | **4.71%** | **4.71%** | **5.24%** | **7.85%** | **8.38%** | **8.38%** | **8.38%** | **8.90%** | **8.90%** |
| PIRRO | # correct pairs identified | 1 | 3 | 4 | 5 | 9 | 10 | 10 | 12 | 13 | 13 |
| | Precision | 0.52% | 1.57% | 2.09% | 2.62% | 4.71% | 5.24% | 5.24% | 6.28% | 6.81% | 6.81% |

**Table 5.** Evaluation of WordNet-based algorithms.

Our baseline approach using WordNet was complicated by the fact that WordNet does not contain word phrases, which are very common in the concepts and synonyms in our ontology. We separate each word phrase (e.g., neural\_canal) into single words, e.g., neural and canal, then submit each to WordNet to calculate the similarity. Then, we add together the individual word similarity scores to produce the total similarity value for each phrase. For each of the 191 synonyms in the test set, the similarity between that synonym and each of the 530 concept names was calculated, and then the concepts were ranked in decreasing order by similarity. Table 5 presents the accuracy for each method at different cutoff points in the list.
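As a sketch of this baseline, here is our illustration using NLTK's WordNet interface with the Brown information-content file for Resnik similarity; the chapter does not name its implementation, and taking the maximum over noun-sense pairs before summing over component-word pairs is our reading of the summation, not a confirmed detail:

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # information content for Resnik

def word_similarity(word_a, word_b):
    """Max Resnik similarity over all noun-sense pairs of two single words."""
    best = 0.0
    for sa in wn.synsets(word_a, pos=wn.NOUN):
        for sb in wn.synsets(word_b, pos=wn.NOUN):
            best = max(best, sa.res_similarity(sb, brown_ic))
    return best

def phrase_similarity(phrase_a, phrase_b):
    """Split phrases like 'neural_canal' and sum component-word similarities."""
    return sum(word_similarity(a, b)
               for a in phrase_a.split('_')
               for b in phrase_b.split('_'))
```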


From Table 5, we can see that Resnik's algorithm works best with this test set, although it is only 1.05% accurate at the rank-1 location, and only 8.9% of the time did the correct concept appear within the top 10 ranked concepts. The other approaches are not as accurate.

#### *b. Context-Based Algorithms*

We evaluated the results by comparing the list of top-ranked concepts for each synonym produced using our context-based method with the truth list we extracted from the ontology. The performance is reported over varying text chunk sizes (or text window sizes, *wsize*), from 4 to 40 tokens. Due to space limits, we only report here the best case achieved for each method (c.f. Figure 9). We also evaluated different values of k, from 5 to 30, in the kNN method to determine the k value that returned the best result. The values of k reported in each chart differ since the number of results generated depends on the number of vectors compared by each method. For example, when we measure the similarity between a synonym and a concept name, the *Indi2Indi* method uses 5 individual vectors for each synonym and concept, but we have only one vector for each keyword in the *Cent2Cent* method.

**Figure 9.** Result of the context-based similarity computation methods.

By analyzing the results, we find that the best performance is achieved by the Indi2Indi method with a window size (wsize) of 4 and a k value of 30 (c.f. Figure 9). The other methods performed slightly less well than Indi2Indi. When the window size is increased, the performance is not improved.

#### *c. Comparison with Baseline*


From the above charts (c.f. Figure 9), we found that the context-based method works better with small text window sizes (i.e., wsize = 4 or 6). We evaluated the same reported pairs of synonym and concept-words with our text mining methods and the above WordNet-based algorithms. Figure 10 shows that our context-based similarity computation methods consistently outperform the best WordNet-based algorithm. Even the less promising *Cent2Cent* method produces better performance than the best case of the WordNet-based similarity algorithms (i.e., RESNIK's algorithm).

The experimental results support our belief that we might not need to rely on the WordNet database to determine similar words. The WordNet-based algorithms did not provide high-precision results since many words are compound and/or do not exist in the WordNet database.


**Table 6.** Comparison of different approaches on number of words added and accuracy.

documents are submitted for information extraction using text mining.

ogy and the quality of those additions.

construct the reference hypernym tree.

## **6. Conclusions**
