
Ontology Learning Using Word Net Lexical Expansion and Text Mining

http://dx.doi.org/10.5772/51141


It is crucial to have a filtering stage to remove documents that are irrelevant or only slightly relevant to the amphibian ontology. We adopted an SVM-based classification technique trained on 60 relevant and 60 irrelevant documents collected from the Web. In earlier experiments, our focused search combined with the SVM classification was able to collect new documents and correctly identify those related to the domain with an average accuracy of 77.5% [17][19].
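The chapter does not give implementation details for this filter; the following is a minimal sketch of an SVM-based relevance classifier, assuming scikit-learn and a TF-IDF document representation (both are our assumptions, and the tiny training set stands in for the 60 relevant and 60 irrelevant documents).

```python
# Sketch of an SVM-based relevance filter for crawled documents.
# scikit-learn, TF-IDF features, and the training texts below are
# illustrative assumptions; the chapter does not name the toolkit used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-in for the 60 relevant + 60 irrelevant training documents.
train_docs = [
    "frog tadpole larval morphology limb development",   # relevant
    "amphibian skeleton cartilage bone morphology",      # relevant
    "stock market earnings quarterly report",            # irrelevant
    "football match score league standings",             # irrelevant
]
train_labels = [1, 1, 0, 0]  # 1 = domain-relevant

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)

# Keep only crawled pages classified as relevant.
new_docs = ["tadpole limb cartilage development", "quarterly stock earnings"]
keep = [d for d, y in zip(new_docs, clf.predict(new_docs)) if y == 1]
```

In a real pipeline the classifier's output would feed the retained pages on to the text mining stage.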

Ultimately, the papers collected and filtered by the topic-specific spider will be automatically fed into the text mining software (with an optional human review in between). However, to evaluate the effectiveness of the text mining independently, without noise introduced by some potentially irrelevant documents, we ran our experiments using 60 documents manually judged as relevant, separated into two groups of 30, i.e., *Group\_A* and *Group\_B*. All these documents were preprocessed to remove HTML code, stop words and punctuation. We ran experiments to tune our algorithms using *Group\_A* and then validated our results using *Group\_B*.
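The preprocessing step above (removing HTML code, stop words, and punctuation) can be sketched as follows; the stop-word list is an illustrative subset, since the chapter does not say which list was used.

```python
# Sketch of the preprocessing described above: strip HTML tags,
# punctuation, and stop words. The stop-word set is illustrative.
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}

def preprocess(html: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", html)          # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text.lower())  # remove punctuation
    return [w for w in text.split() if w not in STOP_WORDS]

tokens = preprocess("<p>The morphology of the frog skeleton is complex.</p>")
# tokens -> ['morphology', 'frog', 'skeleton', 'complex']
```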

For the document-based approach, we took the top 30 words from each document and merged them. This created a list of 623 unique words (*L1*). To be consistent, we also selected the top 623 words produced by the corpus-based approach (*L2*). We merged these two lists and removed duplicates, which resulted in a list of 866 unique words that were submitted for human judgment. Based on our expert judgment, 507 of the total words were domain-relevant and 359 were considered irrelevant. These judgments were then used to evaluate the 6 techniques, i.e., *L1*, *L2*, *L1+L2*, *L1N*, *L2N* and *L1N+L2N*.
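The merge-and-deduplicate step can be sketched as a set union; the tiny word lists below are illustrative, while the chapter's actual counts are 623 (*L1*), 623 (*L2*), and 866 after removing duplicates.

```python
# Sketch of merging L1 and L2 into one deduplicated candidate list.
# The word lists are illustrative stand-ins for the real 623-word lists.
L1 = ["larva", "femur", "skull", "tadpole"]        # document-based top words
L2 = ["femur", "skull", "cartilage", "vertebra"]   # corpus-based top words

merged = sorted(set(L1) | set(L2))  # set union removes duplicates
# 4 + 4 words with 2 shared -> 6 unique words submitted for judgment
```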

When we selected only the nouns from L1 and L2, this resulted in lists L1N and L2N that contained 277 and 300 words, respectively. When these were merged and duplicates were removed, 375 words were submitted for judgment, of which 253 were judged relevant and 122 were judged irrelevant to the amphibian morphology domain.
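The noun-selection step can be sketched as follows. A real system would use a part-of-speech tagger here; the hand-made tag table is only a stand-in to keep the example self-contained, and the words and tags are illustrative.

```python
# Sketch of noun filtering (L1 -> L1N). A real implementation would run
# a POS tagger; this tag table is a hand-made, illustrative stand-in.
POS = {
    "skull": "NN", "ossified": "VBN", "femur": "NN",
    "anterior": "JJ", "cartilage": "NN",
}

L1 = ["skull", "ossified", "femur", "anterior", "cartilage"]
L1N = [w for w in L1 if POS.get(w, "").startswith("NN")]  # keep nouns only
# L1N -> ['skull', 'femur', 'cartilage']
```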

#### *b. Evaluation Measures*

In the text mining approach, we define:

**•** *Precision (P):* measures the percentage of the candidate words identified by our algorithm that are correct (i.e., that appear on the truth list).

$$P = \frac{\#\_correct\_words\_identified}{\#\_candidate\_words} \tag{8}$$

**•** *Recall (R)*: measures the percentage of the truth-list words that were correctly identified by our algorithm.


$$R = \frac{\#\_correct\_words\_identified}{\#\_truth\_list\_words} \tag{9}$$

Based on these two measures, we also calculate the F-measure, as in Equation (3), to evaluate the methods' performance.
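Under the definitions in Equations (8) and (9), together with the standard F-measure, the evaluation can be sketched as follows; the word sets are illustrative, not the chapter's data.

```python
# Precision, recall, and F-measure over word sets, following Eqs. (8)-(9).
def evaluate(candidates: set[str], truth: set[str]) -> tuple[float, float, float]:
    correct = len(candidates & truth)          # correctly identified words
    p = correct / len(candidates)              # Eq. (8): precision
    r = correct / len(truth)                   # Eq. (9): recall
    f = 2 * p * r / (p + r) if p + r else 0.0  # F-measure
    return p, r, f

p, r, f = evaluate({"skull", "femur", "market"},
                   {"skull", "femur", "cartilage", "larva"})
# p = 2/3, r = 1/2, f = 4/7
```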

#### *4.2.5. Results*

which the top ten results are requested. This results in a maximum of 600 documents to process. However, because some search sites return fewer than ten results for some queries and because duplicate documents are returned by search engines, we perform syntactic filtering.

