



Ontology Learning Using WordNet Lexical Expansion and Text Mining

http://dx.doi.org/10.5772/51141

**Figure 10.** Comparison of our context-based methods vs. WordNet-based algorithms

#### **5.3. Discussion**

Our context-based approach was evaluated using synonyms from the amphibian ontology itself as our truth dataset. The experimental results show that our text mining approaches outperform the common similarity calculation algorithms based on WordNet in terms of correctly adding new words to the ontology concepts. Our best algorithm, Indi2Indi, with a window size of 4 and k=30 for kNN, selected the correct concept within the top 10 concepts 19.37% of the time. In comparison, the best WordNet-based algorithm only achieved 8.9% on the same task.
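The top-10 measure used above can be made concrete with a small sketch. The concept names, rankings, and helper function below are illustrative stand-ins, not the paper's actual implementation: a new word's candidate concepts are ranked by similarity, and the method scores a hit when the correct concept appears in the top *k*.

```python
# Sketch of top-k accuracy: for each new word, check whether its correct
# concept appears among the k highest-ranked candidate concepts.
# All names and data here are toy examples.

def top_k_accuracy(rankings, gold, k=10):
    """rankings: word -> list of candidate concepts, best first.
    gold: word -> correct concept."""
    hits = sum(1 for word, ranked in rankings.items()
               if gold.get(word) in ranked[:k])
    return hits / len(rankings)

# Toy example: 2 of 3 new words rank their correct concept in the top 10.
rankings = {
    "tadpole": ["larva", "frog", "egg"],
    "femur": ["bone", "limb"],
    "keratin": ["skin", "gland"],
}
gold = {"tadpole": "larva", "femur": "limb", "keratin": "tooth"}
print(top_k_accuracy(rankings, gold, k=10))  # 2/3
```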

Overall, using WordNet similarity algorithms to assign new vocabulary to an existing ontology might not be an effective solution since it depends on the WordNet database. One of the main drawbacks of the WordNet database is the unavailability of many words, particularly for narrowly-defined domains. Another problem with WordNet-based algorithms is that, because WordNet is so broad in scope, an extra step is needed for semantic disambiguation that can further decrease accuracy. In contrast, the text mining approach can be applied to any domain and can use information from the Internet to mine new relevant vocabulary.

Compared with the lexical expansion approach, in terms of correctly assigning new words to specific ontology concepts, both context-based and WordNet-based text mining algorithms have lower precision (19.37% and 8.9% vs. 71.9%). This difference is to be expected because, with the lexical expansion approach, we already know the concepts to which the new words are to be attached. With that process, we start with a concept and try to find relevant domain-specific words. In contrast, the text mining approach starts with domain-relevant words and then tries to find matching concepts.


| | Lexical expansion | WordNet-based algorithms | Context-based algorithms |
|---|---|---|---|
| Total words added | 321 | 191 | 191 |
| Correct words added | 231 | 17 | 37 |
| Precision | 71.9% | 8.9% | 19.3% |

**Table 6.** Comparison of different approaches on number of words added and accuracy.

Table 6 summarizes the results for the various approaches. It is clear that, in spite of the difficulty in finding domain-specific words in WordNet, the lexical expansion approach achieves the best ontology enrichment, both in terms of the number of words added to the ontology and the quality of those additions.
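The precision column in Table 6 is simply correct additions divided by total additions; recomputing it from the reported counts confirms the figures (the reported percentages are truncated to one or two decimal places):

```python
# Recompute Table 6 precision from the reported counts.
counts = {  # approach: (total words added, correct words added)
    "lexical expansion": (321, 231),
    "WordNet-based":     (191, 17),
    "context-based":     (191, 37),
}
precision = {name: correct / total for name, (total, correct) in counts.items()}
for name, p in precision.items():
    print(f"{name}: {p:.3%}")
# lexical expansion: 71.963%, WordNet-based: 8.901%, context-based: 19.372%
```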

## **6. Conclusions**


The goal of this research study is to implement and validate an ontology learning framework to enrich the vocabulary of a domain ontology from Web documents related to the domain and from the WordNet semantic lexicon. We have presented two approaches, i.e., lexical expansion using WordNet and text mining, to perform these tasks. In the first approach, we built a reference hypernym tree by manually disambiguating the top two levels of concepts from the ontology and compared three similarity computation methods that allow us to disambiguate the other concept-words in the ontology using WordNet. For the correct sense, we then enriched the corresponding concept with its synonyms and hypernyms. In our second approach, we implemented a focused crawler, incorporating an SVM-based filter, that retrieved documents in the domain of amphibian morphology. The most relevant documents were then submitted for information extraction using text mining.

Our approaches were empirically tested based on the seed amphibian ontology with WordNet lexical synsets and with retrieved Web documents. While both approaches performed well, the text mining approach had higher precision for extracting relevant vocabulary to enrich the ontology. However, that approach had greater difficulty attaching the new vocabulary to the correct concept. Considering the ultimate utility of these two approaches, the text mining approach depends on the number and quality of the documents collected by the topic-specific crawler. In addition, although it extracts good words, these words are not matched with particular concepts within the ontology. A further pairing process, most likely involving WordNet, is needed to complete the ontology enrichment process. In contrast, the lexical expansion approach using WordNet depends only on the concept-words in the ontology itself. It also extracts words from WordNet on a concept-by-concept basis, so no extra process is required to match new words with concepts. However, it does suffer from inaccuracies when the incorrect senses of words are used for expansion. It also requires a small amount of manual effort to disambiguate a few concept-words in order to construct the reference hypernym tree.

In the future, we hope to combine these two approaches to exploit the strengths of each. For example, we can use WordNet to pair the text-mined words with concepts, and use the documents to identify and disambiguate the multiple senses of the concept-words found in WordNet. In order to mine more vocabulary for the ontology, we will deal with the concept-words that currently do not appear in WordNet due to the narrowness of the domain. Our other main task is to validate our approach on ontologies from other domains, to confirm that it is domain-independent. Finally, we need to incorporate the results of this work into a complete system to automatically enrich our ontology.


## **Acknowledgements**

This research is partially supported by the NSF grant DBI-0445752: Semi-Automated Construction of an Ontology for Amphibian Morphology.

## **Author details**

Hiep Luong\*, Susan Gauch and Qiang Wang

\*Address all correspondence to: hluong@uark.edu

Department of Computer Science and Computer Engineering, University of Arkansas, U.S.A.



**Chapter 6**


> © 2012 Nanba et al., licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


## **Automatic Compilation of Travel Information from Texts: A Survey**

Hidetsugu Nanba, Aya Ishino and Toshiyuki Takezawa

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51290

## **1. Introduction**

Travel guidebooks and portal sites provided by tour companies and governmental tourist boards are useful sources of information about travel. However, it is costly and time-consuming to compile travel information for all tourist spots and to keep these data up-to-date manually. Recently, research about services for the automatic compilation and recommendation of travel information has been increasing in various research communities, such as natural language processing, image processing, Web mining, geographic information systems (GISs), and human interfaces. In this chapter, we overview the state of the art of the research and several related services in this field. We especially focus on research in natural language processing, including text mining.

The remainder of this chapter is organized as follows. Section 2 explains the automatic construction of databases for travel. Section 3 describes analysis of travelers' behavior. Section 4 introduces several studies about recommending travel information. Section 5 shows interfaces for travel information access. Section 6 lists several linguistic resources. Finally, we provide our conclusions and offer future directions in Section 7.

## **2. Automatic construction of databases for travel**

In this section, we describe several studies about constructing databases for travel. In Section 2.1, we introduce a study that identified travel blog entries in a blog database. In Section 2.2, we describe several methods to construct databases for travel by extracting travel information, such as tourist spots or local products, from travel blog entries using information extraction techniques. In Section 2.3, we explain a method that constructs travel links automatically.


## **2.1. Automatic identification of travel blog entries**

Travel blogs<sup>1</sup> are defined as travel journals written by bloggers in diary form. Travel blogs are considered useful for obtaining travel information, because many bloggers' travel experiences are written in this form.

There are various portal sites for travel blogs, which we will describe in Section 6. At these sites, travel blogs are manually registered by bloggers themselves, and the blogs are classified according to travel destination. However, there are many more travel blogs in the blogosphere, beyond these portal sites. In an attempt to construct an exhaustive database of travel blogs, Nanba et al. [25] identified travel blog entries written in Japanese in a blog database.<sup>2</sup>

Blog entries that contain cue phrases, such as "travel", "sightseeing", or "tour", have a high probability of being travel blogs. However, not every travel blog contains such cue phrases. For example, if a blogger describes his/her journey to Norway in multiple blog entries, the blog might state "We traveled to Norway" in the first entry, while only writing "We ate wild sheep!" in the second entry. In this case, because the second entry does not contain any expressions related to travel, it is difficult to identify it as a travel blog entry. Therefore, Nanba et al. focused not only on each blog entry but also on the surrounding entries for the identification of travel blog entries. They formulated the identification of travel blog entries as a sequence-labeling problem, and solved it using machine learning. For the machine learning method, they examined the Conditional Random Fields (CRF) method [20], whose empirical success has been reported recently in the field of natural language processing. The CRF-based method identifies the tag<sup>3</sup> of each entry. Features and tags are given in the CRF method as follows: (1) *k* tags occur before a target entry; (2) *k* features occur before a target entry; and (3) *k* features follow a target entry (see Figure 1). They used the value of *k* = 4, which was determined in a pilot study. They used the following features for machine learning: whether an entry contains any of 416 cue phrases (Japanese terms for "travel", "tour", "departure", and so on), and the number of location names in each entry.
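The windowed features described above can be sketched in plain Python. The cue list, entries, and feature names below are toy stand-ins (the real system uses 416 Japanese cue phrases and a CRF learner over these feature sequences):

```python
# Sketch of per-entry features plus a +/- k window of surrounding-entry
# features, as in the sequence-labeling formulation above. Toy data only.

CUES = {"travel", "tour", "departure", "sightseeing"}

def entry_features(text, locations):
    return {
        "has_cue": any(cue in text.lower() for cue in CUES),
        "n_locations": sum(text.count(loc) for loc in locations),
    }

def sequence_features(entries, locations, k=4):
    base = [entry_features(e, locations) for e in entries]
    seq = []
    for i, feats in enumerate(base):
        f = dict(feats)
        for j in range(1, k + 1):  # attach features of surrounding entries
            if i - j >= 0:
                f[f"prev{j}_has_cue"] = base[i - j]["has_cue"]
            if i + j < len(base):
                f[f"next{j}_has_cue"] = base[i + j]["has_cue"]
        seq.append(f)
    return seq

entries = ["We traveled to Norway.", "We ate wild sheep!"]
feats = sequence_features(entries, locations=["Norway"])
# The second entry has no cue phrase itself, but inherits
# "prev1_has_cue": True, which is what lets a sequence model
# still label it as a travel blog entry.
```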

**Figure 1.** Features and tags used in CRF

**Figure 2.** Travel blog entries plotted on a Google map


Using the above method, Nanba et al. identified 17,268 travel blog entries from 1,100,000 blog entries, and constructed a system that plotted travel blog entries on a Google map (see Figure 2).<sup>4</sup> In this figure, travel blog entries are shown as icons. If the user clicks an icon, the corresponding blog entry is shown in a pop-up window.

## **2.2. Automatic extraction of travel information from texts**

Nakatoh et al. [24] proposed a method for extracting names of local culinary dishes from travel blogs written in Japanese, which were identified when the blog entry included both the name of a sightseeing destination and the word "tourism". They extracted local dishes by gathering nouns that depend on the Japanese verb for "eat". Tsai and Chou [32] also proposed a method for extracting dish names from restaurant review blogs written in Chinese using a machine learning (CRF) technique.

<sup>1</sup> We use the term *travel blog*. Other studies use the term "Travelogues" [10], indicating social networking service (SNS) content, blogs, reviews, message boards, and so on, for travel.

<sup>2</sup> Although Nanba et al. identified Japanese travel blogs, their method can be applied to blogs written in other languages, if cue phrases for the language are prepared.

<sup>3</sup> In this case, the tag indicates whether each entry is a travel blog entry or not.

<sup>4</sup> http://www.ls.info.hiroshima-cu.ac.jp/test/travel-map/xml-travelmap.html


**Figure 1.** Features and tags used in CRF

2 Text Mining

database.2

are written in this form.

**2.1. Automatic identification of travel blog entries**

corresponding blog entry is shown in a pop-up window.

Chinese using a machine learning (CRF) technique.

content, blogs, reviews, message boards, and so on, for travel.

<sup>3</sup> In this case, the tag indicates whether each entry is a travel blog entry or not. <sup>4</sup> http://www.ls.info.hiroshima-cu.ac.jp/test/travel-map/xml-travelmap.html

languages, if cue phrases for the language are prepared.

**2.2. Automatic extraction of travel information from texts**

Travel blogs1 are defined as travel journals written by bloggers in diary form. Travel blogs are considered useful for obtaining travel information, because many bloggers' travel experiences

There are various portal sites for travel blogs, which we will describe in Section 6. At these sites, travel blogs are manually registered by bloggers themselves, and the blogs are classified according to travel destination. However, there are many more travel blogs in the blogosphere, beyond these portal sites. In an attempt to construct an exhaustive database of travel blogs, Nanba et al. [25] identified travel blog entries written in Japanese in a blog

Blog entries that contain cue phrases, such as "travel", "sightseeing", or "tour", have a high degree of probability of being travel blogs. However, not every travel blog contains such cue phrases. For example, if a blogger describes his/her journey to Norway in multiple blog entries, the blog might state "We traveled to Norway" in the first entry, while only writing "We ate wild sheep!" in the second entry. In this case, because the second entry does not contain any expressions related to travel, it is difficult to identify it as a travel blog entry. Therefore, Nanba et al. focused not only on each blog entry but also on the surrounding entries for the identification of travel blog entries. They formulated the identification of travel blog entries as a sequence-labeling problem, and solved it using machine learning. For the machine learning method, they examined the Conditional Random Fields (CRF) method [20]; its empirical success has been reported recently in the field of natural language processing. The CRF-based method identifies the tag <sup>3</sup> of each entry. Features and tags are given in the CRF method as follows: (1) *k* tags occur before a target entry; (2) *k* features occur before a target entry; and (3) *k* features follow a target entry (see Figure 1). They used the value of *k* = 4, which was determined in a pilot study. Here, they used the following features for machine learning: whether an entry contains any of 416 cue phrases, such as " (travel)", " (tour)", and " (departure)", and the number of location names in each entry. Using the above method, Nanba et al. identified 17,268 travel blog entries from 1,100,000 blog entries, and constructed a system that plotted travel blog entries on a Google map (see Figure 2).4 In this figure, travel blog entries are shown as icons. If the user clicks an icon, the

Nakatoh et al. [24] proposed a method for extracting names of local culinary dishes from travel blogs written in Japanese, which were identified when the blog entry included both the name of a sightseeing destination and the word "tourism". They extracted local dishes by gathering nouns that depend on the verb "eat". Tsai and Chou [32] also proposed a method for extracting dish names from restaurant review blogs written in Chinese.

<sup>1</sup> We use the term *travel blog*. Other studies use the term "Travelogues" [10], indicating social networking service (SNS)

<sup>2</sup> Although Nanba et al. identified Japanese travel blogs, their method can be applied to blogs written in other

**Figure 2.** Travel blog entries plotted on a Google map

In the following, we explain the details of the bootstrapping-based and machine learning-based information extraction approaches based on Nanba's work [25]. Nanba et al. extracted pairs comprising a location name and a local product from travel blogs written in Japanese, which were identified using the method described in Section 2.1. For the efficient extraction of travel information, they employed a bootstrapping method.

First, they prepared 482 pairs as seeds for the bootstrapping. These pairs were obtained automatically from the "Web Japanese N-gram" database provided by Google, Inc. The database comprises *N*-grams (*N* = 1-7) extracted from 20 billion Japanese sentences on the Web. They applied a pattern consisting of a location-name slot, the Japanese word for "local product", and a local-product slot to the database, and extracted location names and local products from the corresponding slots, thereby obtaining the 482 pairs.
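Slot-pattern extraction over an n-gram collection can be sketched as below. This is a hedged illustration: the English marker word "speciality" stands in for the Japanese "local product" marker, and the gazetteer and example n-grams are invented, not the authors' data.

```python
import re

# Illustrative stand-in for the slot pattern applied to the n-gram data:
# "<location name> <marker word> <local product>". The marker word
# "speciality" and the location gazetteer are assumptions for the sketch.
LOCATIONS = {"Hiroshima", "Kyoto"}
PATTERN = re.compile(r"(\w+) speciality (\w+)")

def extract_pairs(ngrams: list[str]) -> set[tuple[str, str]]:
    """Keep (location, product) pairs whose first slot is a known location."""
    pairs = set()
    for gram in ngrams:
        for loc, product in PATTERN.findall(gram):
            if loc in LOCATIONS:
                pairs.add((loc, product))
    return pairs

pairs = extract_pairs(["Hiroshima speciality oysters", "big speciality sale"])
```

Filtering the first slot against a location gazetteer is what keeps accidental matches (here, "big speciality sale") out of the seed set.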

Second, they applied a machine learning-based information extraction technique to the travel blogs identified in the previous step, and obtained new pairs. In this step, they prepared training data for the machine learning in the following three steps.

1. Select 200 sentences that contain both a location name and a local product from the 482 pairs. Then automatically create 200 tagged sentences, to which both "location" and "product" tags are assigned.<sup>5</sup>

2. Prepare another 200 sentences that contain only a location name. Then create 200 tagged sentences, to which the "location" tag is assigned.

3. Apply machine learning to the 400 tagged sentences, and obtain a system that automatically allocates "location" and "product" tags to given sentences.

As a machine learning method, they used CRF. The CRF-based method identifies the class of each word in a given sentence. Features and tags are given in the CRF method as follows: (1) *k* tags occur before a target word; (2) *k* features occur before a target word; and (3) *k* features follow a target word. They used the value of *k* = 2, which was determined in a pilot study. They used the following six features for machine learning.

• Word.

• The part of speech to which the word belongs (noun, verb, adjective, etc.).

• Whether the word is a quotation mark.

• Whether the word is a surface case.

• Whether the word is a cue word, such as "local product", "famous confection", or "souvenir".

• Whether the word is frequently used in the names of local products or souvenirs, such as "cake" or "noodle".

<sup>5</sup> Here, a location name corresponds to only a local product in each sentence.

## **2.3. Automatic compilation of travel links**

Collections of Web links are useful information sources. However, maintaining these collections manually is costly. Therefore, an automatic method for compiling collections of Web links is required. In this section, we introduce a method that compiles travel links automatically.

From travel blog entries, which were automatically identified using the method mentioned in Section 2.1, Ishino et al. [15] extracted the hyperlinks to useful Web sites for a tourist spot included by bloggers, and thereby constructed collections of hyperlinks for tourist spots. The procedure for classifying links in travel blog entries is as follows.

1. Input a travel blog entry.

2. Extract a hyperlink and any surrounding sentences that mention the link (a citing area).

3. Classify the link by taking account of the information in the citing area.

They classified link types into the following four categories.

• S (Spot): The information is about tourist spots.

• H (Hotel): The information is about accommodation.

• R (Restaurant): The information is about restaurants.

• O (Other): Other than types S, H, and R.

A hyperlink may be classified as more than one type. For example, a hyperlink to the Chinese noodle museum (http://www.raumen.co.jp/home/) was classified as types S and R, because the visitors to this museum can learn the history of Chinese noodles in addition to eating them.

For the classification of link types, they employed a machine learning technique using the following features.

• A word.

• Whether the word is a cue phrase, detailed as follows, where the number shown for each cue group represents the number of cues.

| Cue phrase | The number of cues |
| --- | --- |
| A list of tourist spots, collected from Wikipedia. | 17,371 |
| Words frequently used in the name of tourist spots, such as "zoo" or "museum". | 138 |
| Words related to sightseeing, such as "sightseeing" or "stroll". | 172 |
| Other words. | 131 |

**Table 1.** Cues for type S

| Cue phrase | The number of cues |
| --- | --- |
| Words that are frequently used in the name of hotels, such as "hotel" or "Japanese inn". | 9 |
| Component words for accommodations, such as "front desk" or "guest room". | 29 |
| Words that are frequently used when tourists stay in accommodation, such as "stay" or "check in". | 14 |
| Other words. | 21 |

**Table 2.** Cues for type H
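The word-level features used for location/product tagging can be sketched as a feature function. All lookup tables below are placeholders, not the authors' resources: a real system would use a morphological analyzer for part of speech and Japanese case particles rather than the toy dictionaries shown here.

```python
# Sketch of word-level CRF features for location/product tagging.
# Cue lists, the toy POS lookup, and romanized case particles are
# illustrative placeholders.
CUE_WORDS = {"local product", "famous confection", "souvenir"}
PRODUCT_WORDS = {"cake", "noodle"}
POS = {"Hiroshima": "noun", "cake": "noun", "eat": "verb"}  # toy tagger

def word_features(word: str) -> dict:
    return {
        "word": word,
        "pos": POS.get(word, "unk"),
        "is_quote": word in {'"', "'"},
        "is_case_marker": word in {"ga", "wo", "ni"},  # surface cases, romanized
        "is_cue": word in CUE_WORDS,
        "is_product_word": word in PRODUCT_WORDS,
    }

f = word_features("cake")
```

Each token in a sentence would be mapped through `word_features` and, together with a window of neighboring features (*k* = 2 here), handed to the CRF.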


Based on this method, Ishino et al. constructed a travel link search system.<sup>6</sup> The system generated a list of URLs for Web sites related to a location, and automatically identified link types and the context of citations ("citing areas"), where the blog authors described the sites. Figure 3 shows a list of links related to "Osaka".

**Figure 3.** A list of Web sites for a travel spot

| Cue phrase | The number of cues |
| --- | --- |
| Dish names such as "omelet", collected from Wikipedia. | 2,779 |
| Cooking styles such as "Italian cuisine", collected from Wikipedia. | 114 |
| Words that are frequently used in the name of restaurants, such as "restaurant" or "dining room". | 21 |
| Words that are used when taking meals, such as "eat" or "delicious". | 52 |
| General words that indicate food, such as "rice" or "cooking". | 31 |
| Other words. | 31 |

**Table 3.** Cues for type R

<sup>6</sup> http://www.ls.info.hiroshima-cu.ac.jp/travel/

## **3. Travelers' behavior analysis**

The analysis of people's transportation information is considered an important issue in various fields, such as city planning, architectural planning, car navigation, sightseeing administration, crime prevention, and tracing the spread of epidemics. In this section, we focus on the analysis of travelers' behavior.

Ishino et al. [15] proposed a method to extract people's transportation information from automatically identified travel blogs written in Japanese [25]. They used machine learning to extract information, such as "departure place", "destination", or "transportation device", from travel blog entries. First, the tags used in their examination are defined.

• FROM tag indicates the departure place.

• TO tag indicates the destination.

• VIA tag indicates the route.

• METHOD tag indicates the transportation device.

• TIME tag indicates the time of transportation.

The following is a tagged example.

It took <TIME>five hours</TIME> to travel from <FROM>Hiroshima</FROM> to <TO>Osaka</TO> by <METHOD>bus</METHOD>.

They formulated the task of identifying the class of each word in a given sentence and solved it using machine learning. For the machine learning method, they used CRF [20], in the same way as Nanba et al. [25], as mentioned in Section 2.2. The CRF-based method identifies the class of each word. Features and tags are used in the CRF method as follows: (1) *k* tags occur before a target word; (2) *k* features occur before a target word; and (3) *k* features follow a target word. They used the value *k* = 4,<sup>7</sup> which was determined via a pilot study. They used the following features for machine learning.

• A word.

• The part of speech to which the word belongs (noun, verb, adjective, etc.).

• Whether the word is a quotation mark.

• Whether the word is a cue phrase.
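An inline-tagged sentence of this kind can be converted into per-word (token, label) pairs, the format a word-level sequence labeler consumes. This is a generic preprocessing sketch, not the authors' code; untagged tokens receive the conventional "O" label.

```python
import re

# Convert an inline-tagged sentence into per-word (token, label) pairs,
# the input format for a word-level CRF. "O" marks untagged tokens.
TAGGED = ("It took <TIME>five hours</TIME> to travel from "
          "<FROM>Hiroshima</FROM> to <TO>Osaka</TO> by <METHOD>bus</METHOD>.")

def to_labeled_tokens(text: str) -> list[tuple[str, str]]:
    pairs = []
    pos = 0
    # Alternate between plain text and tagged spans.
    for m in re.finditer(r"<(\w+)>(.*?)</\1>", text):
        for tok in text[pos:m.start()].split():
            pairs.append((tok, "O"))
        for tok in m.group(2).split():
            pairs.append((tok, m.group(1)))
        pos = m.end()
    for tok in text[pos:].split():
        pairs.append((tok, "O"))
    return pairs

tokens = to_labeled_tokens(TAGGED)
```

For the example sentence this yields pairs such as ("five", "TIME"), ("Hiroshima", "FROM"), and ("bus", "METHOD"), with the remaining words labeled "O".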


The details of cue phrases, together with the number of cue phrases of the given type, are shown as follows.

1. **FROM**: The word is a cue that often appears immediately after the FROM tag, such as "from" or "left": 40.

<sup>7</sup> Nanba et al. [25] used the smaller value *k* = 2 in the extraction of pairs comprising a location name and a local product (Section 2.2), because the tags are determined by a word itself or its adjacent words in most cases in Nanba's task.


2. **FROM & TO**: The word is frequently used in the name of a tourist spot, such as "museum" or "amusement park": 45. The word is the name of a tourist spot: 13,779. The word is the name of a station or airport: 9,437.

3. **TO**: The word is a cue that often appears immediately after the TO tag, such as "to" or "arrival": 271. The word is frequently used in the name of a destination, such as "sightseeing tour" or "station": 11.

4. **VIA**: The word is a cue that often appears immediately after the VIA tag, such as "via" or "through": 43. The word is the name of a highway: 101.

5. **METHOD**: The word is the name of a transportation device, such as "airplane" or "car": 148. The word is the name of a vehicle: 128. The word is the name of a train or bus: 2,033.

6. **TIME**: The word is an expression related to time, such as "minute" or "hour": 77.

They also constructed a visualization of transportation information, which is shown in Figure 4. In this figure, each arrow indicates a link from a departure place to a destination. In addition to arrows, transportation methods, such as trains or buses, are shown as icons.

**Figure 4.** Example of transportation information automatically extracted from travel blogs

Transportation information can also be extracted from texts written in English. Davidov [6] presented an algorithm framework that enables automated acquisition of map-link information from the Web, based on linguistic patterns such as "from X to". Given a set of locations as initial seeds, he retrieved an extended set of locations from the Web, and produced a map-link network that connected these locations using edges showing the transportation type.

## **4. Recommending travel information**

Recommendation systems provide a promising approach to ranking commercial products or documents according to a user's interests. In this section, we describe several studies and services that recommend travel information, covering tourist spots, landmarks, travel products, accommodation, and photos.

## **4.1. Recommending tourist spots**

Recommending tourist spots<sup>8</sup> has been well studied in the multimedia field. Movies and images are used as information sources in addition to texts. In this section, we describe two multimedia studies.

Hao et al. [10] proposed a method for mining location-representative knowledge from travel blogs based on a probabilistic topic model (the Location-Topic model). Using this model, they developed three modules: (1) destination recommendation for flexible queries; (2) characteristics summarization for a given destination, with representative tags and snippets; and (3) identification of informative parts of a travel blog and enriching recommendations with related images.

<sup>8</sup> Here, we use the terms "tourist spot" and "landmark" for a region, such as "Paris" or "New York", and also for a location or building, such as "the Eiffel Tower" or "Statue of Liberty".


Figure 5 shows an example of the system output. In this figure, a travel blog segment<sup>9</sup> is enriched with three images that depict its most informative parts. Each image's original tags and the words in the text to which it corresponds are also presented.

Wu et al. [34] proposed a system that summarized tourism-related information. When a user (traveler) entered a query, such as "What is the historical background of Tian Tan?", the system searched for and obtained information from Wikipedia, Flickr, YouTube, and official tourism Web sites using the tourist spot name as a query. The system also classified the query as belonging to one of five categories—"general", "history", "landscape", "indoor scenery", and "outdoor scenery"—in order to provide users with more relevant information. For example, when a query is classified as belonging to the "history" category, the information is obtained from texts, while for a query regarding "outdoor scenery", the information is obtained from photos and videos.
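The five-way query classification in Wu et al. [34] can be approximated, for illustration only, by keyword matching. The keyword lists below are invented assumptions; the actual system's classifier is not described at this level of detail here.

```python
# Toy keyword-based stand-in for the five-way query classification in
# Wu et al. [34]. The keyword lists are illustrative assumptions.
CATEGORY_KEYWORDS = {
    "history": ["history", "historical", "background"],
    "landscape": ["landscape", "scenery", "view"],
    "indoor scenery": ["indoor", "interior"],
    "outdoor scenery": ["outdoor", "garden"],
}

def classify_query(query: str) -> str:
    q = query.lower()
    for category, words in CATEGORY_KEYWORDS.items():
        if any(w in q for w in words):
            return category
    return "general"  # fallback category

cat = classify_query("What is the historical background of Tian Tan?")
```

The category then steers the retrieval source: a "history" query would be answered from texts, an "outdoor scenery" query from photos and videos.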

<sup>9</sup> A segment of a Maui travel blog entitled "Our Maiden Journey to Magical Maui", http://www.igougo.com/journal-j23321-Maui-Our\_Maiden\_Journey\_to\_Magical\_Maui.html


**Figure 5.** Example of travel blog segment visually enriched with related images

## **4.2. Recommending landmarks**

Finding and recommending landmarks is considered an important research topic in the multimedia field, along with recommending tourist spots. Abbasi et al. [1] focused on the photo-sharing system Flickr, and proposed a method to identify landmark photos using tags and social Flickr groups. Gao et al. [7] also proposed a method to identify landmarks using Flickr and the Yahoo Travel Guide.

Ji et al. [17] proposed another method for finding landmarks. They adopted the method of clustering blog photos relating to a particular tourist site, such as Louvre Museum in Paris.<sup>10</sup> Then they represented these photos as a graph based on the clustering results, and detected landmarks using link analysis methods, such as the PageRank [3] and HITS [19] algorithms.
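The link-analysis step in Ji et al. [17] can be sketched with a minimal PageRank over a photo-similarity graph. The implementation and the three-node toy graph are illustrative, not the authors' system.

```python
# Minimal PageRank over a small photo-similarity graph, in the spirit
# of the link-analysis step in Ji et al. [17]. The graph is a toy example.
def pagerank(graph: dict[str, list[str]], d: float = 0.85,
             iters: int = 50) -> dict[str, float]:
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            if outs:
                share = d * rank[n] / len(outs)
                for m in outs:
                    new[m] += share
            else:  # dangling node: spread its rank uniformly
                for m in nodes:
                    new[m] += d * rank[n] / len(nodes)
        rank = new
    return rank

# Photos linked by visual similarity; "pyramid" is pointed to most often.
g = {"a": ["pyramid"], "b": ["pyramid"], "pyramid": ["a"]}
r = pagerank(g)
```

The highest-ranked photo cluster is taken as a landmark candidate; HITS could be substituted by replacing the rank update with hub/authority updates.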

## **4.3. Recommending travel products**

Ishino et al. [14] proposed a method that added links to advertisements for travel products to the travel information links that were described in Section 2.3.<sup>11</sup> The procedure for providing ad links is as follows.

1. Input a link type and the citing areas of a travel information link.

2. Extract keywords from the citing areas.

3. Extract product data containing all keywords, and calculate the similarity between the citing areas of a travel information link and the product data.

4. Provide the ad link to the product data having the highest similarity to the travel information link.


They extracted keywords for travel products corresponding to the link type. They used the same cues to classify travel information links [15] (see Section 2.3), and then extracted keywords from the citing areas of links of types S (Spot) and R (Restaurant).

<sup>10</sup> For calculating the similarity between two photos, they used the Bag-of-Visual-Words representation [18, 26], which represents an image as a set of salient regions (visual words), called Bag-of-Visual-Words vectors. Then the similarity between photos is measured based on the cosine distance between their Bag-of-Visual-Words vectors. In addition to the features in each image, they also used textual information for each photo, such as the title, description, and surrounding text.

<sup>11</sup> http://www.ls.info.hiroshima-cu.ac.jp/travel/

First, the method for extracting keywords from the citing areas of links of type S is described. The cues for type S, such as tourist spots collected from Wikipedia and words frequently used in the names of tourist spots, tend to become keywords. Therefore, they registered these cues as candidate keywords for links of type S. If the citing areas of these links contained candidate keywords, they extracted the candidates as keywords. In addition, if citing areas contained names of places, they extracted the names as keywords.

The cues for type R, such as dish names and cooking styles, also tend to become keywords. Therefore, they registered these cues as candidate keywords for links of type R. If the citing areas for links of type R contained candidate keywords, they extracted them as keywords.
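Ranking product data by similarity to a citing area can be sketched with bag-of-words cosine similarity. The texts and product names below are invented examples, and the paper does not specify this exact similarity measure for the ad-link step.

```python
import math
from collections import Counter

# Bag-of-words cosine similarity between the citing area of a travel
# link and candidate product descriptions. All texts are invented.
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_product(citing_area: str, products: dict[str, str]) -> str:
    """Return the product whose description is closest to the citing area."""
    q = Counter(citing_area.lower().split())
    return max(products,
               key=lambda p: cosine(q, Counter(products[p].lower().split())))

products = {
    "noodle-set": "chinese noodle gift set",
    "beach-towel": "beach towel souvenir",
}
choice = best_product("we ate chinese noodle at the museum", products)
```

The ad link would then be attached to the highest-similarity product, here the noodle gift set.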

## **4.4. Recommending accommodation**


Titov and McDonald [31] proposed an aspect-based summarization system, and applied the method to the summarization of hotel reviews. The system took as input a set of user reviews for a specific product or service with a numeric rating (left side in Figure 6), and produced a set of relevant aspects, which they called an aspect-based summary (right side in Figure 6). To extract all relevant mentions in each review for each aspect, they introduced a topic model. They applied their method to hotel reviews on the TripAdvisor Web site,<sup>12</sup> and obtained aspect-based summaries for each hotel.

**Figure 6.** Producing aspect mentions from a corpus of aspect rated reviews

To obtain more reliable hotel reviews, opinion spams should be detected and eliminated. Opinion spams are fictitious opinions that have been deliberately written to sound authentic. Ott et al. [27] proposed a method to detect opinion spam among consumer reviews of hotels. They created 400 deceptive opinions using the Amazon Mechanical Turk (AMT) crowdsourcing service<sup>13</sup> by asking anonymous online workers (Turkers) to create the opinion spam for 20 chosen hotels. In addition to these spam messages, they selected 6,977 truthful opinions from TripAdvisor, and used both groups for their task.

## **4.5. Recommending photos**

Bressan et al. [2] proposed a travel blog assistant system that facilitated travel blog writing by selecting, for each blog paragraph, the most relevant images from a repository of multimedia objects, ranked by the similarity between the extracted image metadata and the paragraph.<sup>14</sup>

<sup>12</sup> http://www.tripadvisor.com

<sup>13</sup> https://www.mturk.com/


## **5. Interfaces for travel information access**

In this section, we describe two studies that focused on interfaces for travel information access.

## **5.1. Providing travel information along streetcar lines**

Ishino et al. [13] proposed a method for collecting blog entries about the Hiroshima Electric Railway (Hiroden) from a blog database.<sup>15</sup> Hiroden blog entries were defined as travel journals that provide regional information for streetcar stations in Hiroshima. The task of collecting Hiroden blog entries was divided into two steps: (1) collection of blog entries; and (2) identification of Hiroden blog entries.

**Figure 7.** A route map of the Hiroden system

Automatic Compilation of Travel Information from Texts: A Survey

http://dx.doi.org/10.5772/51290

147

**Figure 8.** A list of links to Hiroden blog entries

Figure 7 shows a route map used by the system for providing travel information along the Hiroden streetcar lines. The route map shows Hiroden streetcar stations and major tourist spots. The steps in the search procedure are as follows.

• (Step 1) Click a Hiroden streetcar station, such as " " (Atomic Bomb Dome), in Figure 7 to generate a list of links to Hiroden blog entries (Figure 8).

• (Step 2) Click the link to a Hiroden blog entry to display it.


## **5.2. Natural language interface for accessing databases**

Several ontologies for e-tourism have been developed (see Section 6). Unfortunately, the gap between human users who want to retrieve information and the Semantic Web has yet to be closed. Ruiz-Martínez et al. [30] proposed a method for querying ontological knowledge bases using natural language sentences. For example, when the user input the query "I want to visit the most important tourist attractions in Paris", the system performed part-of-speech tagging, lemmatizing, and expansion of query terms with synonyms, and finally searched the ontology.
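The query pipeline just described can be sketched as follows; the lemma table, synonym table, and stopword list are tiny hand-made stand-ins for a real tagger, lemmatizer, and thesaurus, and the resulting term set would then be matched against ontology concepts.

```python
# Minimal sketch of the query-processing pipeline. All tables are
# illustrative stand-ins for real linguistic resources.
LEMMAS = {"attractions": "attraction"}
SYNONYMS = {"attraction": {"sight", "landmark"}, "important": {"major", "main"}}
STOPWORDS = {"i", "to", "the", "most", "in", "want"}

def expand_query(query: str) -> set:
    terms = set()
    for token in query.lower().split():
        if token in STOPWORDS:
            continue
        lemma = LEMMAS.get(token, token)   # lemmatize
        terms.add(lemma)
        terms |= SYNONYMS.get(lemma, set())  # expand with synonyms
    return terms

q = expand_query("I want to visit the most important tourist attractions in Paris")
print(sorted(q))
```

Each surviving term (e.g. "attraction" plus its synonyms "sight" and "landmark") becomes a candidate for matching concepts and instances in the ontology.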

<sup>14</sup> Bressan et al. used images that were categorized into 44 classes as training data for visual categorization. Each class was given a short text name, such as "clouds and sky" or "beach". When an image was categorized as belonging to classes A and B using the visual categorizer, the short texts given to each class were assigned as keywords of the image.

<sup>15</sup> http://165.242.101.30/travel/hiroden/


## **6. Linguistic resources for studies of automatic compilation of travel information from texts**

## **Text Corpora**

• TripAdvisor: http://tripadvisor.com
This site provides fifty million reviews written in various languages.

• Footstops: http://footstops.com
This site provides more than 8,000 blog entries written in English.

• IgoUgo: http://www.igougo.com
This site provides 530,000 reviews and 62,000 blog entries written in English.

• Travbuddy: http://www.travbuddy.com
This site provides more than 90,000 reviews and 180,000 blog entries written in English.

• TravelBlog: http://www.travelblog.org
This site provides more than 600,000 blog entries written in English. Each entry is classified at city level in a geographic hierarchy.

• Travellerspoint: http://www.travellerspoint.com
This site provides more than 180,000 blog entries written in English.

• TravelPod: http://www.travelpod.com
This site is one of the oldest travel portals, started in 1997, and provides blog entries written in English.

• 4travel: http://4travel.jp
This site provides approximately 300,000 reviews and 600,000 blog entries written in Japanese. Each review is classified at city level in a geographic hierarchy.

## **Databases for Travel**

• Rakuten travel data: http://www.nii.ac.jp/cscenter/idr/datalist.html (Japanese)
Basic information about 11,468 properties and 350,000 reviews.

• Travel product data in Rakuten Shopping Mall (Rakuten Ichiba): http://www.nii.ac.jp/cscenter/idr/datalist.html (Japanese)
The data comprise 50 million items. Each item has name, code, price, URL, picture, shop code, category ID, and descriptive text and registration data.

## **Useful Sites or Services for Travel**

• WikiTravel: http://wikitravel.org
The travel recommendation system contributed by "WikiTravellers". For each destination, the articles in WikiTravel generally include all or parts of the following information: history, climate, landmarks, work information, shopping information, food, and how to get there.

• Yahoo Travel Guide: http://travel.yahoo.com/
This site provides an area-based recommendation service. For each country, several main cities are listed.

## **Ontologies for Travel**

• The World Tourism Organization (WTO) provides a multilingual thesaurus in English, French, and Spanish that offers a standard terminology for tourism [33].

• DERI's e-Tourism Working group has created a tourism ontology called "OnTour" [28]. This ontology describes the main conventional concepts for tourism such as accommodation or activities, together with other supplementary concepts such as GPS coordinates or a postal address.

• LA\_DMS is an ontology for tourism destinations that was developed for the Destination Management System (DMS). This system adapts information requests about tourist destinations to users' needs [16].

Many other ontologies for travel were introduced by Ruiz-Martínez et al. [30].

## **Evaluation Workshop**

#### *GeoCLEF: Geographic Information Retrieval*

GeoCLEF (http://ir.shef.ac.uk/geoclef/) was the cross-language geographic retrieval track run as part of the Cross-Language Evaluation Forum (CLEF). It operated from 2005 to 2008 [11, 12, 21, 22]. The goal of this task was to retrieve news articles relevant to particular aspects of geographic information.

#### *NTCIR GeoTime*

NTCIR GeoTime was another cross-language geographic retrieval track run as part of the NTCIR. It operated from 2008 to 2011 [8, 9]. The focus of this task was searching with geographic and temporal constraints using Japanese and English news articles as target documents.

## **7. Conclusions and future directions**

In this chapter, we have introduced the state of the art of research and services related to travel information. There are several future directions for this research field.

• We mentioned in Section 2 that several natural language processing technologies are useful for creating databases for travel. These technologies may also be applied to maintain manually created databases or ontologies for travel, such as those discussed in Section 6.

• Multilingualization of the ontologies for travel using machine translation techniques [4] is also considered an important task for encouraging further studies in this research field.

• There are many different locations that have the same name (place name polysemy), and there may be multiple names for a given location (place name synonymy). To eliminate this geo-ambiguity problem, Ji et al. [17] proposed the Hierarchical-comparison Geo-Disambiguation (HGD) algorithm, which distinguished the city-level location using a combination of its lower-level locations, derived from the hierarchical location relationships. In addition to this method, several natural language processing technologies, such as automatic acquisition of synonyms [5, 29, 35, 36] and word sense disambiguation [23], are available.

• Recommending landmarks (landmark finding) is a standard research topic in image processing using Flickr. In this chapter, we mentioned three studies [1, 7, 17] that relied mainly on image processing and tag-based recommendation techniques rather than natural language processing. The authors believe that there is still room to improve the methods of recommending landmarks by natural language processing, because sentiment analysis techniques, such as those used for recommending accommodation, have not yet been used for recommending landmarks.

## **Author details**

Hidetsugu Nanba⋆, Aya Ishino and Toshiyuki Takezawa

<sup>⋆</sup> Address all correspondence to: nanba@hiroshima-cu.ac.jp

Graduate School of Information Sciences, Hiroshima City University, Japan

## **References**

[1] Abbasi, R., Chernov, S., Nejdl, W., Paiu, R., Staab, S. (2009) Exploiting Flickr Tags and Groups for Finding Landmark Photos. Proceedings of ECIR 2009, pp.654–661.

[2] Bressan, M., Csurka, G., Hoppenot, Y., and Renders, J.M. (2008) Travel Blog Assistant System (TBAS) - An Example Scenario of How to Enrich Text with Images and Images with Text using Online Multimedia Repositories. Proceedings of VISAPP Workshop on Metadata Mining for Image Understanding.

[3] Brin, S. and Page, L. (1998) The Anatomy of a Large-scale Hypertextual Web Search Engine. Proceedings of World Wide Web Conference 1998.

[4] Brown, P.F., Pietra, S.A.D., Pietra, V.J.D., and Mercer, R.L. (1993) The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, Vol.19, No.2, pp.263–311.

[5] Callison-Burch, C., Koehn, P., and Osborne, M. (2006) Improved Statistical Machine Translation Using Paraphrases. Proceedings of NAACL 2006, pp.17–24.

[6] Davidov, D. (2009) Geo-mining: Discovery of Road and Transport Networks Using Directional Patterns. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp.267–275.

[7] Gao, Y., Tang, J., Hong, R., Dai, Q., Chua, T.-S., and Jain, R. (2010) W2Go: A Travel Guidance System by Automatic Landmark Ranking. Proceedings of ACM Multimedia'10.

[8] Gey, F., Larson, R., Machado, J., and Yoshioka, M. (2011) NTCIR9-GeoTime Overview: Evaluating Geographic and Temporal Search: Round 2. Proceedings of NTCIR-9 Workshop Meeting.

[9] Gey, F., Larson, R., Kando, N., Machado, J., and Sakai, T. (2010) NTCIR-GeoTime Overview: Evaluating Geographic and Temporal Search. Proceedings of NTCIR-8 Workshop Meeting.

[10] Hao, Q., Cai, R., Wang, C., Xiao, R., Yang, J.-M., Pang, Y., and Zhang, L. (2010) Equip Tourists with Knowledge Mined from Travelogues. Proceedings of World Wide Web Conference 2010.

[11] Gey, F., Larson, R.R., Sanderson, M., Bischoff, K., Mandl, T., Womser-Hacker, C., Santos, D., Rocha, P., Nunzio, G.M.D., Ferro, N. (2006) GeoCLEF 2006: The CLEF 2006 Cross-Language Geographic Information Retrieval Track Overview. Proceedings of CLEF 2006, pp.852–876.

[12] Gey, F., Larson, R.R., Sanderson, M., Joho, H., Clough, P., and Petras, V. (2005) GeoCLEF: The CLEF 2005 Cross-Language Geographic Information Retrieval Track Overview. Lecture Notes in Computer Science, LNCS4022, pp.908–919.

[13] Ishino, A., Nanba, H., and Takezawa, T. (2012) Construction of a System for Providing Travel Information along Hiroden Streetcar Lines. Proceedings of the 3rd IIAI International Conference on e-Services and Knowledge Management.

[14] Ishino, A., Nanba, H., and Takezawa, T. (2011) Providing Ad Links to Travel Blog Entries Based on Link Types. Proceedings of the 9th Workshop on Asian Language Resources, collocated with IJCNLP 2011, pp.63–70.

[15] Ishino, A., Nanba, H., and Takezawa, T. (2011) Automatic Compilation of an Online Travel Portal from Automatically Extracted Travel Blog Entries. Proceedings of ENTER 2011.

[16] Jakkilinki, R., Georgievski, M., and Sharda, N. (2007) Connecting Destinations with an Ontology-Based e-Tourism Planner. Information and Communication Technologies in Tourism, pp.21–32.

[17] Ji, R., Xie, X., Yao, H., and Ma, W.-Y. (2009) Mining City Landmarks from Blogs by Graph Modeling. Proceedings of ACM Multimedia'09, pp.105–114.

[18] Jia, M.-L., Fan, X., Xie, X., Li, M.-J., and Ma, W.-Y. (2006) Photo-to-search: Using Camera Phones to Inquire of the Surrounding World. Mobile Data Management.

[19] Kleinberg, J. (1999) Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, Vol.46, No.5, pp.604–622.

[20] Lafferty, J., McCallum, A., and Pereira, F. (2001) Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the 18th Conference on Machine Learning, pp.282–289.

[21] Mandl, T., Carvalho, P., Nunzio, G.M.D., Gey, F., Larson, R.R., Santos, D., Womser-Hacker, C. (2008) GeoCLEF 2008: The CLEF 2008 Cross-Language Geographic Information Retrieval Track Overview. Proceedings of CLEF 2008, pp.808–821.

[22] Mandl, T., Gey, F., Nunzio, G.M.D., Ferro, N., Larson, R.R., Sanderson, M., Santos, D., Womser-Hacker, C., Xie, X. (2007) GeoCLEF 2007: The CLEF 2007 Cross-Language Geographic Information Retrieval Track Overview. Proceedings of CLEF 2007, pp.745–772.

[23] Manning, C.D. and Schütze, H. (2000) Foundations of Statistical Natural Language Processing, chapter 7, MIT Press.

[24] Nakatoh, T., Yin, C., and Hirokawa, S. (2011) Characteristic Grammatical Context of Tourism Information, ICIC Express Letters, Vol.4, No.5.

[25] Nanba, H., Taguma, H., Ozaki, T., Kobayashi, D., Ishino, A., and Takezawa, T. (2009) Automatic Compilation of Travel Information from Automatically Identified Travel Blogs. Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing, pp.205–208.

[26] Nister, D. and Stewenius, H. (2006) Scalable Recognition with a Vocabulary Tree. Proceedings of CVPR 2006.

[27] Ott, M., Choi, Y., Cardie, C., and Hancock, J.T. (2011) Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp.309–319.

[28] Prantner, K. (2004) OnTour - The Ontology -, DERI Innsbruck.

[29] Quirk, C., Brockett, C., and Dolan, W. (2004) Monolingual Machine Translation for Paraphrase Generation. Proceedings of EMNLP 2004, pp.142–149.

[30] Ruiz-Martínez, J.M., Castellanos-Nieves, D., Valencia-García, R., Fernández-Breis, J.T., García-Sánchez, F., Vivancos-Vicente, P.J., Castejón-Garrido, J.S., Camón, J.B., and Martínez-Béjar, R. (2009) Accessing Touristic Knowledge Bases through a Natural Language Interface. Proceedings of PKAW 2008, LNAI 5465, pp.147–160.

[31] Titov, I. and McDonald, R. (2008) A Joint Model of Text and Aspect Ratings for Sentiment Summarization. Proceedings of the Annual Meeting of the Association for Computational Linguistics & Human Language Technology, pp.308–316.

[32] Tsai, R.T.-H. and Chou, C.-H. (2011) Extracting Dish Names from Chinese Blog Reviews Using Suffix Arrays and a Multi-Modal CRF Model. Proceedings of ACM SIGIR 2011.

[33] World Tourism Organization (2001) Thesaurus on Tourism and Leisure Activities of the World Tourism Organization.

[34] Wu, X., Li, J., and Neo, S.-Y. (2008) Personalized Multimedia Web Summarization for Tourist. Proceedings of World Wide Web Conference 2008.

[35] Zhao, S., Niu, C., Zhou, M., Liu, T., and Li, S. (2008) Combining Multiple Resources to Improve SMT-based Paraphrasing Model. Proceedings of ACL-HLT 2008, pp.1021–1029.

[36] Zhou, L., Lin, C.-Y., Munteanu, D.S., and Hovy, E. (2006) ParaEval: Using Paraphrases to Evaluate Summaries Automatically. Proceedings of HLT-NAACL 2006, pp.447–454.

**Chapter 7**

## **Analyses on Text Data Related to the Safety of Drug Use Based on Text Mining Techniques**

Masaomi Kimura

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51195

© 2012 Kimura; licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **1. Introduction**

One of the main raisons d'être of medical care is to cure patients and save their lives. Drug safety has attracted attention for a long time, with an emphasis on the toxicity and side effects of drugs. In addition to this, the safety of drug use is attracting increasing attention from the perspective of medical accident prevention. In order to prevent medical accidents, such as errors involving medicines, double dosage and insufficient dosage, it is necessary to ensure the proper treatment with the right medicines, namely, the safety of drug use. The confirmation of usage should be one of the keys to identifying errors and preventing misuse. Consider the case when a doctor inputs prescription data into a computerized order entry system for medicines. If the system shows the doctor information concerning therapeutic indications, the doctor can subsequently avoid such errors. To enable this, the order entry system requires databases containing information on dosage regimens so that the proper usage can be verified.

The most reliable data, which can be a source of the databases, is the package insert published by pharmaceutical companies as an official document attached to each medicine. Original package inserts are, however, distributed as paper documents and unsuitable for processing by a computer system. In Japan, the Pharmaceuticals and Medical Devices Agency (PMDA), which is an extra-departmental body of the Japanese Ministry of Health, Labour and Welfare, has released SGML-formatted package insert data. SGML is a long-established markup language, which adds metadata and structure to data by tagging, as defined by a DTD. In fact, it is difficult to leverage the data structure defined in the DTD for analysis of the data. This is because the definition of the data structure is ambiguous and because the information is not well structured, namely, it is described by the sentences in tagged elements. This hinders the utilization of the SGML-formatted package insert data, especially as a database used in computer systems that ensure the safety of medicinal usage. We should also note that the SGML version package inserts usually describe their contents as sentences, as in the original paper version package inserts. In order to obtain information from package insert data, we need to analyze the sentences in package insert data.

Other important sources of knowledge besides official package inserts are the practices of medical experts. One of the useful and important ways to understand what people think is to conduct

a survey in the form of a questionnaire. In particular, the freely described data included in the questionnaire responses represent an important source that lets us know the real thoughts of the people. However, it is not easy to analyze such freely described data by hand, since a large number of responses are anticipated and subsequent analysis using manual counting may be influenced by the individual prejudice of the analysts involved. It is, therefore, suitable to apply a text mining approach to objectively analyze such freely described data. As readers know, text mining is an analytical technique based on data mining/statistical analysis algorithms and NLP algorithms. It has wide applicability, including clustering research papers or newspaper articles, finding trends in call center logs or blogged articles, and so on. The clustering of textual data is popular as a commonly available method to classify data and understand their structure. Unlike such applications, however, the freely described data contained in the responses to a questionnaire have characteristics such as a small number of short sentences in each piece of data and wide-ranging content that precludes the application of clustering algorithms to classify it. In this chapter, we review cases in which our method was applied to questionnaire data.

## **2. Method**

### **2.1. Word-link method and dependency-link method [3]**

*2.1.1. Introduction*

In order to determine the features of freely described data, the easiest and simplest way is to apply morphological analysis and count the number of the roots (main parts) of morphemes, which shows us particular words recurring frequently and suggests the nature of the themes discussed by respondents. This method, however, yields a result that is difficult to interpret in the case where several different topics are contained in the entire set of free descriptions in the questionnaire responses. This is because that method can show the appearance of words but does not preserve their inter-relations. This method cannot, therefore, provide us with more in-depth information, such as how matters related to the topic are evaluated by the respondents.

Regarding the syntax tree of a sentence based on modification relationships as semi-structured data, Matsuzawa et al. [1] and Kudo et al. [2] have applied pattern mining algorithms to extract frequently appearing subtrees, namely, sub-sentences recurring in multiple sentences more than a specified number of times (support). These represent rigorous means to determine the patterns of sub-sentences, which preserve the co-occurrence relationships of words and their structure in sentences.

As for the freely described data written by respondents, there is no guarantee of them expressing the same opinion in sentences of the same structure. If the respondents write similar sentences but with slightly different structures, it is difficult to identify the sentences by matching their substructures alone. In addition, we have to maintain the entire data in memory at the same time when we use the pattern mining algorithm, which prunes the substructures appearing less often than the support during the process. It is preferable that the algorithm be applicable to data of a sufficiently large size to cover surveillance in the form of a large-scale questionnaire.

In this section, we, therefore, suggest a method featuring summarized description data, by initially aggregating modification relations and then limiting them to instances appearing more than the support. By connecting the resultant modification relations and finding word sequences which can be reconstituted into understandable sentences, we can expect to extract sentences which contain the main opinions of the respondents.

*2.1.2. Theory*

Let *s<sub>i</sub>* (*i* = 1 ··· *n*) denote the sentences in freely described text data. Applying morphological analysis to *s<sub>i</sub>*, we obtain a series of words *W*(*s<sub>i</sub>*) = {*w<sup>i</sup>*<sub>1</sub>, *w<sup>i</sup>*<sub>2</sub>, ···}, where *w<sup>i</sup>*<sub>*j*</sub> denotes a word in the sentence *s<sub>i</sub>*. We also define a set of dependency relations *D*(*s<sub>i</sub>*) = {*d<sup>i</sup>*<sub>1</sub>, *d<sup>i</sup>*<sub>2</sub>, ···}, where *d<sup>i</sup>*<sub>*j*</sub> denotes a dependency relation in the sentence *s<sub>i</sub>*, and their union set *D* = ∪<sub>*i*</sub>*D*(*s<sub>i</sub>*).

For instance, if we target the two sentences, *s*<sub>1</sub> = " " (the safety of drug is important) and *s*<sub>2</sub> = " " (the safety of drug needs to be improved),

• *W*(*s*<sub>1</sub>) = { (drug), (safety), (important) },

• *W*(*s*<sub>2</sub>) = { (drug), (safety), (improved), (needs) },

a survey in the form of a questionnaire. In particular, the freely described data included in the questionnaire responses represent an important source for learning the real thoughts of the people. However, it is not easy to analyze such freely described data by hand, since a large number of responses are anticipated and subsequent analysis using manual counting may be influenced by the individual prejudice of the analysts involved. It is, therefore, suitable to apply a text mining approach to objectively analyze such freely described data. As readers know, text mining is an analytical technique based on data mining / statistical analysis algorithms and NLP algorithms. It has wide applicability, including clustering research papers or newspaper articles, finding trends in call center logs or blogged articles, and so on. The clustering of textual data is popular as a commonly available method to classify data and understand their structure. Unlike such applications, however, the freely described data contained in the responses of a questionnaire have characteristics such as a small number of short sentences in each piece of data and wide-ranging content that precludes the application of clustering algorithms to classify it. In this chapter, we review the cases of application of our method to questionnaire data.

As we mentioned above, it is necessary to avoid medical accidents. To take countermeasures, past cases must be investigated to identify their causes and suitable countermeasures. Medical incidents caused by treatment with the wrong medicines are strongly related to medical accidents occurring due to a lack of safety in drug usage. Medical incidents are events that could become medical accidents in the absence of certain suppression factors, and they tend to occur more frequently than medical accidents. Applying Heinrich's law, which describes the relationship between the frequency and seriousness of industrial accidents, we can estimate that for every serious medical accident there are thirty minor accidents and 300 incidents. This can be interpreted as medical accidents having many causes, most of which are eliminated by certain suppression factors and lead only to incidents, while the remaining causes lead to medical accidents. From this perspective, we can expect both medical accidents and incidents to originate from identical causes, which suggests that analyzing incident data is a valid way to investigate the causes of medical accidents, since incidents occur much more frequently than medical accidents. Though simple aggregation calculations and descriptive statistics have already been applied to drug-related medical incident data, such analyses are too simple to extract sufficient information, such as the reasons behind incidents under particular circumstances. To perform such analyses properly, we should apply text mining techniques to the texts describing incidents.

In this chapter, we introduce the techniques that we have developed, the word-link method and the dependency-link method, and review their application to the following data:

- Application to an analysis of descriptions of dosage regimens described in package inserts of medicines
- Application to data obtained by nation-wide investigations based on questionnaires about the 'therapeutic classification mark' printed on transdermal cardiac patches
- Application to incident data disclosed by the Government of Japan

## **2. Method**

## **2.1. Word-link method and dependency-link method [3]**

#### *2.1.1. Introduction*

In order to determine the features of freely described data, the easiest and simplest way is to apply morphological analysis and count the occurrences of morpheme roots (main parts), which shows us particular words recurring frequently and suggests the nature of the themes discussed by respondents. This method, however, yields results that are difficult to interpret when several different topics are contained in the free descriptions in the questionnaire responses, because it can show the appearance of words but does not preserve their inter-relations. It cannot, therefore, provide more in-depth information, such as how matters related to the topic are evaluated by the respondents.

Regarding the syntax tree of a sentence based on modification relationships as semi-structured data, Matsuzawa et al. [1] and Kudo et al. [2] applied pattern mining algorithms to extract frequently appearing subtrees, namely, sub-sentences recurring in multiple sentences more than a specified number of times (the support). These represent rigorous means to determine the patterns of sub-sentences, preserving the co-occurrence relationships of words and their structure in sentences.

As for freely described data written by respondents, there is no guarantee that they express the same opinion in sentences of the same structure. If respondents write similar sentences with slightly different structures, it is difficult to identify the sentences by matching their substructures alone. In addition, the pattern mining algorithm, which prunes substructures appearing less often than the support during the process, requires the entire data to be kept in memory at the same time. It is preferable that the algorithm be applicable to data large enough to cover surveillance in the form of a large-scale questionnaire.

In this section, we therefore suggest a method that summarizes description data by first aggregating modification relations and then limiting them to those appearing at least as often as the support. By connecting the resultant modification relations and finding word sequences which can be reconstituted into understandable sentences, we can expect to extract sentences which contain the main opinions of the respondents.

#### *2.1.2. Theory*

Let $s_i$ ($i = 1 \cdots n$) denote the sentences in freely described text data. Applying morphological analysis to $s_i$, we obtain a series of words $W(s_i) = \{w^i_1, w^i_2, \cdots\}$, where $w^i_j$ denotes a word in the sentence $s_i$. We also define a set of dependency relations $D(s_i) = \{d^i_1, d^i_2, \cdots\}$, where $d^i_j$ denotes a dependency relation in the sentence $s_i$, and their union set $D = \cup_i D(s_i)$.

For instance, if we target the two sentences $s_1$ = " " (the safety of drug is important) and $s_2$ = " " (the safety of drug needs to be improved), we obtain:

- $D(s_1)$ = { (drug) → (safety), (safety) → (important) },
- $D(s_2)$ = { (drug) → (safety), (safety) → (needs), (improved) → (needs) },
- $D$ = { (drug) → (safety), (safety) → (important), (safety) → (needs), (improved) → (needs) }.


Note that, by following the linkage of $d^i_j \in D(s_i)$, we can reproduce the original sentence $s_i$, except for the order of appearance of modifications which modify the same word. If the word $w^i_j$ modifies another word $w^i_k$ and the dependency relation $d^i \in D(s_i)$ relates these words, we can define 'counterpart' functions such as

$$d^i = L(w^i_j, w^i_k) \tag{1}$$


$$w^i_j = S(d^i) \tag{2}$$

$$w\_k^i = E(d^i). \tag{3}$$

The function $L$ denotes the dependency linkage between $w^i_j$ and $w^i_k$, while $S$ and $E$ return the modifying word and the modified word, respectively. For instance, for the dependency (drug) → (safety), $d$ = $L$( (drug), (safety)), (drug) = $S(d)$ and (safety) = $E(d)$.

Note that some relations between these functions hold as follows:

$$d^i = L(S(d^i), E(d^i))\tag{4}$$

$$w^i_j = S(L(w^i_j, w^i_k)) \tag{5}$$

$$w^i_k = E(L(w^i_j, w^i_k)). \tag{6}$$
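As a concrete illustration (our sketch, not the authors' code), a dependency relation can be modeled as a (modifier, modified) tuple, which makes the counterpart functions $L$, $S$, and $E$ one-liners satisfying Eqs. (1)-(6) by construction:

```python
def L(w_j, w_k):
    """Eq. (1): the dependency linkage of modifier w_j and modified word w_k."""
    return (w_j, w_k)

def S(d):
    """Eq. (2): the modifying word of dependency relation d."""
    return d[0]

def E(d):
    """Eq. (3): the modified word of dependency relation d."""
    return d[1]

# Eqs. (4)-(6) then hold by construction, e.g. d == L(S(d), E(d)).
d = L("drug", "safety")
```

Under this representation, the identities (4)-(6) are immediate tuple round-trips.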

Let us assume that, in the target language, the verb of the main clause is modified by other words but does not itself modify another word. For every dependency relation $d^i \in D(s_i)$ whose $E(d^i)$ is not the verb of the main clause of $s_i$, there exists another $d'^i \in D(s_i)$ which satisfies

$$E(d^i) = S(d'^i),$$

because each word except the verb of the main clause necessarily modifies another word in the sentence.

Thus, for each $d \in D$ such that $E(d)$ is not the verb of the main clause of one of the original sentences $\{s_i\}$, there exists $d' \in D$ satisfying $E(d) = S(d')$. Since $D(s_i) \subset D$ by the definition of $D$, we can find a series of modification relations in $D$ satisfying $E(d) = S(d')$ and reproduce all the original sentences $\{s_i\}$ by following their linkage.

However, rather than all sentences, we are only interested in the sentences described by multiple respondents. If the same sentence appears $\eta$ times, the dependency relations in that sentence will also recur (at least) $\eta$ times. Let us define a 'support' function:

$$\operatorname{supp}_D(d) = \operatorname{card}\{ s_i \mid d \in D(s_i) \},$$

where 'card' denotes the cardinality of a set. The above statement can be restated via $\operatorname{supp}_D(d)$ as follows: if there are $\eta$ sentences with the same dependency structure as $s_i$, then each $d^i_k \in D(s_i)$ is contained in $\eta$ sentences. Thus the following inequality holds for each $d^i_k \in D(s_i)$:

$$\operatorname{supp}_D(d^i_k) \ge \eta.$$

Therefore, if we limit $D$ to the set satisfying this constraint:


$$D^\eta = \{ d \mid d \in D, \operatorname{supp}_D(d) \ge \eta \},$$

each modification relation appearing in sentences with the same dependency structure more than $\eta$ times is a member of $D^\eta$. These dependency relations satisfy the relation $E(d) = S(d')$, though, in general, we cannot necessarily expect a dependency relation $d \in D^\eta$ with $E(d) = S(d')$ to exist for every $d' \in D^\eta$. We can therefore expect to find sentences described by multiple respondents with an equivalent dependency structure by following the linkage of dependency relations in $D^\eta$ that satisfy $E(d) = S(d')$. (We call this method using a series of dependency relations the 'word-link method'.)
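A minimal sketch of the word-link method, assuming (as above) that each sentence is given as a set of (modifier, modified) tuples; the function names are ours, and we assume the dependency graph over $D^\eta$ is acyclic:

```python
from collections import Counter, defaultdict

def word_link(sentences_deps, eta):
    """Word-link method sketch: compute supp_D, keep D^eta, and follow
    linkages E(d) == S(d') to reconstitute word sequences."""
    # supp_D(d) = number of sentences whose dependency set contains d
    supp = Counter()
    for deps in sentences_deps:
        for dep in set(deps):
            supp[dep] += 1
    # D^eta = relations whose support is at least eta
    d_eta = [dep for dep, c in supp.items() if c >= eta]
    # index relations by their modifying word to follow linkages
    by_start = defaultdict(list)
    for dep in d_eta:
        by_start[dep[0]].append(dep)
    # start chains at words that modify but are never modified
    starts = {dep[0] for dep in d_eta} - {dep[1] for dep in d_eta}
    chains = []
    def extend(path):
        nexts = by_start.get(path[-1], [])
        if not nexts:            # no further linkage: emit the word sequence
            chains.append(path)
        for dep in nexts:
            extend(path + [dep[1]])
    for w in starts:
        extend([w])
    return d_eta, chains
```

For the two example sentences above with $\eta = 2$, only (drug) → (safety) survives the support filter, and the single reconstituted sequence is drug → safety.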

In fact, we should be aware that the extraction of a series of dependency relations in $D^\eta$ satisfying $E(d) = S(d')$ is only a necessary condition for finding such sentences, because the co-occurrence of dependency relations is not preserved in this operation. In other words, elements $d$ and $d'$ of $D^\eta$ which satisfy $E(d) = S(d')$ do not necessarily appear in the same sentence. To ensure the co-occurrence of dependency relations, it is necessary to confirm that the dependency relations $d$ and $d'$ satisfying $E(d) = S(d')$ are included in the same sentence. If more sentences than the support contain such a series of dependency relations, we can conclude that the sentences were written by more respondents than the predetermined number. Taking the calculation cost and the degree of freedom of expression into account, we relax the above restriction as follows:

1. First, find the pairs of dependency relations $d, d' \in D$ satisfying $E(d) = S(d')$ which are contained in the same sentence. Let $d \to d'$ denote such a pair of dependency relations (first step).
2. Next, find two pairs $d \to d'$ and $d' \to d''$ in which the dependency relation $d'$ is identical. If such pairs exist, we presume there is a link connecting these pairs (second step).
3. Finally, follow the linkages of such pairs which appear more than $\eta'$ times and reproduce sentences (third step). Here $\eta'$ is the threshold limiting the lower boundary of the number of appearances.


In this method, each of two pairs of dependency relations *d* → *d*′ and *d*′ → *d*′′ contains a common pair of words *E*(*d*) = *S*(*d*′ ) and *E*(*d*′ ) = *S*(*d*′′), which appears in the same sentence. Since the variations of the structures of descriptions related to common opinions in a set of questionnaire data tend to be small, such overlap of words is (at least empirically) sufficient to approximately reproduce sentences summarizing original sentences. (We call this method using the series of the pairs of modification relations the 'dependency-link method'.)
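The pairing and linking steps can be sketched as follows, under the same tuple representation of dependency relations; the function name and signature are our assumptions, not the authors' implementation:

```python
from collections import Counter

def dependency_link(sentences_deps, eta_prime):
    """Dependency-link method sketch: count pairs d -> d' with
    E(d) == S(d') co-occurring in one sentence, keep pairs appearing
    at least eta_prime times, and join pairs on the shared middle d'."""
    pair_count = Counter()
    for deps in sentences_deps:
        for d in deps:
            for d2 in deps:
                if d != d2 and d[1] == d2[0]:  # E(d) == S(d'), same sentence
                    pair_count[(d, d2)] += 1
    frequent = [p for p, c in pair_count.items() if c >= eta_prime]
    # second step: link pairs d -> d' and d' -> d'' sharing the middle relation
    links = [(p, q) for p in frequent for q in frequent if p[1] == q[0]]
    return frequent, links
```

Unlike the word-link method, the co-occurrence of each adjacent pair of relations inside one sentence is checked before linking, which is the relaxation described above.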


In addition, our method helps us find sentences which have similar dependency structures. We usually visualize the result as a graph structure, whose nodes denote modifying or modified words and whose edges denote dependency relationships between the words. We can expect such sentences to be placed in the same graph structure, since they share the same words and similar dependency relations.

## **3. Application**

## **3.1. Analysis on descriptions of dosage regimens in package inserts of medicines [4]**

To prevent medical accidents, such as mix-ups involving medicines, double dosage and insufficient dosage, it is necessary to ensure the proper treatment of the right medicines, namely, 'safety of usage' of medicines.

Fatal accidents occurred in some Japanese hospitals due to mix-ups of a steroid, Saxizon, with a similarly named medicine, Succine, which is a muscle relaxant. There are two conceivable ways to avoid such accidents. One is to prevent the naming and use of medicines resembling other medicines in name, both in appearance and sound. The other is to confirm the medicine by checking the actual usage based on its dosage regimen. Though the former can be realized by utilizing a name checking system provided by the Japan Pharmaceutical Information Center or by making a rule not to adopt medicines with confusing names, the accident is known to have occurred despite the existence of a rule to reject Succine due to its confusing name.

This suggests to us that the latter, namely the confirmation of usage, should be the key to identifying error. Consider the case when a doctor inputs prescription data into a computerized order entry system for medicines. If the system shows him information concerning therapeutic indications, he can subsequently avoid mix-ups of medicines such as the case in question. To enable this, the order entry system requires a database containing information on dosage regimens so that the proper usage can be verified.

As described in the Introduction of this chapter, the structure of the dosage regimen portion of package insert data does not achieve sufficiently fine granularity to enable its effective utilization in a computer system, such as the order entry system mentioned above. In this section, we show a method to find the description patterns of the sentences in the dosage regimen portion of the SGML-formatted package insert data. Based on this result, we also propose a data structure for dosage regimen information, which will be the basis of a drug information database to ensure safe usage.

The target data in this section are the SGML-formatted package insert data of medicines for medical care, which can be downloaded from the PMDA web site. Since we need a list of medicines to retrieve the data, we utilize the standard medicine master data (the version released on September 30, 2007) provided by the Medical Information System Development Center (MEDIS-DC). Using the master data, we obtained 11,685 SGML files, which are our target data.

The dosage regimen portion contains 'detail' elements. They describe information concerning dosage regimens as sentences and are suitable for applying a text mining technique in order to find potential metadata of dosage regimens.


We applied the word-link method to descriptions in 'detail' elements concerning the dosage regimens in each SGML package insert. Since, as a minimum, dosage, administration and adaptation diseases will differ for each medicine, with a considerable scope of expression, our original method, which attempts to find patterns including the use of nouns, might fail to find the common sentences. We thus extend it to determine the tendency for the co-occurrence of nouns and particles (parts of speech which play roles similar to prepositions in English) and extract structural patterns apart from noun variations. The analytical steps are as follows:

1. We retrieve sentences in the 'detail' elements and apply dependency analysis to them.
2. If the segment in the dependency contains a noun, we separate the noun from the segment. The resultant characters are expected to be particles, hence we name them a 'particle candidate' in this paper.
3. We aggregate the nouns that appear in segments including each particle candidate and find the characteristics of the particle candidates in use. We call the part of the segment obtained by removing the particle segment the 'main part of segment'.
4. We replace the found nouns with a symbol such as ' ' in order to mask them, and apply the word-link method. If there are certain rules governing the way particles should be used, this method extracts the common structures of sentences and suggests data items for which descriptions should be converted into a structured data form.
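A hypothetical sketch of this noun-masking analysis, assuming each segment is given as a (surface, main-noun) pair from morphological analysis; the helper names and the masking symbol '□' are our assumptions:

```python
from collections import defaultdict

def particle_candidate(surface, main_noun):
    """Remove the noun from the segment; the remainder is the particle
    candidate (possibly the null character)."""
    return surface[len(main_noun):] if surface.startswith(main_noun) else ""

def aggregate(segments):
    """Group main-part nouns by their particle candidate."""
    groups = defaultdict(list)
    for surface, noun in segments:
        groups[particle_candidate(surface, noun)].append(noun)
    return groups

def mask(surface, noun, symbol="□"):
    """Mask the noun so the word-link method sees only sentence structure."""
    return surface.replace(noun, symbol)
```

The masked surfaces would then be fed to the word-link method to extract structural patterns independent of the noun variations.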



**Figure 1.** The particle candidates of segments included in the 'detail' elements.

Fig. 1 shows the distribution chart of particle candidates with their frequencies. First, we investigate the nature of the nouns involved in the segments containing the particle candidates appearing frequently in the sentences of dosage regimens. Fig. 1 indicates that the particle candidate of more than 50% of the segments is a null character, namely, the segments contain only their main part. Since the targets in Fig. 1 are all segments contained in sentences of dosage regimens, they involve not only nouns but also other parts of speech such as verbs. The particle candidate of a segment whose main word is not a noun is expected to be a null character. In the following analysis, we thus exclude segments whose main word is not a noun.

Fig. 2 shows nouns in the segments whose particle candidate is a null character. This indicates that such segments contain information about units of administration, ' ' (days), ' ' (times), 'mg', the manner of administration, ' ' (arbitrarily), ' ' (usually), and the condition of age such as ' ' (age) and ' ' (adult) and so on.


**Figure 4.** The nouns whose segment has a particle candidate (at/to).(top 20)

Analyses on Text Data Related to the Safety of Drug Use Based on Text Mining Techniques

http://dx.doi.org/10.5772/51195


**Figure 2.** The nouns whose segment has a null character as the particle candidate.


**Figure 3.** The nouns whose segment has a particle candidate . (top 20)

We outline the nouns in the segments that include each particle segment as follows:




**Figure 4.** The nouns whose segment has a particle candidate (at/to). (top 20)


• Fig. 3 shows nouns in the segments that include ' ' as a particle segment. We can see that they express amounts of medication, such as 'mg', ' ' (tablets) and ' ' (titers).

• The nouns in the segments whose particle segment is ' ' (at/to) are shown in Fig. 4, which shows that these particle segments tend to be used with frequency-related words such as ' ' (times) and ' ' (sometimes), with the timing of administration, such as ' ' (inter cibos) and ' ' (before bedtime), and with the administration site, such as ' ' (in a vein).

• The particle segment ' ' (as) is included in segments whose main words are nouns, as shown in Fig. 5. Besides the nouns in the formulaic phrases ' ' (as a rule), '(1) ' (as a daily dosage) and ' ' (as a maintenance dosage), the other nouns shown in the figure represent active ingredients of medicines.

• Fig. 6 shows nouns in the segments that include the particle segment ' ' (for). These are mainly nouns denoting an object person, such as ' ' (adult), ' ' (child) and ' ' (elder person), together with names of symptoms such as ' ' (severe infection) and ' ' (hepatic disease).

• In Fig. 7, segments whose particle candidate is ' ' (depending on) tend to contain the word ' ' (symptom). In this figure we can also read words such as ' ' (body weight), ' ' (age) and ' ' (objective). These results and the meaning of the particle candidate suggest that these segments express the conditions for adjusting a dose.


**Figure 5.** The nouns whose segment has a particle candidate (as). (top 20)


**Figure 6.** The nouns whose segment has a particle candidate (for). (top 20)



**Figure 7.** The nouns whose segment has a particle candidate (depending on). (top 20)



**Figure 8.** The verbs included in 'detail' elements describing dosage regimens. (top 20)

Based on the results shown above, we can identify the tendency of the contents of the segments that include each particle segment. We replaced each segment containing a noun with the symbol ' ' and applied the word-link method to the replaced sentences. Fig. 8 shows the verbs used in the sentences of dosage regimens. To absorb differences in verb expression, we replaced verbs of similar meaning with a representative verb. For instance, the verbs ' ' (dose orally) and ' ' (drip-feed intravenously) have analogous meanings in terms of medication and were hence consolidated into a single verb; in this paper, to enhance comprehension, we consolidated them into ' ' (administrate/use). Moreover, we consolidated the verbs that mean increase or decrease into ' ' (escalate) and replaced the verb ' ' (divide) with ' ' (split).
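The verb consolidation step can be sketched as follows; the synonym table and English verb names are illustrative stand-ins for the actual hand-built Japanese mapping.

```python
# A minimal sketch of verb consolidation, assuming a hypothetical synonym
# table; the mapping in the study was built by hand for Japanese verbs.
REPRESENTATIVE = {
    "dose orally": "administrate",          # consolidated medication verbs
    "drip-feed intravenously": "administrate",
    "increase": "escalate",                 # increase/decrease -> escalate
    "decrease": "escalate",
    "divide": "split",
}

def consolidate(verbs):
    """Replace each verb with its representative form if one is defined."""
    return [REPRESENTATIVE.get(v, v) for v in verbs]

print(consolidate(["dose orally", "divide", "observe"]))
```

A reduced verb vocabulary lets identical sentence structures line up when the word-link method counts co-occurring dependency links.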

**Figure 9.** The result of the word-link method applied to 'detail' elements (the links show co-occurrence more than 1149 times). Blue nodes denote modifying words and red nodes denote modified words.

Following this consolidation, we applied the word-link method and obtained sentence structures based on dependency relationships. Fig. 9 shows the links of dependency relationships appearing more than 1149 times. From this figure, we can read the following contents:

• Increase or decrease according to conditions such as indication (disease) and age (Part A in Fig. 9).

• Dosage based on information concerning the administration site, frequency, object person, symptoms, amount of medication and (the amount of) active ingredients (Part B).

• Daily dosage (Part C) and description of conditions (Part D).


Based on these and the fact that the verbs indicate the method of administration, we can see that the data structure describing dosage regimens needs the following items:

• Indication (disease)
• Object person
• Administration site
• Amount of medication
• Frequency
• Amount of active ingredient
• The way of administration
• Conditions of increase or decrease
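The word-link step itself can be sketched as follows: count dependency links over all sentences, keep those above a frequency threshold, and chain links that share a word into structural patterns. The parsed sentences and the threshold below are hypothetical stand-ins for the package-insert data.

```python
from collections import Counter

# Each sentence is a list of dependency links (modifier -> head), here
# given as English stand-ins for parsed package-insert sentences.
parsed = [
    [("amount", "administrate"), ("daily", "administrate")],
    [("amount", "administrate"), ("symptom", "escalate")],
    [("amount", "administrate"), ("daily", "administrate")],
]

# Count every link and keep only the frequent ones (cf. the ">1149 times"
# threshold used for Fig. 9; here a toy threshold of 2).
links = Counter(link for sent in parsed for link in sent)
threshold = 2
frequent = {l for l, n in links.items() if n >= threshold}

# Chain links sharing a head word into a structural pattern.
patterns = {}
for mod, head in frequent:
    patterns.setdefault(head, set()).add(mod)

print(patterns)
```

Rendering `patterns` as a graph, with modifiers and heads as colored nodes, gives a chart of the kind shown in Fig. 9.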







## **3.2. A questionnaire concerning the therapeutic classification mark printed on a cardiac transdermal patch [5]**

In certain hospitals in Japan, medical accidents have occurred in which patients suffering from lung ailments and patients suffering from heart disease were mixed up and operations were performed without the mistake being noticed. In one known incident, a cardiac transdermal patch had been placed on the body of the heart-disease sufferer when the patients were admitted; had the surgeons known what the patch signified, they would have avoided operating on the wrong patient. To prevent recurrences, the pharmaceutical company marketing the patches voluntarily printed a 'therapeutic classification mark' on them. The 'therapeutic classification mark' is a safety feature linked to the use of the drug and shows that the patch is a cardiac medicine. We applied our method to the free-description part of a questionnaire conducted as a nationwide investigation into the 'therapeutic classification mark' printed on isosorbide dinitrate transdermal patches. The respondents were doctors, pharmacists, nurses and patients; the numbers of respondents and the questions asked are listed in Table 1.

Table 2 lists the resulting sentences for the dependency-link method (*η*′ = 3), where we filled in postpositions and classified the results by respondent and topic. We present only representative sentences in the content columns where many sentences have similar meanings.
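A minimal sketch of this dependency-link summarization, assuming hypothetical English dependency chains in place of the parsed Japanese responses:

```python
from collections import Counter

# Keep dependency chains that occur at least eta_prime times, then join
# their words (with postpositions filled back in, omitted here) to form
# summary sentences. The chains are illustrative stand-ins.
ETA_PRIME = 3
chains = [
    ("patient", "take", "orally"),
    ("patient", "take", "orally"),
    ("patient", "take", "orally"),
    ("nurse", "ask", "usage"),
]

freq = Counter(chains)
summaries = [" ".join(c) for c, n in freq.items() if n >= ETA_PRIME]
print(summaries)
```

Grouping the surviving summaries by respondent and topic then yields a table of the kind shown in Table 2.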



| No. | Question (originally Japanese) | Respondents | Num. of responses |
| --- | --- | --- | --- |
| Q1 | Why did you select the transdermal patch? | Doctors, Pharmacists, Nurses | 737 |
| Q2 | What are the preventive measures to avoid medical accidents related to the transdermal patch? | Doctors, Pharmacists, Nurses | 2115 |
| Q3 | What is your opinion of the cardiac transdermal patch? | Patients | 529 |
| Q4 | Have you ever been asked by patients about the transdermal patch with the therapeutic classification mark on it? | Pharmacists, Nurses | 533 |

**Table 1.** The free-description part of the questionnaire concerning the therapeutic classification mark printed on a cardiac transdermal patch.

| Respondents | A typical sentence (translated) |
| --- | --- |
| Doctors, Pharmacists, Nurses | [It is usable for the patients who have a difficulty in taking the medicine orally.] |
| Doctors, Nurses | [Easy to use.] |
| Doctors | [The number of oral drugs decreases.] |
| Pharmacists | [There are few burdens for patients.] |
| Nurses | [I do not know well.] |
|  | [The medicine works slowly.] / [For the hope/ease of patients] / [Mental effects] |

**Table 2.** The resultant sentences obtained by the dependency-link method for Q1. In this table we show the typical sentences translated into English; the original table also gives example sentences in Japanese obtained by the method.

The table shows that all medical experts prioritized reducing the load of patients as the reason for selecting the transdermal patch, since it could be used by patients who were unable to take medicines orally. In addition, this shows that doctors and nurses focused on the ease of use and that doctors also prioritized the effect of the medicine.

The following is a summary of the results for Q2-Q4 obtained by the dependency-link method:

For Q 2, the result shows that medical experts appreciated the name of the medicine and the therapeutic classification mark printed on the patch in order to prevent medical accidents and considered it necessary to have a space for the date. The doctors also required a patch that was much smaller and that changed color depending on the amount of time having elapsed. The nurses focused on the behavior of patients, while the pharmacists emphasized the widespread need for awareness regarding correct use of the medicine.

The result of Q 3 shows numerous patients' opinions concerning the medicine, skin symptoms, mentality, and the site of the patch. We can also see that patients in their 40s and 50s mainly commented on skin symptoms, although those in their 60s to 80s covered all these opinions. This suggests that the younger generation focused on the functions of the medicine, while older patients focused on other factors, like ease of mind.

For Q 4, we obtained a result showing that patients asked nurses and pharmacists questions about where to place the patch and how to use it. Nurses also asked questions concerning the effect of the medicine, while pharmacists asked about displays on the patch or packaging and when to use it. This suggests that patients expect nurses to tell them about the efficacy of the medicine and pharmacists to tell them about usage.

The result clarifies that opinions differed depending on the viewpoints of the respondents, although they all wanted to use the same medicine safely. This means that it is necessary to collect and analyze the opinions of people from various backgrounds to ensure drugs are used safely.


## **3.3. Incident data related to the safety of drug use [6]**

The target data were reports of medical near-miss cases related to medicines, collected by the surveys of the Japan Council for Quality Health Care, an extra-departmental body of the Japanese Ministry of Health, Labor and Welfare. We analyzed 858 records from the 12th - 14th surveys, whose data attributes are shown in Table 3; we chose these records because they contain free-description data such as 'Background/cause of the incident' and 'Candidates of counter measures'. Applying text mining to such data required the deletion of characters such as symbols and unnecessary line-feed characters. We also had to standardize synonyms, since it is difficult to reduce the diversity of expressions by making respondents use standard terms. For this reason, we standardized the words using a dictionary prepared for this analysis.

Day of the week; Weekday or holiday; Time; Place; Department; Content of incident; Psychosomatic state of the patient; Job title; Experience (year/month); Affiliation (year/month); Medical benefit class; Nonproprietary name; Name of wrong drug; Dosage form of wrong drug; Effect of wrong drug; Name of right drug; Dosage form of right drug; Medical benefit of right drug; Discussed cause; Concrete descriptions of the incident; Background/cause of the incidents; Candidates of counter measures; Comment

**Table 3.** Data attributes of the records collected by the 12th - 14th surveys.



#### *3.3.1. Background/cause of incidents*


We applied the word-link method to the data in the field 'background/cause of incidents' in order to determine concrete information concerning the causes of incidents. The method was applied by occupation to determine how the backgrounds and causes of incidents differ depending on the job title. We fixed the value of each *η* so as to make the resultant graph understandable. Fig. 10 and Fig. 11 show the results for nurses' and pharmacists' comments, respectively. Both figures contain the common opinions 'the problem of the checking system of the protocol and the rule' (A) and 'confirmation is insufficient' (B), while nurses point out 'the systematic problem of communication' (C) and pharmacists 'the problem of adoption of medicines' (C'). We can see that, though B arises due to individual faults, A, C and C' are systematic problems.

**Figure 10.** The backgrounds and causes of incidents caused by nurses (*η* = 4).

**Figure 11.** The backgrounds and causes of incidents caused by pharmacists (*η* = 3).

#### *3.3.2. Countermeasures*

We applied the word-link method to the field 'Candidates of countermeasures' to summarize the nurses' and pharmacists' opinions concerning countermeasures to prevent incidents. Fig. 12 summarizes the countermeasures described by nurses and suggests that many opinions state that it is necessary to 'instruct to confirm and check', 'make a speech' and 'ensure confirmation'. Fig. 13 summarizes the countermeasures proposed by pharmacists: besides confirmation and auditing, it is also necessary to attract (pharmacists') attention and to devise ways of displaying medicines, such as labels.


Comparing both results, except for the pharmacists' opinion concerning the innovation of labels, only a few opinions address countermeasures related to the system of the medical scenario. This suggests that medical experts such as nurses and pharmacists tend to look for solutions to problems within themselves. To solve the structural problems of medical situations, it is important not only to promote the efforts of each medical expert, but also to strive to improve the organization to which they belong. It is also desirable for them to be aware of the importance of organizational innovation and to combat systematic error.

**Figure 12.** The countermeasures of incidents caused by nurses. (*η* = 5)

**Figure 13.** The countermeasures of incidents caused by pharmacists. (*η* = 4)

## **4. Discussion**

#### **4.1. Methods**


The three analyses suggest that our method can be a powerful tool for extracting the parts of sentences that commonly appear in the original sentences. The target data have been Japanese sentences; let us discuss whether our method is applicable to data in another language, English. As we introduced in Section 2.1, the word-link method and the dependency-link method utilize dependency relationships in the target sentences. One of the representative dependency parsers for English sentences is the Stanford parser [7–9], which provides the dependency relationships in Stanford Dependencies format. In principle, it enables us to perform our method.

The differences between Japanese and English data come from the following:

• Directions of dependency relationships. The dependency relationships in a Japanese sentence always have a forward direction, whereas the relationships in an English sentence can have both forward and backward directions. Let us show an example that illustrates this. The Japanese sentence ' ' corresponds to the English sentence 'John talked to Taro'. In both sentences there exist the dependency relationships " (John) → (talk)" and " (Taro) → (talk)". We should note that both ' ' (John) and ' ' (Taro) also appear prior to the verb ' ' (talk) in the Japanese sentence. This coincidence of order helps us to suggest the sentences that frequently appear in the original data.<sup>1</sup> However, in the English sentence, the noun 'Taro' follows the verb 'talked'. Though this helps to distinguish a subject from an object, it does not preserve the order in which the words appear in the original sentences. Because of this, for the dependency relationship between an object and a verb, we should swap their order (e.g. (talk) → (Taro)) to reproduce summarizing sentences.

• Treatment of relative pronouns. In English sentences, we frequently use relative pronouns. This essentially requires reference resolution to identify the antecedent that is modified by the relative pronoun. Reference resolution often requires the semantics of words and the knowledge related to them; because of this, finding the right antecedent is currently a difficult problem. In contrast, the Japanese language does not have relative pronouns: the relationship between a relative clause and its antecedent is expressed through normal modification relationships. Therefore, Japanese sentences do not cause the difficulty that originates from relative pronouns.

• Zero pronouns. In Japanese, we often omit the subject of a sentence; such an omission is usually called a 'zero pronoun'. In contrast, the subject of an English sentence is seldom omitted. This fact tells us that we can expect patterns that include subjects in English sentences; if there are only patterns without subjects, this indicates that no definite subjects appear in the target sentences. For Japanese sentences, however, we cannot necessarily obtain information about subjects and may have to guess them based on the semantics of the words included in the obtained patterns.
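Under these observations, adapting English dependency output to our forward-order links might be sketched as follows; the relation labels follow Stanford Dependencies conventions, and `to_links` is a hypothetical helper, not part of the parser.

```python
# Sketch: adapting English dependency triples (relation, head, dependent)
# to the forward-order links our method expects. For object relations we
# emit head -> dependent so the reproduced summary keeps English word
# order; otherwise dependent -> head, as in the Japanese case.
def to_links(triples):
    links = []
    for rel, head, dep in triples:
        if rel in ("dobj", "obj", "iobj"):
            links.append((head, dep))   # verb precedes its object in English
        else:
            links.append((dep, head))   # modifier precedes its head
    return links

triples = [("nsubj", "talked", "John"), ("obj", "talked", "Taro")]
print(to_links(triples))
```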


<sup>1</sup> Of course, if you need to distinguish which is a subject or an object, you should focus on particles as we did in Section 3.1.

## **4.2. Application**

In this subsection, let us briefly review related works and discuss text mining applied to the description data related to medical safety.


## *4.2.1. Package inserts*

The U.S. Food and Drug Administration [14] also defines a specification for a package insert document markup standard, Structured Product Labeling (SPL), which is similar to the SGML-formatted package inserts disclosed by PMDA. Thus, in this chapter, we identify SPL documents with package inserts.

Recently, several studies have emerged that analyze the descriptions in drug package inserts. Let us review some of them.

Duke et al. [10, 11] developed a tool, SPLICER, which utilized natural language processing to extract information from SPLs. It parses SPL by identification of target parts, removal of XML tags and extraction of terms. It also identify sysnonymns of the extracted terms by mapping them to medical dictionary, MedDRA. In their study, they applied their tool to quantitatively show the "overwarning" of adverse events in the package inserts of newer and more commonly prescribed drugs. They also showed that recent FDA guide lines do not succeed in reducing overwarning.

Bisgin et al. [12] applied a text mining method, topic modeling, to package insert data. A topic modeling method, latent Dirichlet allocation (LDA), explores the probabilistic patterns of 'topics', implicitly expressed by words in documents. They identified topics corresponding to adverse events or therapeutic application. This enabled them to identify potential adverse events that might arise from specific drugs.

Richard et al. [13] applied machine learning techniques to package insert data. It is a trial to automatically identify pharmacokinetic drug-drug interaction based on unstructured data. They created a corpus of package inserts, which is manually annotated by a pharmacist and a drug information expert. Using the corpus data as a training set, they evaluated the accuracy of identification and obtained F-measure of 0.8-0.9.

The number of the studies that deal with adverse events seems to be much more than the ones that deal with safety of drug usage. For the purpose of finding adverse events, package inserts are just one of text sources. Other sources are academic papers or Medline abstracts. We expect that there emerge more studies from the various viewpoint of safety to utilize package insert data.

## *4.2.2. Questionnaire data*

There are many studies where text mining approach is applied to questionnaire data. However, as for application in the area of medication, there are only a few studies. This might be because analysts tend to take a traditional approach, manual reading, because it captures the written information more precisely than text mining. However, it is obviously time and cost consuming.

Suzuki et al. [15] applied a text mining technique to questionnaire data about clinical practice pre-education conducted to pharmacists, providers of clinical practices. Their method was correspondence analysis between keywords appearing in sentences and attributes of respondents, such as a type of their affiliation and their profession. As a result, they obtained the tendency that mentors in hospitals feel anxious about mismatch between learning contents and real situation.

#### *4.2.3. Medical incident data*

18

**4.2. Application**

*4.2.1. Package inserts*

with package inserts.

package insert data.

*4.2.2. Questionnaire data*

time and cost consuming.

Let us review some of them.

succeed in reducing overwarning.

events that might arise from specific drugs.

of identification and obtained F-measure of 0.8-0.9.

description data related to medical safety.

In this subsection, let us briefly review related works and discuss text mining applied to the

U.S. Food and Drug Administration [14] also defines a specification of a package insert document markup standard, Structured Product Labeling (SPL), and . This is similar to SGML formatted package inserts disclosed by PMDA. Thus, in this chapter, we identify SPL

Recently, there emerge several studies which analyze descriptions in drug package inserts.

Duke et al. [10, 11] developed a tool, SPLICER, which utilized natural language processing to extract information from SPLs. It parses SPL by identification of target parts, removal of XML tags and extraction of terms. It also identify sysnonymns of the extracted terms by mapping them to medical dictionary, MedDRA. In their study, they applied their tool to quantitatively show the "overwarning" of adverse events in the package inserts of newer and more commonly prescribed drugs. They also showed that recent FDA guide lines do not

Bisgin et al. [12] applied a text mining method, topic modeling, to package insert data. A topic modeling method, latent Dirichlet allocation (LDA), explores the probabilistic patterns of 'topics', implicitly expressed by words in documents. They identified topics corresponding to adverse events or therapeutic application. This enabled them to identify potential adverse

Richard et al. [13] applied machine learning techniques to package insert data. It is a trial to automatically identify pharmacokinetic drug-drug interaction based on unstructured data. They created a corpus of package inserts, which is manually annotated by a pharmacist and a drug information expert. Using the corpus data as a training set, they evaluated the accuracy

The number of the studies that deal with adverse events seems to be much more than the ones that deal with safety of drug usage. For the purpose of finding adverse events, package inserts are just one of text sources. Other sources are academic papers or Medline abstracts. We expect that there emerge more studies from the various viewpoint of safety to utilize

There are many studies where text mining approach is applied to questionnaire data. However, as for application in the area of medication, there are only a few studies. This might be because analysts tend to take a traditional approach, manual reading, because it captures the written information more precisely than text mining. However, it is obviously

Suzuki et al. [15] applied a text mining technique to questionnaire data about clinical practice pre-education conducted to pharmacists, providers of clinical practices. Their method Malpractice reduction is one of important themes of medical safety. A lot of governments or institutions construct incident reporting system and analyze the collected report data to find knowledge therein.

Kawanaka et al. [16, 17] utilized a Self-Organizing Map (SOM) to make a map expressing the relationships among the sentences in incident report data. They calculated the co-occurrence probability of keywords in sentences and defined a characteristic vector for each keyword. They also defined a vector characterizing each report by summing up the vectors whose corresponding keywords appear in it, and input the vector for each report to the SOM algorithm. As a result, they found two clusters of reports, the first of which was summarized as "forgetting to record in the medication note" and the second as "forgetting to administer the medicine taken before sleep". Based on this technique, they also proposed an incident report analysis system.
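The vectorization and mapping steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sample report keywords, the map size and the 1-D self-organizing map with a fixed Gaussian neighbourhood are all assumptions made for the example.

```python
import numpy as np

# Hypothetical tokenized incident reports (keywords only); the actual
# study [16, 17] extracted keywords from free-text incident reports.
reports = [
    ["forget", "record", "medication", "note"],
    ["forget", "note", "medication"],
    ["forget", "administer", "medicine", "sleep"],
    ["medicine", "sleep", "administer"],
]
vocab = sorted({w for r in reports for w in r})
idx = {w: i for i, w in enumerate(vocab)}

# Characteristic vector for each keyword: co-occurrence counts with every
# other keyword across reports, row-normalized to probabilities.
co = np.zeros((len(vocab), len(vocab)))
for r in reports:
    for a in r:
        for b in r:
            if a != b:
                co[idx[a], idx[b]] += 1
keyword_vec = co / co.sum(axis=1, keepdims=True)

# A report vector is the sum of the vectors of the keywords it contains.
report_vecs = np.array([sum(keyword_vec[idx[w]] for w in r) for r in reports])

# Minimal 1-D self-organizing map: nodes compete for each report vector,
# and the winner (plus its neighbours) is pulled toward it.
rng = np.random.default_rng(0)
nodes = rng.random((4, len(vocab)))            # 4 map nodes
for t in range(200):
    lr = 0.5 * (1 - t / 200)                   # decaying learning rate
    for v in report_vecs:
        win = np.argmin(np.linalg.norm(nodes - v, axis=1))
        for j in range(len(nodes)):
            h = np.exp(-((j - win) ** 2) / 2)  # neighbourhood kernel
            nodes[j] += lr * h * (v - nodes[j])

# Reports mapped to the same node form a cluster.
clusters = [int(np.argmin(np.linalg.norm(nodes - v, axis=1))) for v in report_vecs]
print(clusters)
```

Reports that share many co-occurring keywords end up near the same map node, which is the basis for summarizing each resulting cluster.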

Baba et al. [18] proposed a method to analyze the co-occurrence relations of the words that appear in medical incident reports using a concept lattice.

Classification is a starting point for analyzing incident reports. Empirically, the incident types seem to obey Zipf's law. This makes it difficult to classify reports by naive application of clustering algorithms, because they generate either too many small clusters or one very large cluster. If we target major incidents, the better strategy for understanding reports is to focus on relatively large clusters and to summarize the reports in them. However, one should also note that there exist important but less frequently occurring cases. It is therefore desirable to introduce a parameter to measure importance and to use it to narrow down the clusters to focus on.
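The selection strategy above can be sketched as follows. The cluster labels, sizes and severity weights are hypothetical, and the product of cluster size and severity is just one plausible choice of importance parameter, not a measure proposed in the studies reviewed here.

```python
# Hypothetical clusters of incident reports: (cluster label, number of
# reports, average severity weight assigned by safety staff).
clusters = [
    ("wrong dose", 120, 0.6),
    ("wrong patient", 45, 0.9),
    ("forgot administration", 300, 0.3),
    ("mislabeled syringe", 4, 1.0),
    ("delayed report", 7, 0.2),
]

# Rank clusters by an importance score that combines frequency with a
# severity parameter, so that rare but critical cases are not discarded
# purely because of a Zipf-like size distribution.
def importance(size: int, severity: float) -> float:
    return size * severity

ranked = sorted(clusters, key=lambda c: importance(c[1], c[2]), reverse=True)
top = [name for name, *_ in ranked[:3]]
print(top)  # the clusters to summarize first
```

With only the raw size as a criterion, the "mislabeled syringe" cluster would always be dropped; the severity weight lets an analyst tune how much rare-but-critical cases count.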

All of the above studies suggest that text mining studies tend to focus on words rather than syntactic structures. Remember that statistical approaches and data mining assume table-type structured data. This might be the reason why it is more difficult to analyze syntactic structures than words. However, as Richard et al. pointed out in stressing the importance of syntactic information [13], syntactic structures contain much richer information than a mere collection of words. They also provide easier interpretation of the results. This is the basis of the strategy of our method.

**5. Conclusion**

In this chapter, we introduced a text mining method to analyze text data such as documents and questionnaire responses, and reviewed the studies in which we used the method.

Our method utilizes the syntactic information of the target sentences. We extract dependency relations from each sentence and restrict them to those that appear more often than a frequency threshold. Connecting common words in the resulting dependencies produces patterns that contain the frequently appearing portions of the sentences. We reviewed the studies in which we applied the method to drug package inserts, questionnaire data and medical incident reports. We discussed the points to consider when applying our method to English sentences. We also introduced related works and discussed their tendencies.
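The pipeline described above can be sketched as follows. This is only an illustrative sketch under two assumptions: the (dependent, head) pairs are taken as already produced by a parser, and a pattern is represented as a simple chain of frequent pairs linked through shared words.

```python
from collections import Counter

# Hypothetical dependency pairs (dependent, head) already extracted by a
# parser from a small set of sentences.
sentences = [
    [("take", "medicine"), ("before", "take"), ("sleep", "before")],
    [("take", "medicine"), ("before", "take"), ("meal", "before")],
    [("take", "medicine"), ("with", "take"), ("water", "with")],
    [("avoid", "alcohol")],
]

# Keep only the dependency relations that reach the frequency threshold.
THRESHOLD = 2
freq = Counter(rel for s in sentences for rel in s)
frequent = {rel for rel, n in freq.items() if n >= THRESHOLD}

# Connect frequent relations that share a word: if (a, b) and (b, c) are
# both frequent, chain them into the pattern a -> b -> c.  For simplicity
# this sketch assumes each dependent has a single frequent head.
def chain(pairs):
    head_of = dict(pairs)                      # dependent word -> head word
    heads = set(head_of.values())
    patterns = []
    # start from words that are not the head of any frequent relation
    for start in (d for d in head_of if d not in heads):
        path = [start]
        while path[-1] in head_of:
            path.append(head_of[path[-1]])
        patterns.append(" -> ".join(path))
    return patterns

print(chain(frequent))
```

Here only ("take", "medicine") and ("before", "take") pass the threshold, so the single extracted pattern is the frequently appearing sentence portion "before -> take -> medicine".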

Analyses on Text Data Related to the Safety of Drug Use Based on Text Mining Techniques
http://dx.doi.org/10.5772/51195

Though an analysis of medical safety data is important, most of the data remain unanalyzed. We expect that text mining techniques will not only be developed further but also applied to medical safety data.

**Author details**

Masaomi Kimura

Shibaura Institute of Technology, Japan

**6. References**

[1] Matsuzawa, H. (2001) Mining Structured Association Patterns from Large Databases, *Transactions of Information Processing Society of Japan*, Vol.42, No.SIG 8 (TOD 10), pp.21-35.

[2] Kudo, T.; Yamamoto, K.; Tsuboi, Y.; Matsumoto, Y. (2002) Mining Syntactic Structures from Text Database, *IPSJ SIG Notes. ICS*, Vol.2002, No.45 (20020523), pp.139-144.

[3] Kimura, M. (2009) The Method to Analyze Freely Described Data from Questionnaires, *Journal of Advanced Computational Intelligence and Intelligent Informatics*, Vol.13, No.3, pp.268-274.

[4] Kimura, M.; Okada, K.; Nabeta, K.; Ohkura, M.; Tsuchiya, F. (2009) Analysis on Descriptions of Dosage Regimens in Package Inserts of Medicines, In: *Human Interface and the Management of Information. Information and Interaction*, Vol.5618, pp.539-548.

[5] Kimura, M.; Furukawa, H.; Tsukamoto, H.; Tasaki, H.; Kuga, M.; Ohkura, M.; Tsuchiya, F. (2005) Analysis of Questionnaires Regarding Safety of Drug Use, Application of Text Mining to Free Description Questionnaires, *The Japanese Journal of Ergonomics*, Vol.41, No.5, pp.297-305.

[6] Kimura, M.; Tatsuno, K.; Hayasaka, T.; Takahashi, Y.; Aoto, T.; Ohkura, M.; Tsuchiya, F. (2007) The Analysis of Near-Miss Cases Using Data-Mining Approach, In: *Human-Computer Interaction. HCI Applications and Services*, pp.474-483, Beijing.

[7] Klein, D. & Manning, C. (2003a) Accurate Unlexicalized Parsing, *Proceedings of the 41st Meeting of the Association for Computational Linguistics*, pp.423-430.

[8] Klein, D. & Manning, C. (2003b) Fast Exact Inference with a Factored Model for Natural Language Parsing, In: *Advances in Neural Information Processing Systems 15*, Cambridge, MA: MIT Press, pp.3-10.

[9] de Marneffe, M.C.; MacCartney, B.; Manning, C. (2006) Generating Typed Dependency Parses from Phrase Structure Parses, In: *LREC*.

[10] Duke, J. & Friedlin, J. (2010) ADESSA: A Real-Time Decision Support Service for Delivery of Semantically Coded Adverse Drug Event Data, *AMIA Annu Symp Proc. 2010*, pp.177-181.

[11] Duke, J.; Friedlin, J.; Ryan, P. (2011) A Quantitative Analysis of Adverse Events and Overwarning in Drug Labeling, *Arch Intern Med*, Vol.171, No.10, pp.944-946.

[12] Bisgin, H.; Liu, Z.; Fang, H.; Xu, X.; Tong, W. (2011) Mining FDA drug labels using an unsupervised learning technique - topic modeling, *BMC Bioinformatics*, 12 (Suppl 10), S11.

[13] Richard, B.; Gregory, G.; Henk, H. (2012) Using Natural Language Processing to Extract Drug-Drug Interaction Information from Package Inserts, *Proceedings of the 2012 Workshop on Biomedical Natural Language Processing*, pp.206-213, Montréal, Canada, Association for Computational Linguistics.

[14] FDA (2008) Structured Product Labeling Resources, http://www.fda.gov/ForIndustry/DataStandards/.

[15] Suzuki, S.; Koinuma, M.; Hidaka, Y.; Koike, K.; Nakamura, H. (2009) The Consciousness Research and Analysis on the Directive Pharmacists Who Provide Pre-education Prior to Clinical Practice: An Effort in the College of Pharmacy, Nihon University, *YAKUGAKU ZASSHI*, Vol.129, No.9, pp.1103-1112.

[16] Otani, Y.; Kawanaka, H.; Yoshikawa, T.; Yamamoto, K.; Shinogi, T.; Tsuruoka, S. (2005) Keyword Extraction from Incident Reports and Keyword Map Generation Method Using Self Organizing Map, *Proceedings of IEEE International Conference on Systems, Man and Cybernetics 2005*, pp.1030-1035.

[17] Kawanaka, H.; Otani, Y.; Yamamoto, K.; Shinogi, T.; Tsuruoka, S. (2007) Tendency Discovery from Incident Report Map Generated by Self Organizing Map and its Development, *Proceedings of IEEE International Conference on Systems, Man and Cybernetics 2007*, pp.2016-2021.

[18] Baba, T.; Liu, L.; Hirokawa, S. (2010) Formal Concept Analysis of Medical Incident Reports, *KES 2010*, Part III, LNAI 6278, pp.207-214.


**Chapter 8**

**Biomedical Named Entity Recognition: A Survey of Machine-Learning Tools**

David Campos, Sérgio Matos and José Luís Oliveira

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51066

© 2012 Campos et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

**1. Introduction**

It is well known that the rapid growth and dissemination of the Internet has resulted in huge amounts of information being generated and shared, available in the form of textual data, images, videos or sounds. This overwhelming surge of data is also true for specific areas such as biomedicine, where the number of published documents, such as articles, books and technical reports, is increasing exponentially. For instance, the MEDLINE literature database contains over 20 million references to journal papers, covering a wide range of biomedical fields. In order to organize and manage these data, several manual curation efforts have been set up to identify, in texts, information regarding entities (e.g. genes and proteins) and their relations (e.g. protein-protein interactions). The extracted information is stored in structured knowledge resources, such as Swiss-Prot [1] and GenBank [2]. However, the effort required to continually update these databases makes this a very demanding and expensive task, naturally leading to increasing interest in the application of Text Mining (TM) systems to help perform those tasks.

One major focus of TM research has been on Named Entity Recognition (NER), a crucial initial step in information extraction, aimed at identifying chunks of text that refer to specific entities of interest, such as gene, protein, drug and disease names. Such systems can be integrated in larger biomedical Information Extraction (IE) pipelines, which may use the automatically extracted names to perform other tasks, such as relation extraction, classification and/or topic modeling. However, biomedical names have various characteristics that may make their recognition in texts difficult [3]:

**•** Many entity names are descriptive (e.g. "normal thymic epithelial cells");
