Preface


Due to the growth of computer and web technologies, we can easily collect and store large amounts of text data, and we can expect that the data contain useful knowledge. Text mining techniques for extracting such knowledge have been studied intensively since the late 1990s. Even though many important techniques have been developed, the text mining research field continues to expand to meet needs arising from various application fields. This book introduces representative work on the latest trends in advanced text mining techniques. It is composed of nine chapters.

In Chapter 1, Sung-Pil Choi et al. survey methods that extract relations between keywords (or key phrases) described in texts. The chapter focuses on kernel-based relation extraction methods and introduces some representative methods, which are compared in terms of their characteristics and performance. The chapter summarizes their advantages and disadvantages.

In Chapter 2, Hidenao Abe focuses on a method that extracts temporal patterns of terms from documents published over time. The method combines automatic term extraction, term importance indices, and temporal clustering. Medical research documents related to migraine drug therapy are analyzed with the method, and the validity of the extracted temporal patterns is evaluated.

In Chapter 3, Alan L. Porter et al. introduce a stepwise method that compiles lists of informative terms and phrases. The method includes field selection, removal of stop words and common terms, term manipulation, cleaning and removal of noise terms, and term consolidation. They apply it to Science, Technology & Innovation (ST&I) information sets and present case results for each process.

In Chapter 4, Alessio Leoncini et al. tackle two issues required for the analysis of web pages: text summarization and page segmentation. Semantic networks, which map natural language into an abstract representation, are introduced for these issues. Their effectiveness is shown by applying them to a topic extraction task on a benchmark data set.

In Chapter 5, Hiep Luong et al. present an acquisition method for domain-specific ontologies. The method uses two key techniques. One is lexical expansion based on WordNet, which extracts new vocabulary words from data sources. The other is text mining from domain-specific literature, which enriches the concepts of the words. Experimental results on an amphibian morphology ontology show the validity of the acquired ontology.

In Chapter 6, Hidetsugu Nanba et al. overview the latest research and services related to the automatic compilation of travel information. In particular, they review the automatic construction of databases of travel information, travelers' behavior analysis, recommendation of travel information, interfaces for accessing travel information, and available linguistic resources.


In Chapter 7, Masaomi Kimura addresses the task of using drugs safely based on text mining techniques. The word-link method and the dependency-link method are introduced for this task. The chapter shows the results of applying them to three data sets: package inserts for medical care, a questionnaire on the therapeutic classification marks printed on patches, and reports of medical near-miss cases related to medicines.

In Chapter 8, David Campos et al. survey machine learning-based tools for biomedical named entity recognition. First, the sub-processes that compose the tools, and the techniques for those sub-processes, are introduced. Next, representative tools for gene and protein names, chemical names, and disorder names are introduced and compared in terms of their resources, techniques, and performance results.

In Chapter 9, Fadoua Ataa Allah and Siham Boulaknadel present strategies for enhancing under-resourced or less-resourced languages. In particular, the chapter focuses on issues related to the Amazigh language, a branch of the Afro-Asiatic languages spoken in the northern part of Africa. It introduces the linguistic features, the problems arising in computational language processing, and the primary experiments.

I believe that these chapters will provide new knowledge in text mining fields, and that they will help many readers open up their research fields.

Lastly, I would like to express my sincere gratitude to the book project team at InTech. In particular, I would like to thank Publishing Process Managers Adriana Pecar and Marina Jozipovic, who guided me step by step and supported me throughout this project. In addition, my gratitude goes to my wife, Sachiko Sakurai, for her constant support and encouragement.

> **Shigeaki Sakurai** Toshiba Corporation & Tokyo Institute of Technology, Japan

**Chapter 1**

## **Survey on Kernel-Based Relation Extraction**


Hanmin Jung, Sung-Pil Choi, Seungwoo Lee and Sa-Kwang Song

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51005

## **1. Introduction**

Relation extraction refers to methods for the efficient detection and identification of predefined semantic relationships within a set of entities in text documents (Zelenko, Aone, & Richardella, 2003; Zhang, Zhou, & Aiti, 2008). The importance of this task was first recognized at the Message Understanding Conference (MUC, 2001), which was held from 1987 to 1997 under the supervision of DARPA<sup>1</sup>. After that, the Automatic Content Extraction (ACE, 2009) workshop, promoted by NIST<sup>2</sup> as a new project from 1999 to 2008, facilitated numerous studies. The workshop is currently held every year and is the world's foremost forum for the comparison and evaluation of new technologies in information extraction, such as named entity recognition, relation extraction, event extraction, and temporal information extraction. It is now conducted as a track of the Text Analysis Conference (TAC, 2012), which is under the supervision of NIST.

According to ACE, an entity in a text is a representation that names a real-world object. Exemplary entities include the names of persons, locations, facilities, and organizations. A sentence containing these entities can express the semantic relationships between them. For example, in the sentence "*President Clinton was in Washington today,*" there is a "*Located*" relation between "*Clinton*" and "*Washington*". From the sentence "*Steve Ballmer, CEO of Microsoft, said…*" the relation "*Role* (*CEO*, *Microsoft*)" can be extracted.
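As an illustrative sketch (not code from the chapter), the ACE-style entities and the relation from the example sentence might be represented as follows; the `Entity` and `Relation` classes and the character offsets are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical minimal representation of ACE-style entities and relations.
@dataclass(frozen=True)
class Entity:
    text: str
    etype: str   # e.g. "PERSON", "GPE" (geo-political entity)
    start: int   # character offset in the sentence
    end: int

@dataclass(frozen=True)
class Relation:
    rtype: str   # e.g. "Located", "Role"
    arg1: Entity
    arg2: Entity

sentence = "President Clinton was in Washington today."
clinton = Entity("Clinton", "PERSON", 10, 17)
washington = Entity("Washington", "GPE", 25, 35)
located = Relation("Located", clinton, washington)

print(f"{located.rtype}({located.arg1.text}, {located.arg2.text})")
# prints: Located(Clinton, Washington)
```

A relation extraction system's job is to recover such `Relation` tuples automatically from the raw sentence.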

Many relation extraction techniques have been developed in the framework of the various technological workshops mentioned above. Most relation extraction methods developed so far are based on supervised learning, which requires learning collections. These methods are classified into feature-based methods, semi-supervised learning methods, bootstrapping methods, and kernel-based methods (Bach & Badaskar, 2007; Choi, Jeong, Choi, & Myaeng, 2009). Feature-based methods rely on classification models for automatically specifying the category to which a relevant feature vector belongs. Here, surrounding contextual features are used to identify the semantic relation between two entities in a specific sentence and represent it as a feature vector. The major drawback of supervised learning-based methods, however, is that they require learning collections. Semi-supervised learning and bootstrapping methods, on the other hand, use large corpora or web documents, starting from reduced learning collections that are progressively expanded to overcome this disadvantage. Kernel-based methods (Collins & Duffy, 2001), in turn, devise kernel functions that are most appropriate for relation extraction and apply them for learning, in the form of a kernel set optimized for syntactic analysis and part-of-speech tagging. The kernel function itself is used for measuring the similarity between two instances, which are the main objects of machine learning. General kernel-based models are discussed in detail in Section 3.

<sup>1</sup> Defense Advanced Research Projects Agency of the U.S.

<sup>2</sup> National Institute of Standards and Technology of the U.S.

<sup>© 2012</sup> Jung et al.; licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


As one representative feature-based approach, Kambhatla (2004) combines various types of lexical, syntactic, and semantic features required for relation extraction using a maximum entropy model. Although based on the same type of composite features as those proposed by Kambhatla (2004), Zhou, Su, Zhang, and Zhang (2005) make use of support vector machines for relation extraction, which allows flexible kernel combination. Zhao and Grishman (2005) classified all features available at that point in time in order to create individual linear kernels, and attempted relation extraction using composite kernels made of those individual linear kernels. Most feature-based methods aim at applying feature engineering algorithms to select optimal features for relation extraction, and their application of syntactic structures was very limited.
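The kind of contextual features that such feature-based methods combine can be sketched as follows; the feature templates and the toy sentence are illustrative assumptions, and a real system would feed these sparse dicts into a maximum entropy or SVM classifier:

```python
# Minimal sketch of contextual feature extraction for an entity pair,
# in the spirit of feature-based methods such as Kambhatla (2004).
def extract_features(tokens, e1_idx, e2_idx, e1_type, e2_type):
    """Build a sparse feature dict for the entity pair (e1, e2)."""
    lo, hi = sorted((e1_idx, e2_idx))
    features = {
        "e1_type=" + e1_type: 1.0,
        "e2_type=" + e2_type: 1.0,
        "type_pair=" + e1_type + "_" + e2_type: 1.0,
    }
    # Words between the two entities (a strong contextual signal).
    for tok in tokens[lo + 1:hi]:
        features["between=" + tok.lower()] = 1.0
    # One token of left and right context around the pair.
    if lo > 0:
        features["before=" + tokens[lo - 1].lower()] = 1.0
    if hi + 1 < len(tokens):
        features["after=" + tokens[hi + 1].lower()] = 1.0
    return features

tokens = "President Clinton was in Washington today .".split()
feats = extract_features(tokens, 1, 4, "PERSON", "GPE")
print(sorted(feats))
```

Each resulting key becomes one dimension of the feature vector on which the classifier is trained.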

Exemplary semi-supervised learning and bootstrapping methods are Snowball (Agichtein & Gravano, 2000) and DIPRE (Brin, 1999). They rely on small learning collections, using bootstrapping methods similar to the Yarowsky algorithm (Yarowsky, 1995) to gather various syntactic patterns that denote relations between two entities in a large web-based text corpus. Recent developments include the KnowItAll (Etzioni, et al., 2005) and TextRunner (Yates, et al., 2007) methods for automatically collecting lexical patterns of target relations and entity pairs based on ample web resources. Although this approach does not require large learning collections, its disadvantages are that many incorrect patterns are detected as the pattern collections expand, and that only one relation can be handled at a time.
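A toy, DIPRE-flavored bootstrapping loop might look like the following; the mini-corpus, the seed pair, and the single-pattern matching are invented for illustration and gloss over the confidence scoring that real systems need to filter incorrect patterns:

```python
import re

# Invented mini-corpus; a real system would crawl the web.
corpus = [
    "Seoul is the capital of Korea",
    "Paris is the capital of France",
    "Tokyo is the capital of Japan",
]
seed_pairs = {("Seoul", "Korea")}

def induce_patterns(corpus, pairs):
    """Turn each seed-pair occurrence into a middle-context lexical pattern."""
    patterns = set()
    for sent in corpus:
        for a, b in pairs:
            if a in sent and b in sent:
                middle = sent.split(a, 1)[1].rsplit(b, 1)[0]
                patterns.add(middle)
    return patterns

def harvest_pairs(corpus, patterns):
    """Apply each pattern to harvest new entity pairs from the corpus."""
    found = set()
    for sent in corpus:
        for pat in patterns:
            m = re.match(r"(\w+)" + re.escape(pat) + r"(\w+)$", sent)
            if m:
                found.add((m.group(1), m.group(2)))
    return found

patterns = induce_patterns(corpus, seed_pairs)   # {" is the capital of "}
print(harvest_pairs(corpus, patterns))
# the bootstrapped set now also contains (Paris, France) and (Tokyo, Japan)
```

Iterating the two steps (pairs induce patterns, patterns harvest pairs) is the essence of bootstrapping.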

Kernel-based relation extraction methods were first attempted by Zelenko et al. (2003), who devised contiguous subtree kernels and sparse subtree kernels for recursively measuring the similarity of two parse trees, and applied them to binary relation extraction with relatively high performance. After that, a variety of kernel functions for relation extraction were suggested, e.g., dependency parse tree kernels (Culotta and Sorensen, 2004), convolution parse tree kernels (Zhang, Zhang and Su, 2006), and composite kernels (Choi et al., 2009; Zhang, Zhang, Su and Zhou, 2006), which show even better performance.
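The idea of recursively comparing parse trees can be sketched with a toy node-matching similarity; this is an illustrative simplification, not the exact contiguous or sparse subtree kernels of Zelenko et al. (2003):

```python
# Toy recursive similarity on nested-tuple parse trees: count matching
# node labels, recursing over positionally aligned children.
def tree_kernel(t1, t2):
    label1, children1 = t1[0], t1[1:]
    label2, children2 = t2[0], t2[1:]
    if label1 != label2:
        return 0.0
    score = 1.0  # the two roots match
    for c1, c2 in zip(children1, children2):
        score += tree_kernel(c1, c2)
    return score

# ("S", ("NP", ...), ("VP", ...)) encodes a tiny parse tree.
t_a = ("S", ("NP", ("PER",)), ("VP", ("V",), ("NP", ("LOC",))))
t_b = ("S", ("NP", ("PER",)), ("VP", ("V",), ("NP", ("ORG",))))
print(tree_kernel(t_a, t_b))
# 6.0: every node matches except the LOC/ORG leaf
```

Two parse trees with similar structure score high even though no explicit feature vector is ever built, which is the point of the kernel view.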

In this chapter, a case analysis is carried out for kernel-based relation extraction methods, which are considered the most successful approach so far. Some survey papers motivated by the importance and impact of the methodology have been published (Bach and Badaskar, 2007; Moncecchi, Minel and Wonsever, 2010). However, they fail to fully analyze the particular functional principles or characteristics of the kernel-based relation extraction models announced so far, and merely cite the contents of individual articles or provide limited analysis. Although the performance of most kernel-based relation extraction methods has been demonstrated on ACE evaluation collections, a comparison and analysis of their overall performance has not been made so far.

This chapter, unlike existing case studies, makes a close analysis of the operation principles and individual characteristics of five kernel-based relation extraction methods, starting from Zelenko, et al. (2003), the source of kernel-based relation extraction studies, up to the composite kernel, which is considered the most advanced kernel-based relation extraction method (Choi, et al., 2009; Zhang, Zhang, Su, et al., 2006). We focus on the ACE collection to compare the overall performance of each method. We hope this study will contribute to further research on kernel-based relation extraction with even higher performance, and to high-level general kernel studies for linguistic processing and text mining.

Section 2 outlines supervised learning-based relation extraction methods, and in Section 3 we discuss kernel-based machine learning. Section 4 closely analyzes five exemplary kernel-based relation extraction methods. Section 5 compares the performance of these methods to analyze the advantages and disadvantages of each. Section 6 draws a conclusion.

## **2. Supervised learning-based relation extraction**


*Theory and Applications for Advanced Text Mining*

As discussed above, relation extraction methods are classified into three categories. The difference between feature-based and kernel-based methods is shown in Figure 1. With respect to the machine learning procedure, these two differ from semi-supervised learning methods.

On the left of Figure 1, the individual sentences that make up a learning collection have at least two entities (black squares) whose relation is manually extracted and predefined. Since most relation extraction methods studied so far work with binary relations, learning examples are modified for convenient relation extraction from the pair of entities by preprocessing the original learning collection. These modified learning examples are referred to as *relation instances*. A relation instance is defined as an element of the learning collection, modified so that it can be efficiently applied to the relevant relation extraction method, on the basis of a specific sentence that contains at least two entities.

The aforementioned modification is closely related to the feature information used in relation extraction. Since most supervised learning-based methods use both the entity itself and the contextual information about the entity, it is important to collect contextual information efficiently to improve performance. Linguistic processing (part-of-speech tagging, base phrase recognition, syntactic analysis, etc.) of the individual learning sentences in the pre-processing step provides a base for effective feature selection and extraction. For example, when one sentence shown in Figure 1 goes through syntactic analysis, one relation instance is composed of a parse tree and the locations of the entities indicated in the parse tree (Fundel, Küffner, & Zimmer, 2007; Zhang, et al., 2008; Zhou, Zhang, Ji, and Zhu, 2007). A single sentence can also be represented as a feature vector or a syntactic graph (Jiang and Zhai, 2007; W. Li, Zhang, Wei, Hou, and Lu, 2008; Zhang, Zhang, and Su, 2006). The type of such relation instances depends on the relation extraction method, and can involve various preprocessing tasks as well (D. P. T. Nguyen, Matsuo, and Ishizuka, 2007; Zhang, Zhang, and Su, 2006).
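The relation instance described above might be packaged, after pre-processing, roughly as follows; the token, POS, and span values are hand-written stand-ins for real tagger and parser output:

```python
# Illustrative sketch of a relation instance produced by pre-processing.
relation_instance = {
    "tokens": ["President", "Clinton", "was", "in", "Washington", "today"],
    "pos":    ["NNP",       "NNP",     "VBD", "IN", "NNP",        "NN"],
    # Entity mentions as (start_token, end_token, type) spans.
    "entities": [(1, 2, "PERSON"), (4, 5, "GPE")],
    "label": "Located",   # the predefined semantic relation to learn
}

def entity_texts(inst):
    """Recover the surface form of each entity mention from its span."""
    return [" ".join(inst["tokens"][s:e]) for s, e, _ in inst["entities"]]

print(entity_texts(relation_instance))
# ['Clinton', 'Washington']
```

A feature-based method would flatten such an instance into a vector, while a kernel-based method would consume it directly through a kernel function.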

extraction methods, the selection and creation of kernel functions are the most fundamen‐

Survey on Kernel-Based Relation Extraction http://dx.doi.org/10.5772/51005 5

As shown in the Figure 1, kernel functions (linear, polynomial, and sigmoid) can be used in feature-based methods as well. They are, however, functions applied only to instances ex‐ pressed with vectors. On the other hand, kernel-based methods are not limited in terms of

Most machine learning methods are carried out on a feature basis. That is, each instance to which an answer (label) is attached, is modified into a feature sequence or *N*-dimensional vector f1, f2, …, fN to be used in the learning process. For example, important features for identifying the relation between two entities in a single sentence are entity type, contextual information between, before and after entities' occurrence, part-of-speech information for contextual words, and dependency relation path information for the two entities (Choi et al., 2009; Kambhatla, 2004; W. Li, Zhang, et al., 2008; Zhang, Zhang, and Su, 2006). These data are selected as a single feature, respectively, to be represented with a vector for automatic

In section 2, we discussed that the essence of feature-based methods is to create a feature vector that can best express individual learning examples. However, in many cases, it is not possible to express the feature vector reasonably. For example, a feature space is re‐ quired for expressing syntactic information3 of a specific sentence as a feature vector, and it is almost impossible to express it as a feature vector in a limited space in some cases (Cristianini and Shawe-Taylor, 2000). Kernel-based methods are for learning by calculat‐ ing kernel functions between two examples while keeping the original learning example without additional feature expression (Cristianini and Shawe-Taylor, 2000). The kernel function is defined as the mapping *K* :*Χ* ×*Χ* → 0, *∞*) from the input space *Χ* to the simi‐

examples in the input space **υ** to the multidimensional feature space. The kernel function is symmetric, and exhibits positive semi-definite features. With the kernel function, it is not necessary to calculate all features one by one, and machine learning can thus be car‐ ried out based only on similarity between two learning examples. Exemplary models where learning is carried out on the basis of all similarity matrices between learning ex‐ amples include Perceptron (Rosenblatt, 1958), Voted Perceptron (Freund and Schapire, 1999), and Support Vector Machines (Cortes and Vapnik, 1995) (Moncecchi, et al., 2010). Recently, kernel-based matching learning methods draw increasingly more attention, and are widely used for pattern recognition, data and text mining, and web mining. The per‐

(*y*). Here, *ϕ*(*x*) is the mapping function from learning

the type of the instance, and thus can contain various kernel functions.

**3. Overview of kernel-based machine learning methods**

tal part that affects the overall performance.

classification of the relation between the entities.

*i ϕi* (*x*)*ϕ<sup>i</sup>*

3 Dependency grammar relation, parse tree, etc. between words in a sentence.

larity score *ϕ*(*x*) <sup>⋅</sup>*ϕ*(*y*)=∑

**Figure 1.** Learning process for supervised learning-based relation extraction.

In general, feature-based relation extraction methods follow the procedures shown in the upper part of Figure 1. That is, feature collections that "*can express individual learning ex‐ amples the best*" are extracted (feature extraction.) The learning examples are feature-vec‐ torized, and inductive learning is carried out using selected machine learning models. On the contrary, kernel-based relation extraction shown in the lower part of Figure 1 devises a kernel function that "*can calculate similarity of any two learning examples the most effective‐ ly*" to replace feature extraction process. Here, the measurement of similarity between the learning examples is not general similarity measurement in a general sense. That is, the function for enhancing similarity between the two sentences or instances that express the same relation is the most effective kernel function from the viewpoint of relation extrac‐ tion. For example, two sentences "*Washington is in the U.S.*" and "*Seoul is located in Korea*" use different object entities but feature the same relation ("*located*.") Therefore, an efficient kernel function would detect a high similarity between these two sentences. On the other hand, since the sentences "*Washington is the capital of the United States*" and "*Washington is located in the United States*" express the same object entities but different relations, the sim‐ ilarity between them should be determined as very low. As such, in kernel-based relation extraction methods, the selection and creation of kernel functions are the most fundamen‐ tal part that affects the overall performance.

As shown in Figure 1, kernel functions (linear, polynomial, and sigmoid) can be used in feature-based methods as well. They are, however, functions applied only to instances expressed as vectors. Kernel-based methods, on the other hand, are not limited in the type of the instance, and can thus employ various kernel functions.

## **3. Overview of kernel-based machine learning methods**

The pre-processing step contributes to making a base for effective feature selection and extraction. For example, when one sentence shown in the above Figure 1 goes through syntactic analysis, one relation instance is composed of a parse tree and the locations of the entities indicated in the parse tree (Fundel, Küffner, & Zimmer, 2007; Zhang, et al., 2008; Zhou, Zhang, Ji, and Zhu, 2007). A single sentence can be represented as a feature vector or a syntactic graph (Jiang and Zhai, 2007; W. Li, Zhang, Wei, Hou, and Lu, 2008; Zhang, Zhang, and Su, 2006). The type of such relation instances depends on the relation extraction methods, and can involve various preprocessing tasks as well (D. P. T. Nguyen, Matsuo, and Ishizuka, 2007; Zhang, Zhang, and Su, 2006).

Theory and Applications for Advanced Text Mining


Most machine learning methods are carried out on a feature basis. That is, each instance, to which an answer (label) is attached, is modified into a feature sequence or *N*-dimensional vector ⟨*f*<sub>1</sub>, *f*<sub>2</sub>, …, *f*<sub>*N*</sub>⟩ to be used in the learning process. For example, important features for identifying the relation between two entities in a single sentence are the entity types, the contextual information between, before, and after the entities' occurrences, part-of-speech information for the contextual words, and dependency relation path information for the two entities (Choi et al., 2009; Kambhatla, 2004; W. Li, Zhang, et al., 2008; Zhang, Zhang, and Su, 2006). These data are each selected as a single feature to be represented within a vector for automatic classification of the relation between the entities.
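As a concrete illustration of the feature-based side, the sketch below builds a small feature dictionary for one entity pair and maps it onto a fixed-length vector. The feature names, the example sentence, and the entity types are invented for illustration; they are not taken from the systems surveyed here.

```python
# Sketch: representing one relation instance as an N-dimensional feature vector.
# All feature names and the toy sentence are illustrative assumptions.

def extract_features(sentence_tokens, e1, e2):
    """Collect a few simple features for the entity pair (e1, e2)."""
    i, j = sentence_tokens.index(e1), sentence_tokens.index(e2)
    return {
        "e1_type=PERSON": 1.0,                                  # entity type (assumed known)
        "e2_type=ORG": 1.0,
        "between=" + "_".join(sentence_tokens[i + 1:j]): 1.0,   # context between entities
        "dist=" + str(j - i): 1.0,                              # token distance
    }

def vectorize(feats, feature_index):
    """Map a feature dict onto a fixed N-dimensional vector <f1, ..., fN>."""
    vec = [0.0] * len(feature_index)
    for name, value in feats.items():
        if name in feature_index:
            vec[feature_index[name]] = value
    return vec

tokens = ["John", "Smith", "works", "for", "Hardcom"]
feats = extract_features(tokens, "Smith", "Hardcom")
index = {name: k for k, name in enumerate(sorted(feats))}
print(vectorize(feats, index))  # a 4-dimensional feature vector
```

In a real system the feature index is built once over the whole training corpus, so every instance is mapped into the same *N*-dimensional space.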

In section 2, we discussed that the essence of feature-based methods is to create a feature vector that can best express individual learning examples. In many cases, however, it is not possible to express the feature vector reasonably. For example, a feature space is required for expressing the syntactic information<sup>3</sup> of a specific sentence as a feature vector, and in some cases it is almost impossible to express it as a feature vector in a limited space (Cristianini and Shawe-Taylor, 2000). Kernel-based methods learn by calculating kernel functions between two examples while keeping the original learning examples, without additional feature expression (Cristianini and Shawe-Taylor, 2000). The kernel function is defined as the mapping *K*: *Χ* × *Χ* → [0, ∞) from the input space *Χ* to the similarity score *ϕ*(*x*) ⋅ *ϕ*(*y*) = ∑<sub>*i*</sub> *ϕ*<sub>*i*</sub>(*x*)*ϕ*<sub>*i*</sub>(*y*). Here, *ϕ*(*x*) is the mapping function from learning examples in the input space *Χ* to the multidimensional feature space. The kernel function is symmetric and positive semi-definite. With the kernel function, it is not necessary to calculate all features one by one, and machine learning can thus be carried out based only on the similarity between two learning examples. Exemplary models where learning is carried out on the basis of similarity matrices between learning examples include the Perceptron (Rosenblatt, 1958), the Voted Perceptron (Freund and Schapire, 1999), and Support Vector Machines (Cortes and Vapnik, 1995; Moncecchi, et al., 2010). Recently, kernel-based machine learning methods have drawn increasingly more attention, and are widely used for pattern recognition, data and text mining, and web mining. The performance of kernel methods, however, depends to a great extent on kernel function selection or configuration (J. Li, Zhang, Li, and Chen, 2008).

<sup>3</sup> Dependency grammar relations, parse trees, etc. between words in a sentence.
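The defining identity *K*(*x*, *y*) = *ϕ*(*x*) ⋅ *ϕ*(*y*) can be checked numerically on a small example. The degree-2 polynomial kernel below is a standard textbook instance (not one used by the surveyed systems); on two-dimensional inputs its explicit mapping *ϕ* is known in closed form.

```python
import math

# Sketch: a kernel computes phi(x) . phi(y) without building phi explicitly.
# Degree-2 polynomial kernel on 2-D inputs, K(x, y) = (x . y)^2, which
# corresponds to the explicit map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).

def poly2_kernel(x, y):
    dot = x[0] * y[0] + x[1] * y[1]
    return dot ** 2

def phi(x):
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def explicit_dot(x, y):
    return sum(a * b for a, b in zip(phi(x), phi(y)))

x, y = (1.0, 2.0), (3.0, 0.5)
print(poly2_kernel(x, y), explicit_dot(x, y))  # identical values (up to rounding)
```

For structured inputs such as parse trees, the point is that the left-hand side stays cheap even when the implicit feature space of *ϕ* is huge or infinite.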


Survey on Kernel-Based Relation Extraction. http://dx.doi.org/10.5772/51005


Kernel-based learning methods are also used for natural language processing. Linear, polynomial and Gaussian kernels are typical in simple feature vector-based machine learning. The convolution kernel (Collins & Duffy, 2001) is used for efficient learning of structural data such as trees or graphs. The convolution kernel is a type of kernel function underlying the idea of sequence kernels (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002), tree kernels (Culotta & Sorensen, 2004; Reichartz, Korte, & Paass, 2009; Zelenco, et al., 2003; Zhang, Zhang, & Su, 2006; Zhang, et al., 2008), and graph kernels (Gartner, Flach, & Wrobel, 2003). The convolution kernel measures overall similarity by defining "sub-kernels" that measure similarity between the components of an individual entity and by calculating the convolution of these similarities over the components. For example, the sequence kernel divides the relevant sequence into subsequences, and measures the similarity of two sequences by accumulating similarity measurements between the subsequences. Likewise, the tree kernel divides a tree into its sub-trees, calculates the similarity between them, and then calculates the convolution of these similarities.
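A minimal sketch of the convolution idea for sequences: similarity is accumulated over pairs of shared subsequences. Real string kernels (Lodhi et al., 2002) add a gap-decay factor λ; this simplified variant, written for illustration only, lets every common subsequence contribute 1.

```python
from collections import Counter
from itertools import combinations

# Simplified sequence kernel: sum matches over all (not necessarily
# contiguous) subsequences of the two inputs. No decay factor here,
# unlike the string kernel of Lodhi et al. (2002).

def subsequences(seq):
    for n in range(1, len(seq) + 1):
        for positions in combinations(range(len(seq)), n):
            yield tuple(seq[p] for p in positions)

def sequence_kernel(s, t):
    cs, ct = Counter(subsequences(s)), Counter(subsequences(t))
    return sum(cs[u] * ct[u] for u in cs)

# Shared subsequences of the two token lists: ("is",), ("in",), ("is", "in")
print(sequence_kernel(["is", "located", "in"], ["is", "in"]))  # 3
```

The tree and graph kernels cited above follow the same pattern, with sub-trees or sub-graphs taking the place of subsequences.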

As described above, another advantage of the kernel methods is that learning is possible with a single kernel function over input instance collections of different types. For example, Choi et al. (2009) and Zhang, Zhang, Su, et al. (2006) have demonstrated a composite kernel in which the convolution parse tree kernel is combined with an entity kernel for high-performance relation extraction.
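The combination step can be sketched as a weighted sum, which is again a valid kernel whenever the components are kernels. The two component kernels, the instance encoding, and the weight α below are illustrative assumptions, not the composite kernel of Choi et al. (2009) or Zhang, Zhang, Su, et al. (2006).

```python
# Sketch: a composite kernel merges heterogeneous evidence, e.g. an
# entity kernel over typed entity pairs plus a lexical kernel over the
# sentences. Component kernels and alpha are illustrative choices.

def entity_kernel(inst1, inst2):
    """Count matching entity types between two instances."""
    return sum(1.0 for a, b in zip(inst1["types"], inst2["types"]) if a == b)

def token_kernel(inst1, inst2):
    """Bag-of-words overlap of the two sentences."""
    return float(len(set(inst1["tokens"]) & set(inst2["tokens"])))

def composite_kernel(inst1, inst2, alpha=0.4):
    return alpha * entity_kernel(inst1, inst2) + (1 - alpha) * token_kernel(inst1, inst2)

a = {"types": ("GPE", "GPE"), "tokens": ["Washington", "is", "in", "the", "U.S."]}
b = {"types": ("GPE", "GPE"), "tokens": ["Seoul", "is", "located", "in", "Korea"]}
print(composite_kernel(a, b))
```

In the surveyed work the lexical component is replaced by a convolution parse tree kernel, but the composition principle is the same.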

## **4. Kernel-based relation extraction**

The most prominent characteristic of the relation extraction models derived so far is that linguistic analysis is used to carefully identify relation expressions and syntactic structures directly and indirectly expressed in specific sentences. In this section, five important research results are discussed and analyzed. Of course, there are many other important studies that have drawn much attention due to their high performance. Most of the approaches, however, just modify or supplement the five basic methods discussed below. Therefore, this study can be an important reference for supplementing existing research results in the future or for studying new mechanisms for relation extraction, by intuitively explaining the details of the major studies. Firstly, the tree kernel method originally proposed by Zelenco, et al. (2003) is covered in detail. Then, the method proposed by Culotta & Sorensen (2004), where the dependency tree kernel was used for the first time, is covered. Also, kernel-based relation extraction (Bunescu & Mooney, 2005) using the dependency path between two entities in a specific sentence, on the basis of similar dependency trees, is discussed. Additionally, the subsequence kernel-based relation extraction method proposed by Bunescu & Mooney (2006) is explained. Finally, the relation extraction models (Zhang, Zhang, Su, et al., 2006) based on the composite kernel, in which various kernels are combined on the basis of the convolution parse tree kernel proposed by Collins & Duffy (2001), are covered in detail.

## **4.1. Tree kernel-based method (Zelenco, et al., 2003)**


This study is known as the first application of a kernel method to relation extraction. Parse trees derived from shallow parsing are used for measuring similarity between the sentences containing entities. In their study, the REES (Relation and Event Extraction System) is used to analyze the parts of speech and types of individual words in a sentence, as well as the syntactic structure of the sentence in question. The REES is a relation and event extraction system developed by Aone & Ramos-Santacruz (2000). Figure 2 below shows an example result from the REES.

**Figure 2.** Exemplary shallow parsing result for relation extraction.

As shown in Figure 2, when the syntactic analysis of the input sentence is completed, particular information for the words of the sentence is analyzed and extracted. Four types of attribute information are attached to all words or entities other than articles and stop words. Type represents the part of speech or entity type of the current word. While "*John Smith*" is of the "*Person*" type, "*scientist*" is specified as "PNP", representing a personal noun. Head represents the presence of key words of composite nouns or preposition phrases. Role represents the relation between the two entities. In Figure 2, "*John Smith*" is the "*member*" of "*Hardcom C*." In its turn, "*Hardcom C*." is an "*affiliation*" of "*John Smith*".
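One possible container for these per-node attributes is sketched below; the field values follow the "*John Smith*" example, but the class itself is an assumed representation, not the REES output format.

```python
from dataclasses import dataclass, field

# Sketch: holding the per-node attributes that the REES analysis
# attaches (Type, Head, Role) plus the surface text and children.

@dataclass
class ParseNode:
    type: str            # part-of-speech or entity type, e.g. "Person"
    text: str            # surface string
    role: str = ""       # relation role, e.g. "member"
    head: bool = False   # head word of a compound/prepositional phrase
    children: list = field(default_factory=list)

john = ParseNode(type="Person", text="John Smith", role="member")
org = ParseNode(type="Organization", text="Hardcom C.", role="affiliation")
sentence = ParseNode(type="Sentence", text="", children=[john, org])
print([c.role for c in sentence.children])  # ['member', 'affiliation']
```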

As one can see, it is possible to identify and extract the relation between the two entities in a specific sentence at a given level if the REES is used. Since the system was developed on the basis of rules, however, it has some limitations in terms of scalability and generality (Zelenco et al., 2003). Zelenco et al. (2003) have constructed tree kernels on the basis of the REES analysis results with a view to better relation extraction, overcoming these limitations by means of machine learning. The kernel function defined in this study for measuring the similarity of a pair of shallow parsing trees consists of the following chain of equations. The comparison function for each individual configuration node in the trees is as follows.

$$t(P_1.p, P_2.p) = \begin{cases} 1, & \text{if } P_1.Type = P_2.Type \text{ and } P_1.Role = P_2.Role \\ 0, & \text{otherwise} \end{cases} \tag{1}$$


In this equation, *P*<sub>*i*</sub>.*p* represents a specific parent node in the parsing tree; *P*<sub>*i*</sub>.*Type* represents word and entity type information; and *P*<sub>*i*</sub>.*Role* represents relation information between the two entities. Equation 1 is called the *matching function*, and is used for comparing the part-of-speech, entity type and relation type information of each node. If both the type and the role are the same, this binary comparison returns 1, and otherwise 0.

$$k(P_1.p, P_2.p) = \begin{cases} 1, & \text{if } P_1.Text = P_2.Text \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

Equation 2 represents the function deciding whether two nodes comprise the same words or entities. Zelenco et al. (2003) named this the *similarity function*. The recursive kernel function *K*<sub>*c*</sub> for the child nodes of a specific parent node is defined as follows on the basis of these two functions. Although, for simplicity, none of the functions in Zelenco et al. (2003) use the "Head" field of each node, it would be valuable to use this field for better performance.
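Equations 1 and 2 translate directly into two small predicates; the dict-based node encoding below is an assumed representation used only for illustration.

```python
# Sketch of the matching function t (Equation 1) and the similarity
# function k (Equation 2) over nodes carrying Type, Role and Text.

def t(p1, p2):
    """Matching function: 1 if type and role agree, else 0."""
    return 1 if p1["type"] == p2["type"] and p1["role"] == p2["role"] else 0

def k(p1, p2):
    """Similarity function: 1 if the node texts are identical, else 0."""
    return 1 if p1["text"] == p2["text"] else 0

a = {"type": "Verb", "role": "", "text": "be"}
b = {"type": "Verb", "role": "", "text": "be"}
c = {"type": "PNP", "role": "", "text": "scientist"}
print(t(a, b), k(a, b), t(a, c))  # 1 1 0
```

Note the division of labor: *t* gates whether two nodes are comparable at all, while *k* contributes the actual similarity score.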

$$\begin{aligned}
K_c(P_1.c, P_2.c) &= \sum_{\mathbf{i}, \mathbf{j},\, l(\mathbf{i})=l(\mathbf{j})} SSK(P_1.c, P_2.c, \mathbf{i}, \mathbf{j}) \\
SSK(P_1.c, P_2.c, \mathbf{i}, \mathbf{j}) &= \lambda^{d(\mathbf{i})} \lambda^{d(\mathbf{j})}\, K(P_1[\mathbf{i}], P_2[\mathbf{j}]) \prod_{s=1,\dots,l(\mathbf{i})} t(P_1[i_s], P_2[j_s]) \\
K(P_1[\mathbf{i}], P_2[\mathbf{j}]) &= \sum_{s=1,\dots,l(\mathbf{i})} K(P_1[i_s], P_2[j_s]) \\
\mathbf{i} &= \{i_1, i_2, \dots, i_n \mid i_1 \le i_2 \le \dots \le i_n\} \\
d(\mathbf{i}) &= i_n - i_1 + 1 \\
l(\mathbf{i}) &= n
\end{aligned} \tag{3}$$

In Equation 3, *P*<sub>*i*</sub>.*c* represents the child nodes of the specific node (*P*<sub>*i*</sub>.*p*). *SSK*(*P*<sub>1</sub>.*c*, *P*<sub>2</sub>.*c*, **i**, **j**) is the function calculating subsequence similarity between the child nodes (*P*<sub>*i*</sub>.*c*) of *P*<sub>1</sub>.*p* and *P*<sub>2</sub>.*p*. Here, **i** is an index vector representing a subsequence of the child nodes of *P*<sub>*i*</sub>.*p*, and λ, 0 < λ < 1, is the weight factor depending on the length of the child node subsequences. Through λ<sup>*d*(**i**)</sup>, subsequences that are spread widely across the child nodes contribute less, which lowers the similarity in the case of multiple sparse matching subsequences. *d*(**i**) represents the distance between the first and last nodes of the currently processed subsequence **i**, and *l*(**i**) represents the number of nodes in **i**. The kernel function between the two trees is defined as follows.

$$K(P_1, P_2) = \begin{cases} 0, & \text{if } t(P_1.p, P_2.p) = 0 \\ k(P_1.p, P_2.p) + K_c(P_1.c, P_2.c), & \text{otherwise} \end{cases} \tag{4}$$

*P*<sub>*i*</sub> represents the trees to be compared. The similarity between the two trees is calculated by adding up the similarity function *k* (Equation 2) for the current node (the parent node) and the similarity calculation function *K*<sub>*c*</sub> (Equation 3) between the child nodes. The kernel calculation process for the following parse trees is described in detail below for an intuitive understanding of the kernel function.

**Figure 3.** Two sample parse trees for illustrating kernel calculation process.


The original sentence of the parse tree on the left side is "*John Smith is a chief scientist of Hardcom C*.", and the original sentence of the tree on the right side is "*James Brown is a scientist at the University of Illinois*." For convenience, the left-side tree is referred to as *P*<sub>1</sub>, and the right-side tree as *P*<sub>2</sub>. For ease of explanation, "*chief*" and "*University of Illinois*" are removed or abbreviated. The kernel function for the two trees is primarily expressed as follows.

$$\begin{aligned}
K(P_1, P_2) = {}& k(P_1.Sentence.p, P_2.Sentence.p) \\
&+ K_c([P_1.Person, P_1.Verb, P_1.PNP], [P_2.Person, P_2.Verb, P_2.PNP]) \\
= {}& k(P_1.Sentence.p, P_2.Sentence.p) \\
&+ \sum_{\mathbf{i}, \mathbf{j},\, l(\mathbf{i})=l(\mathbf{j})} SSK([P_1.Person, P_1.Verb, P_1.PNP], [P_2.Person, P_2.Verb, P_2.PNP], \mathbf{i}, \mathbf{j})
\end{aligned} \tag{5}$$

Equation 4 is used to calculate the tree kernel between the two trees *P*<sub>1</sub> and *P*<sub>2</sub>. Here, *P*<sub>*i*</sub>.*Sentence*.*p* represents the root node of the *i*-th tree. Equation 3 is used to calculate *K*<sub>*c*</sub> for the child nodes of the root node, expanding it into a sum of *SSK* functions. Figure 4 below shows the process of calculating the *SSK* function.

Figure 4 shows the process of calculating the kernel function between the second-level child nodes of the two trees. Since all nodes at this level have matching node types, as shown in Figure 3, kernel similarity is calculated between the subsequences of each matching node, as shown in the equations on the right side of Figure 4. Since only matching between subsequences of the same length is implemented, as in Equation 3, non-matching subsequences are excluded from kernel calculation through conformity checks among subsequences of length 1, 2 and 3, respectively. The result of the kernel calculation is as follows.




**Figure 4.** Executing the Subsequence Similarity Kernel (SSK) function.

$$\begin{aligned}
K(P_1, P_2) = {}& k(P_1.Sentence.p, P_2.Sentence.p) + K_c([P_1.Person, P_1.Verb, P_1.PNP], [P_2.Person, P_2.Verb, P_2.PNP]) \\
= {}& k(P_1.Sentence.p, P_2.Sentence.p) \\
&+ \lambda^2 \{K(P_1.Person, P_2.Person) + K(P_1.Verb, P_2.Verb) + K(P_1.PNP, P_2.PNP)\} & (l(\mathbf{i})=1,\ d(\mathbf{i})=1) \\
&+ \lambda^4 \{K(P_1.Person, P_2.Person) + 2K(P_1.Verb, P_2.Verb) + K(P_1.PNP, P_2.PNP)\} & (l(\mathbf{i})=2,\ d(\mathbf{i})=2) \\
&+ \lambda^6 \{K(P_1.Person, P_2.Person) + K(P_1.PNP, P_2.PNP)\} & (l(\mathbf{i})=2,\ d(\mathbf{i})=3) \\
&+ \lambda^6 \{K(P_1.Person, P_2.Person) + K(P_1.Verb, P_2.Verb) + K(P_1.PNP, P_2.PNP)\} & (l(\mathbf{i})=3,\ d(\mathbf{i})=3) \\
= {}& 0 + \lambda^2 \{0 + 1 + K(P_1.PNP, P_2.PNP)\} + \lambda^4 \{0 + 2 + K(P_1.PNP, P_2.PNP)\} \\
&+ \lambda^6 \{0 + K(P_1.PNP, P_2.PNP)\} + \lambda^6 \{0 + 1 + K(P_1.PNP, P_2.PNP)\} \\
= {}& \lambda^2 + 2\lambda^4 + \lambda^6 + (\lambda^2 + \lambda^4 + 2\lambda^6)\, K(P_1.PNP, P_2.PNP)
\end{aligned} \tag{6}$$

As Equation 6 shows, only the terms expressed on the basis of λ are left, apart from *K*(*P*<sub>1</sub>.PNP, *P*<sub>2</sub>.PNP). The kernel function recursively compares the child node subsequences at the third level to calculate the resulting kernel similarity.
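The whole chain of Equations 1–4 can be made concrete in a short program. The sketch below runs the sparse subtree kernel on flattened, two-level versions of the Figure 3 trees (the PNP subtrees are dropped, so the numbers differ from Equation 6); the dict-based node encoding and λ = 0.5 are illustrative choices.

```python
from itertools import combinations

# A runnable sketch of Equations 1-4 (sparse subtree kernel) on
# simplified two-level trees. Nodes are plain dicts; the lambda value
# and the node fields are illustrative assumptions.

LAMBDA = 0.5  # decay factor, 0 < lambda < 1

def t(n1, n2):  # Equation 1: matching function over Type and Role
    return 1 if n1["type"] == n2["type"] and n1["role"] == n2["role"] else 0

def k(n1, n2):  # Equation 2: similarity function over Text
    return 1 if n1["text"] == n2["text"] else 0

def SSK(c1, c2, i, j):
    """Equation 3: decayed similarity of one pair of equal-length subsequences."""
    if any(not t(c1[a], c2[b]) for a, b in zip(i, j)):
        return 0.0
    d_i, d_j = i[-1] - i[0] + 1, j[-1] - j[0] + 1
    return LAMBDA ** d_i * LAMBDA ** d_j * sum(K(c1[a], c2[b]) for a, b in zip(i, j))

def K_c(c1, c2):
    """Equation 3: sum SSK over all equal-length child subsequence pairs."""
    total = 0.0
    for n in range(1, min(len(c1), len(c2)) + 1):
        for i in combinations(range(len(c1)), n):
            for j in combinations(range(len(c2)), n):
                total += SSK(c1, c2, i, j)
    return total

def K(n1, n2):
    """Equation 4: 0 if the parents do not match, else k plus K_c."""
    if not t(n1, n2):
        return 0.0
    return k(n1, n2) + K_c(n1["children"], n2["children"])

def node(type_, text, role="", children=()):
    return {"type": type_, "text": text, "role": role, "children": list(children)}

p1 = node("Sentence", "", children=[
    node("Person", "John Smith"), node("Verb", "be"), node("PNP", "scientist")])
p2 = node("Sentence", "", children=[
    node("Person", "James Brown"), node("Verb", "be"), node("PNP", "scientist")])

print(K(p1, p2))
```

With these flattened trees the result works out by hand to 1 + 2λ² + 3λ⁴ + 3λ⁶: the matching types make every equal-length subsequence pair comparable, while the differing *Person* texts contribute nothing.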

**Figure 5.** Process and method of calculating the tree kernel.

Figure 5 shows the process and method of calculating the kernel function. Basically, for the tree kernel, *Breadth First Search* is carried out. In calculating similarity between the two trees, trees to be compared are primarily those with the same node type and role. Since the kernel value is 0 when the text is different, nodes with the same text are substantially compared. In Figure 5, these are "*be*" and "*scientist*" nodes.

Zelenco, et al. (2003) divide tree kernel calculation into two types. One is the *sparse subtree kernel* described above, and the other is the *continuous subtree kernel*, discussed below. The sparse subtree kernel includes node subsequences for comparison even if they are not continuously connected. For example, "*Person*, *PNP*" on the left side and "*Person*, *PNP*" on the right side in the middle of Figure 4 are node sequences separated in the parse tree. The sparse subtree kernel includes such subsequences for comparison. On the contrary, the continuous subtree kernel does not allow such subsequences and excludes them from comparison. Figure 6 below shows an additional example sentence for comparison, describing the effect of the two tree kernels.

**Figure 6.** Additional sample sentence and parsing result.


Theory and Applications for Advanced Text Mining


Figure 6 shows the parsing result for "*John White, a well-known scientist at the University of Illinois, led the discussion*." Unlike the sentences discussed above, this one has an independent inserted phrase in apposition, yet conveys the same content as the second sentence of Figure 3, "*James Brown is a scientist at the University of Illinois*." If these two sentences are compared by means of the continuous subtree kernel, a very low kernel similarity is derived, because there is almost no continuous node correspondence on the parse tree even though the sentences have similar content. The sparse subtree kernel is used to overcome this deficiency. Figure 7 shows part of the process of calculating the kernel values for the two sentences.

Figure 7 shows the process of calculating *K*([*Person*, *Verb*, *PNP*], [*Person*, *Punc*, *PNP*, *Verb*, *BNP*]) by means of the sparse subtree kernel. When the continuous subtree kernel is used, the similarity between the two sentences is very low. A better similarity value is revealed by the two pairs of matching subsequences of length 2, as shown in Figure 7.


Survey on Kernel-Based Relation Extraction (http://dx.doi.org/10.5772/51005)




**Figure 7.** Process of calculating K([Person, Verb, PNP], [Person, Punc, PNP, Verb, BNP]).

To measure the performance of the two proposed tree kernels, Zelenco et al. (2003) used 60% of the manually constructed data as a learning collection and carried out 10-fold cross validation. Only two relations, "*Person-Affiliation*" and "*Organization-Location*", were tested. The test revealed that the kernel-based method offers better performance than the feature-based method, and that the continuous subtree kernel outperforms the sparse subtree kernel. In particular, the tree kernel proposed in their study inspired many new tree kernel researchers.

Their study is generally recognized as an important contribution for devising kernels that efficiently measure the similarity between very complex tree-structured entities, to be used later for relation extraction. However, since various information other than the syntactic information is still required, the method depends highly on the performance of the REES system used for creating the parse trees. Because the quantity of test data was not sufficient and only a binary classification test for two types of relation was carried out, the scalability and generality of the proposed kernel were not analyzed in detail.

#### **4.2. Dependency tree kernel-based method (Culotta & Sorensen, 2004)**

As the ACE collection was constructed and distributed from 2004, relation extraction began to be fully studied. Culotta and Sorensen (2004) proposed a kernel-based relation extraction method that uses the dependency parse tree structure on the basis of the tree kernel proposed by Zelenco et al. (2003), described in section 4.1. This special parse tree, called an *Augmented Dependency Tree*, uses MXPOST (Ratnaparkhi, 1996) to carry out parsing and then modifies some syntactic rules on the basis of the result. Exemplary rules include "*subjects are dependent on verbs*", "*adjectives are dependent on the nouns they describe*", etc. To improve the analysis result, this study also uses more complex node features than (Zelenco et al., 2003). The hypernym extracted from WordNet (Fellbaum, 1998) is applied to expand the matching function between nodes, and a composite kernel is constructed and applied to relation extraction for the first time.

**Figure 8.** Sample augmented dependency tree ("*Troops advanced near Tikrit*").



Figure 8 shows a sample augmented dependency tree used in this study. The root of the tree is the verb "*advanced*", and the resultant subject and the preposition are child nodes of the root. The object of the preposition, "*Tikrit*", is the child node of the preposition "*near*". Each node has 8 types of node feature information; Table 1 outlines each node feature.


| Feature | Example |
|---|---|
| Words | troops, Tikrit |
| General POS (5) | Noun, Verb, Adjective |
| Detailed POS (24) | NN, NNP |
| Chunking Information | NP, VP, ADJP |
| Entity Type | person, geo-political-entity |
| Entity Level | name, nominal, pronoun |
| WordNet hypernyms | social group, city |
| Relation Argument | ARG\_A, ARG\_B |

**Table 1.** Node features in the augmented dependency tree.

In Table 1, the first four features (words, part-of-speech information, and phrase information) are obtained from parsing, and the rest are named entity features from the ACE collection. Among them, the WordNet hypernym is the result of extracting the highest node for the corresponding word from the WordNet database.

As discussed above, the tree kernel defined by (Zelenco et al., 2003) is used in this method. Since features are added to each node, the matching function (Equation 1) and the similarity function (Equation 2) defined by Zelenco et al. (2003) are accordingly modified and applied. In detail, the features to be applied to the matching function and those to be applied to the similarity function were dynamically selected from among the 8 features to devise the following model.

$$
\begin{aligned}
& t_i: \text{feature vector representing node } i \qquad t_j: \text{feature vector representing node } j \\
& t_i^{\,m}: \text{subset of } t_i \text{ used in the matching function} \qquad t_i^{\,s}: \text{subset of } t_i \text{ used in the similarity function} \\
& m(t_i, t_j) = \begin{cases} 1 & \text{if } t_i^{\,m} = t_j^{\,m} \\ 0 & \text{otherwise} \end{cases} \\
& s(t_i, t_j) = \sum_{v_q \in t_i^{\,s}} \sum_{v_r \in t_j^{\,s}} C(v_q, v_r)
\end{aligned} \tag{7}
$$


Here, *m* is the matching function; *s* is the similarity function; and *t<sub>i</sub>* is the feature collection representing node *i*. *C*(,) is a function that compares two feature values by approximate matching rather than simple perfect matching. For example, treating "*NN*" and "*NP*" in the detailed part-of-speech information of Table 1 as the same part of speech is implemented by modifying the internal rules of this function. Equations 3 and 4 in section 4.1 are then applied as tree kernel functions for comparing the similarity of two augmented dependency trees on the basis of these two basic functions.
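The two basic functions of Equation 7 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the feature dictionaries, the key names, and the NN/NP equivalence table are assumptions made for the example.

```python
def m(t_i, t_j, match_keys):
    """Matching function: 1 only if the selected features agree exactly."""
    return 1 if all(t_i[k] == t_j[k] for k in match_keys) else 0

def C(v_q, v_r):
    """Feature-value comparison; a stand-in allowing approximate POS matches."""
    equivalent = {frozenset({"NN", "NP"})}  # treat NN and NP as the same POS
    if v_q == v_r or frozenset({v_q, v_r}) in equivalent:
        return 1
    return 0

def s(t_i, t_j, sim_keys):
    """Similarity function: sum of pairwise feature comparisons (Equation 7)."""
    return sum(C(t_i[q], t_j[r]) for q in sim_keys for r in sim_keys)

a = {"pos": "NN", "entity": "person"}
b = {"pos": "NP", "entity": "person"}
print(m(a, b, ["entity"]))          # 1: the entity types agree
print(s(a, b, ["pos", "entity"]))   # NN~NP and person==person both count
```

Dynamically dividing the 8 features between `match_keys` and `sim_keys` corresponds to the model selection described above.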

For the evaluation, the initial ACE collection version (2002), released in 2003, was used. This collection defines 5 entity types and 24 types of relations. Culotta & Sorensen (2004) tested relation extraction only for the five top-level relation types ("*AT*", "*NEAR*", "*PART*", "*ROLE*", "*SOCIAL*"). The tested kernels were the sparse subtree kernel (*K*<sub>0</sub>), the continuous subtree kernel (*K*<sub>1</sub>), and the bag-of-words kernel (*K*<sub>2</sub>). In addition, two composite kernels combining a tree kernel with the bag-of-words kernel, *K*<sub>3</sub> = *K*<sub>0</sub> + *K*<sub>2</sub> and *K*<sub>4</sub> = *K*<sub>1</sub> + *K*<sub>2</sub>, were constructed. The test, consisting of the two steps of relation detection<sup>4</sup> and relation classification<sup>5</sup>, revealed that all tree kernel methods, including the composite kernels, show better performance than the bag-of-words kernel. Although, unlike in the evaluation by (Zelenco et al., 2003), the performance of the continuous subtree kernel is higher, the reason for this was not clearly described, and the advantage of using the dependency tree instead of the full parse tree was not demonstrated in the experiment.

<sup>4</sup> Binary classification for identifying a possible relation between two named entities.

<sup>5</sup> Relation extraction for all instances with relations in the result of relation identification.
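A composite kernel of this kind is simply a sum of (usually normalized) kernels, since a sum of valid kernels is itself a valid kernel. The sketch below illustrates the construction with two toy stand-in kernels over token lists; the stand-ins are assumptions for the example, not the actual tree and bag-of-words kernels.

```python
import math

def normalize(K):
    """Kernel normalization: K'(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y))."""
    def K_norm(x, y):
        denom = math.sqrt(K(x, x) * K(y, y))
        return K(x, y) / denom if denom else 0.0
    return K_norm

def composite(*kernels):
    """Sum of kernels is itself a kernel, e.g. K3 = K0 + K2."""
    def K(x, y):
        return sum(k(x, y) for k in kernels)
    return K

# toy kernels standing in for the tree kernel and the bag-of-words kernel
bow = lambda x, y: len(set(x) & set(y))
length_k = lambda x, y: min(len(x), len(y))

K3 = composite(normalize(bow), normalize(length_k))
print(K3(["troops", "advanced"], ["forces", "advanced"]))  # 0.5 + 1.0
```

Normalizing before summing keeps one kernel with large raw values from dominating the composite score.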

#### **4.3. Shortest path dependency tree kernel method (Bunescu & Mooney, 2005)**


In section 4.2, we discussed relation extraction using dependency parse trees with the tree kernel proposed by Zelenco et al. (2003). Bunescu & Mooney (2005) studied the dependency path between two named entities in the dependency parse tree and proposed the shortest path dependency kernel for relation extraction. There is always a dependency path between two named entities in a sentence, and Bunescu & Mooney (2005) argued that the performance of relation extraction is improved by using these syntactic paths. Figure 9 below shows the dependency graph for a sample sentence.

**Figure 9.** Dependency graph and dependency syntax pair list for the sample sentence.

The red nodes in Figure 9 represent the named entities specified in the ACE collection. Separation of the entire dependency graph results in 10 dependency syntax pairs. It is possible to select the pairs that include named entities from the syntax pairs to construct the dependency path, as shown in Figure 10.

**Figure 10.** Extracting dependency path including named entities from dependency syntax pair collection.

As one can see from Figure 10, it is possible to construct the dependency path between the named entities "*Protesters*" and "*stations*," and the dependency path between "*workers*" and "*stations*". As discussed above, the dependency path connecting two named entities in this sentence can be extended infinitely. Bunescu & Mooney (2005) estimated that the shortest path among them contributes the most to establishing the relation between the two entities. Therefore, it is possible to use kernel-based learning for estimating the relation between two named entities connected by means of a dependency path. For example, for the path "*protesters seized stations*", it is estimated that the PERSON entity ("*protesters*") performed a specific behavior ("*seized*") on the FACILITY entity ("*stations*"), through which PERSON ("*protesters*") is located at FACILITY ("*stations*") ("LOCATED\_AT"). For another, more complex path, "*workers holding protesters seized stations*", it is possible to estimate that PERSON ("*workers*") is located at FACILITY ("*stations*") ("LOCATED\_AT") if PERSON ("*protesters*") did some behavior ("*holding*") to PERSON ("*workers*"), and PERSON ("*protesters*") did some behavior ("*seized*") to FACILITY ("*stations*"). As such, with the dependency relation path, it is possible to identify the semantic relation between two entities more intuitively.
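The path construction step can be sketched as a breadth-first search over the dependency pairs, treated as undirected edges. The pair list below is a hypothetical fragment for "protesters seized stations", not the full dependency graph of Figure 9.

```python
from collections import deque

def shortest_dependency_path(edges, start, goal):
    """BFS over undirected dependency edges; returns the word path or None."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], set()) - seen:
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # the two words are not connected

# hypothetical dependency pairs for "protesters seized stations"
pairs = [("seized", "protesters"), ("seized", "stations")]
print(shortest_dependency_path(pairs, "protesters", "stations"))
# ['protesters', 'seized', 'stations']
```

BFS guarantees that the first path reaching the goal entity is a shortest one, which is exactly the path the kernel below operates on.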


For learning, Bunescu & Mooney (2005) extracted the shortest dependency paths, including the two entities, from the individual training instances, as shown in the Table below.


| Relations | Relation Instances |
|---|---|
| LOCATED\_AT | **protesters** seized **stations** |
| LOCATED\_AT | **workers** holding **protesters** seized **stations** |
| LOCATED\_AT | **detainees** abusing **Jelisic** created at **camp** |

**Table 2.** Shortest path dependency tree-based sample relation extraction instances.

As shown in Table 2, each relation instance is expressed as a dependency path whose two ends are named entities. In terms of learning, however, it is not easy to extract sufficient features from such instances. Therefore, as discussed in section 4.2, various supplementary information is created, such as part-of-speech, entity type, and WordNet synset. As a result, the individual nodes which make up the dependency path comprise a plurality of information elements, and a variety of new paths are finally created, as shown in Figure 11.

$$
\begin{bmatrix} \text{protesters} \\ \text{NNS} \\ \text{Noun} \\ \text{PERSON} \end{bmatrix} \times \begin{bmatrix} \rightarrow \end{bmatrix} \times \begin{bmatrix} \text{seized} \\ \text{VBD} \\ \text{Verb} \end{bmatrix} \times \begin{bmatrix} \leftarrow \end{bmatrix} \times \begin{bmatrix} \text{stations} \\ \text{NNS} \\ \text{Noun} \\ \text{FACILITY} \end{bmatrix}
$$

**Figure 11.** New dependency path information created in a single instance.

As shown in Figure 11, with more information available for the individual nodes, 48 new dependency paths can be created through the Cartesian product of the node values. Here, relation extraction is carried out by applying a dependency path kernel that calculates the redundancy of the information included in each node, rather than by comparing all of the newly created paths, as shown below.
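The count of 48 comes directly from the Cartesian product of the per-node feature sets. The sketch below assumes feature sets along the lines of Figure 11 (four values for each entity node, three for the verb, and singleton arrow nodes); the exact values are illustrative.

```python
from itertools import product

# feature sets per node of the path "protesters -> seized <- stations"
protesters = ["protesters", "NNS", "Noun", "PERSON"]
arrow_right = ["->"]
seized = ["seized", "VBD", "Verb"]
arrow_left = ["<-"]
stations = ["stations", "NNS", "Noun", "FACILITY"]

paths = list(product(protesters, arrow_right, seized, arrow_left, stations))
print(len(paths))   # 4 * 1 * 3 * 1 * 4 = 48
print(paths[0])     # ('protesters', '->', 'seized', '<-', 'stations')
```

Comparing all 48 x 48 path pairs explicitly would be wasteful, which motivates the per-node redundancy kernel of Equation 8.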

$$
\begin{aligned}
\mathbf{x} &= x_1 x_2 \cdots x_m, \qquad \mathbf{y} = y_1 y_2 \cdots y_n \\
K(\mathbf{x}, \mathbf{y}) &= \begin{cases} 0, & m \neq n \\ \prod_{i=1}^{n} c(x_i, y_i), & m = n \end{cases} \\
c(x_i, y_i) &= | x_i \cap y_i |
\end{aligned} \tag{8}
$$

In Equation 8 above, *x* and *y* represent extended individual instances; *m* and *n* denote the lengths of the dependency paths; *K*(,) denotes the dependency path kernel; and *c*(,) is a function for calculating the level of information element redundancy between two nodes. The Figure below shows the process of calculating the kernel value on the basis of Equation 8.
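Equation 8 translates almost line for line into code. The feature sets in the example below are hypothetical three-node paths invented for illustration, not the instance used in Figure 12.

```python
def c(x_i, y_i):
    """Information overlap between two nodes (Equation 8): |x_i ∩ y_i|."""
    return len(set(x_i) & set(y_i))

def path_kernel(x, y):
    """Dependency path kernel: 0 for unequal lengths, else product of overlaps."""
    if len(x) != len(y):
        return 0
    result = 1
    for x_i, y_i in zip(x, y):
        result *= c(x_i, y_i)
    return result

# hypothetical feature sets for two three-node dependency paths
x = [{"his", "PRP", "Person"}, {"actions", "NNS", "Noun"}, {"troops", "NNS", "Noun"}]
y = [{"his", "PRP", "Person"}, {"arrest", "NN", "Noun"}, {"forces", "NNS", "Noun"}]
print(path_kernel(x, y))  # 3 * 1 * 2 = 6
```

Note that because the kernel is a product, a single node pair with no overlapping information drives the whole similarity to zero, just as a length mismatch does.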


**Figure 12.** Calculating dependency path kernel.



As shown in Figure 12, the process of comparing two dependency paths is very simple. If the lengths of the two paths differ, the kernel function simply returns zero (0). Otherwise, the level of information redundancy is calculated for each node position of the two paths. Since all the corresponding values are identical in the first node ("his", "PRP" and "Person"), the output is 3. As only one element matches in the second node, 1 is returned. By multiplying all the calculated values, the kernel value is found to be 18.

Using the same test environment and collection as Culotta & Sorensen (2004), two parsing systems, the CCG parser (Hockenmaier & Steedman, 2002) and the CFG parser (Collins, 1997), were used to construct the shortest dependency paths. For performance comparison, the test included *K*<sub>4</sub> (bag-of-words kernel + continuous subtree kernel), which demonstrated the best performance in Culotta & Sorensen (2004). The test revealed that the shortest dependency path kernel offers better performance with the CFG parser than with the CCG parser.



**Figure 14.** Two sentences and base phrase analysis result to illustrate the process of calculating subsequence kernel.

As shown in Figure 14, each of the nodes that make up the analysis result has 8 types of lexical information (word, part-of-speech, base phrase type, entity type, etc.). The kernel value *K*<sub>3</sub>(*s*, *t*, 0.5) of the two analysis sequences is calculated according to the process shown in Figure 15, with features of the subsequences of length 3.

There are three subsequence pairs, among the subsequences in *s* and *t*, for which the homogeneity decision function *c*(,) is nonzero for every node pair. The score for each matching subsequence pair is derived by calculating the cumulative product of *c*(,) over its nodes and then multiplying it by the weight. For example, a similarity of 0.84375 is obtained for "*troops advanced near*" and "*forces moved toward*". On the contrary, the similarity of "*troops advanced …Tikrit*" and "*forces moved …Bagdhad*" is 0.21093. This results from the lowered weight, because the two subsequences are positioned farther apart. Finally, the similarities of the subsequences are summed as in Equation 9, which gives 1.477.

With regard to relation extraction, the shortest dependency path information is considered very useful and is highly likely to be used in various fields. However, the kernel structure is too simple. Yet another limitation is that only paths of the same length are included in calculating the similarity of two dependency paths.

#### **4.4. Subsequence kernel-based method (Bunescu & Mooney, 2006)**

The tree kernel presented by Zelenco et al. (2003) basically compares two sibling nodes at the same level and uses the subsequence kernel. Bunescu & Mooney (2006) introduced the subsequence kernel and attempted relation extraction with base phrase analysis (chunking) alone, without applying the syntactic structure. Since the kernel input is not a complex syntactic structure but base phrase sequences, contextual information essential for relation extraction is easy to select. Taking advantage of this, they assumed that the feature space can be divided into the following 3 types, with a maximum of 4 words for each type of feature.

**Figure 13.** Contextual location information for feature extraction.

In Figure 13, [FB] represents the words positioned before and between the two entities; [B] means only the words between them; and [BA], accordingly, means the word collections between and after. The 3 types of feature collections can each accept individual relation expressions. Furthermore, various types of supplementary word information (part-of-speech, entity type, WordNet synset, etc.) are used to expand them, as in the methods described above.
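The three context types can be sketched as simple slicing over a chunk sequence. This is a sketch under stated assumptions: the function name, the window size of 4 words, and the sample sentence are illustrative, and the exact windowing in the original may differ.

```python
def contexts(tokens, i, j, max_fb=4, max_ba=4):
    """Split a chunk sequence into the three context types of Figure 13.

    i and j are the positions of the two entities; the window sizes
    (up to 4 words before/after) are assumptions for this sketch.
    """
    fb = tokens[max(0, i - max_fb):i] + tokens[i + 1:j]   # before + between
    b = tokens[i + 1:j]                                   # between only
    ba = tokens[i + 1:j] + tokens[j + 1:j + 1 + max_ba]   # between + after
    return {"FB": fb, "B": b, "BA": ba}

tokens = ["yesterday", "protesters", "seized", "the", "stations", "downtown"]
ctx = contexts(tokens, 1, 4)
print(ctx["B"])    # ['seized', 'the']
print(ctx["FB"])   # ['yesterday', 'seized', 'the']
print(ctx["BA"])   # ['seized', 'the', 'downtown']
```

Each of the three slices then becomes an input sequence for the subsequence kernel defined below.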

Zelenco et al. (2003) described how to calculate the subsequence kernel, which will be described in detail again later. The kernel calculation function *K*<sub>*n*</sub>(*s*, *t*) is defined as shown below, based on all *n*-length subsequences included in the two sequences *s* and *t*.

$$K_n(s, t, \lambda) = \sum_{\mathbf{i}: |\mathbf{i}| = n} \; \sum_{\mathbf{j}: |\mathbf{j}| = n} \lambda^{l(\mathbf{i}) + l(\mathbf{j})} \prod_{k=1}^{n} c(s_{i_k}, t_{j_k}) \tag{9}$$

Here, **i** and **j** represent index subsequences of *s* and *t* respectively; *c*(,) is a function for deciding the homogeneity of its two inputs; and λ is a weight given to matching subsequences. *l*(**i**) and *l*(**j**) indicate how wide a span each relevant subsequence occupies within the entire sequence. To calculate the weighted length, Equation 9 selects from *s* and *t* only the *n*-length subsequences that exist in both sequences. For ease of description, the following two sentences and their base phrase analysis results will be used to explain the process of calculating the kernel value.
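A direct (non-optimized) reading of Equation 9 enumerates all index subsequences explicitly. The sketch below is illustrative: it measures *l*(**i**) as the span covered by the subsequence and uses exact string equality as a stand-in for the homogeneity function *c*(,), whereas the original compares multi-feature nodes.

```python
from itertools import combinations

def subsequence_kernel(s, t, n, lam, c):
    """Subsequence kernel of Equation 9: sum over all length-n index
    subsequences i of s and j of t, weighted by lam ** (l(i) + l(j))."""
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            span = (i[-1] - i[0] + 1) + (j[-1] - j[0] + 1)  # l(i) + l(j)
            match = 1
            for a, b in zip(i, j):
                match *= c(s[a], t[b])
            total += lam ** span * match
    return total

# exact equality as a stand-in homogeneity function, lam = 0.5
c = lambda a, b: 1 if a == b else 0
print(subsequence_kernel(["Person", "Verb", "PNP"],
                         ["Person", "Verb", "PNP"], 2, 0.5, c))
```

Gapped matches such as ("Person", "PNP") span 3 positions on each side and are therefore down-weighted by λ⁶ relative to λ⁴ for contiguous pairs, which is exactly the penalty on spread-out subsequences described above. The explicit enumeration is exponential; the original work relies on a dynamic programming formulation for efficiency.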


Theory and Applications for Advanced Text Mining


**Figure 14.** Two sentences and base phrase analysis result to illustrate the process of calculating subsequence kernel.

As shown in Figure 14, each node of the analysis result carries 8 types of lexical information (word, part-of-speech, base phrase type, entity type, etc.). The kernel value *K*<sub>3</sub>(*s*, *t*, 0.5) of the two analysis sequences is calculated according to the process shown in Figure 15, using subsequences of length 3 as features.

Among the subsequences in s and t, there are three subsequence pairs for which the homogeneity decision function *c*(·,·) is nonzero at every node. The score for each matching pair is derived by taking the product of *c*(·,·) over its nodes and multiplying it by the weight. For example, a similarity of 0.84375 is obtained for "*troops advanced near*" and "*forces moved toward*". By contrast, the similarity of "*troops advanced …Tikrit*" and "*forces moved …Baghdad*" is 0.21093; this results from the lowered weight, because the two subsequences are spread further apart. Finally, the similarities of the matching subsequences are fed into Equation 8, which gives 1.477.
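To make Equation 9 concrete, the following is a minimal brute-force sketch, not the efficient dynamic-programming recursion used in the original papers. It assumes exact token matching for *c*(·,·) and takes *l*(i) to be the span covered by an index subsequence, from its first to its last position; both are illustrative simplifications.

```python
from itertools import combinations

def subseq_kernel(s, t, n, lam, c=lambda a, b: 1.0 if a == b else 0.0):
    """Brute-force evaluation of Eq. (9): sum, over all pairs of length-n
    index subsequences i of s and j of t, of
    lam**(l(i) + l(j)) * prod_k c(s[i_k], t[j_k])."""
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            match = 1.0
            for a, b in zip(i, j):
                match *= c(s[a], t[b])
                if match == 0.0:
                    break
            if match:
                # l(i): span from the first to the last matched position
                total += lam ** ((i[-1] - i[0] + 1) + (j[-1] - j[0] + 1)) * match
    return total
```

With λ = 1 every common subsequence contributes exactly 1, so the kernel simply counts common length-*n* subsequences; with λ < 1, widely spread matches are discounted, which is the effect behind the 0.84375 versus 0.21093 scores above.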


Survey on Kernel-Based Relation Extraction http://dx.doi.org/10.5772/51005


**Figure 15.** Process of calculating K3(s, t, 0.5).

As described above, it is possible to construct the subsequence kernel function over subsequences of all lengths by using contextual location information together with Equation 8. Figure 16 shows the surrounding contextual locations where named entities occur in the sentence.

**Figure 16.** Specifying contextual information depending on named entity location and defining variables.

In Figure 16, *x*<sub>i</sub> and *y*<sub>i</sub> represent the named entities; *s*<sub>f</sub> and *t*<sub>f</sub> denote the word lists before the named entities; *s*<sub>a</sub> and *t*<sub>a</sub> denote the contextual word collections after the two entities; and *s*<sub>b</sub>′ and *t*<sub>b</sub>′ represent the contextual information including the two entities. Thus, the subsequence kernel for two sequences *s* and *t* is defined with the following Equation 10.

$$\begin{aligned}
K(s, t) &= K_{fb}(s, t) + K_{b}(s, t) + K_{ba}(s, t) \\
K_{b,i}(s, t) &= K_{i}(s_{b}, t_{b}, 1) \cdot c(x_{1}, y_{1}) \cdot c(x_{2}, y_{2}) \cdot \lambda^{l(s_{b}') + l(t_{b}')} \\
K_{fb}(s, t) &= \sum_{i \ge 1,\; j \ge 1,\; i+j < fb_{max}} K_{b,i}(s, t) \cdot K_{j}'(s_{f}, t_{f}) \\
K_{b}(s, t) &= \sum_{1 \le i \le b_{max}} K_{b,i}(s, t) \\
K_{ba}(s, t) &= \sum_{i \ge 1,\; j \ge 1,\; i+j < ba_{max}} K_{b,i}(s, t) \cdot K_{j}'(\bar{s}_{a}, \bar{t}_{a})
\end{aligned} \tag{10}$$

The subsequence kernel *K*(*s*, *t*) is the sum of the contextual kernel before and between the entity pair, *K*<sub>fb</sub>; the intermediate contextual kernel between the entities, *K*<sub>b</sub>; and the contextual kernel between and after the entity pair, *K*<sub>ba</sub>. *fb*<sub>max</sub> is the target length of the "Fore-Between" context, and *b*<sub>max</sub> is the target length of the "Between" context. Likewise, *ba*<sub>max</sub> is the target length of the "Between-After" context, as seen in Figure 13. *s̄* and *t̄* are the reversed versions of the strings *s* and *t*, respectively. The individual contextual kernels are defined in the third to fifth lines of Equation 10. Here, *K′*<sub>n</sub> is the same as *K*<sub>n</sub>, except that it measures the length of the relevant subsequence from the location where the subsequence starts to the end of the entire sequence, and is defined as follows.

$$K_n'(s, t, \lambda) = \sum_{i:|i|=n} \sum_{j:|j|=n} \lambda^{|s|+|t|-i_1-j_1+2} \prod_{k=1}^{n} c(s_{i_k}, t_{j_k}) \tag{11}$$

In Equation 11, *i*<sub>1</sub> and *j*<sub>1</sub> represent the starting positions of the subsequences i and j, respectively. Each contextual kernel of Equation 10 calculates the similarity between the two sequences over the subsequences in the regions delimited as specified in Figure 16, and the kernel values are totalized to produce the resulting kernel value.
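Equation 11 can be sketched in the same brute-force way; the only change relative to Equation 9 is the exponent, which now measures the distance from each subsequence's (1-based) start position *i*<sub>1</sub>, *j*<sub>1</sub> to the end of its sequence. The exact-match *c*(·,·) is again an illustrative assumption.

```python
from itertools import combinations

def subseq_kernel_prime(s, t, n, lam, c=lambda a, b: 1.0 if a == b else 0.0):
    """Brute-force evaluation of Eq. (11): the weight is
    lam**(|s| + |t| - i_1 - j_1 + 2) with 1-based start positions."""
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            match = 1.0
            for a, b in zip(i, j):
                match *= c(s[a], t[b])
                if match == 0.0:
                    break
            if match:
                # 0-based i[0] equals i_1 - 1, so the exponent
                # |s| + |t| - i_1 - j_1 + 2 simplifies to:
                total += lam ** (len(s) + len(t) - i[0] - j[0]) * match
    return total
```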

A performance evaluation in the same test environment as used in Sections 4.2 and 4.3 shows improved performance, even without complicated pre-processing such as parsing, and without any syntactic information. In conclusion, the evaluation shows that this method learns very fast and is an approach with considerable potential for improving performance.

#### **4.5. Composite kernel-based method (Zhang, Zhang, Su, et al., 2006)**


The release of the ACE 2003 and 2004 collections contributed to full-scale studies on relation extraction. In particular, these collections are characterized by even richer information for tagged entities. For example, ACE 2003 provides various entity features, e.g., entity headword, entity type, and entity subtype for a specific named entity, and these features have been used as an important clue for determining the relation between two entities in a specific sentence. In this context, Zhang, Zhang, Su, et al. (2006) built a composite kernel in which the convolution parse tree kernel proposed by Collins & Duffy (2001) is combined with the entity feature kernel. Equation 12 gives the entity kernel definition.

$$\begin{aligned}
K_L(R_1, R_2) &= \sum_{i=1,2} K_E(R_1.E_i,\, R_2.E_i) \\
K_E(E_1, E_2) &= \sum_{i} C(E_1.f_i,\, E_2.f_i) \\
C(f_1, f_2) &= \begin{cases} 1, & \text{if } f_1 = f_2 \\ 0, & \text{otherwise} \end{cases}
\end{aligned} \tag{12}$$

In Equation 12, *R*<sub>i</sub> represents a relation instance, and *R*<sub>i</sub>.*E*<sub>j</sub> is the j-th entity of *R*<sub>i</sub>. *E*<sub>i</sub>.*f*<sub>j</sub> represents the j-th entity feature of entity *E*<sub>i</sub>, and *C*(·,·) is a homogeneity function for two features. The entity kernel *K*<sub>L</sub> is calculated by summing the feature redundancy decision kernel *K*<sub>E</sub> over the pairs of aligned entities.
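A sketch of Equation 12, representing a relation instance as a pair of entities and each entity as a dict of feature values; the feature names used in the example are hypothetical placeholders, not the actual ACE feature inventory.

```python
def k_e(e1, e2):
    """K_E: number of features on which the two entities agree (C(f1, f2))."""
    return sum(1.0 for f in e1 if f in e2 and e1[f] == e2[f])

def k_l(r1, r2):
    """K_L of Eq. (12): sum of K_E over the two aligned entity slots."""
    return sum(k_e(r1[i], r2[i]) for i in (0, 1))
```

For example, two instances whose entities share entity types but have different headwords would score `k_l = 2.0` (one matching feature per entity slot).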

Second, the convolution parse tree kernel expresses a parse tree as an occurrence frequency vector over subtrees, as follows, so as to measure the similarity between two parse trees.

$$\phi(T) = \left( \#subtree_1(T), \ldots, \#subtree_i(T), \ldots, \#subtree_n(T) \right) \tag{13}$$

In Equation 13, #*subtree*<sub>i</sub>(*T*) represents the occurrence frequency of the i-th subtree. All parse trees are expressed with such vectors, and the kernel function is calculated as the inner product of two vectors, as follows.

$$K(T_1, T_2) = \langle \phi(T_1), \phi(T_2) \rangle \tag{14}$$


**Figure 17.** Parsing tree and its subtree collection.

Figure 17 shows all subtrees of a specific parse tree. There are nine subtrees in the figure altogether, and each subtree is an axis of the vector that expresses the parse tree on the left side. If *M* is the number of all distinct subtrees that can be extracted from the *N* parse trees, each parse tree can be expressed as an *M*-dimensional vector.

As shown in Figure 17, a subtree of a specific parse tree is subject to two constraints. First, the subtree must contain at least 2 nodes; second, the subtree must comply with the production rules used by the syntactic parser to generate parse trees of sentences (Collins & Duffy, 2001). For example, [VP VBD "*got*"] cannot become a subtree.

Building the vector for each parse tree requires investigating all subtrees in the tree *T* and calculating their frequencies. This process is quite inefficient, however. Since the kernel-based method only needs the similarity of two parse trees, we can devise an indirect kernel function without building the subtree vector of Equation 13 for each parse tree. The following Equation 15, proposed by Collins & Duffy (2001), is used to calculate the similarity of two parse trees efficiently.

$$\begin{aligned} K(T_1, T_2) &= \langle \phi(T_1), \phi(T_2) \rangle \\ &= \sum_i \#subtree_i(T_1) \cdot \#subtree_i(T_2) \\ &= \sum_i \Big( \sum_{n_1 \in N_1} I_{subtree_i}(n_1) \Big) \cdot \Big( \sum_{n_2 \in N_2} I_{subtree_i}(n_2) \Big) \\ &= \sum_{n_1 \in N_1} \sum_{n_2 \in N_2} \Delta(n_1, n_2) \\ &\quad N_1, N_2 \rightarrow \text{the sets of nodes in trees } T_1 \text{ and } T_2. \end{aligned} \tag{15}$$
 
$$I\_{subtree\_i}(n) = \begin{cases} 1 & \text{if } \text{ROOT}(subtree\_i) = n \\ 0 & \text{otherwise} \end{cases}$$

$$\Delta(n\_1, n\_2) = \sum\_i I\_{subtree\_i}(n\_1) \cdot I\_{subtree\_i}(n\_2)$$

In Equation 15, *T*<sub>i</sub> represents a specific parse tree, and *N*<sub>i</sub> represents the set of all nodes of that parse tree. *I*<sub>st</sub>(*n*) is the function checking whether the node *n* is the root node of the specific subtree *st*. The most time-consuming part of Equation 15 is calculating Δ(*n*<sub>1</sub>, *n*<sub>2</sub>). To make this efficient, Collins & Duffy (2001) came up with the following algorithm.

$$\begin{aligned}
\text{1.}\quad & \text{If the CFG production rules of } n_1 \text{ and } n_2 \text{ are different from each other,} \\
& \Delta(n_1, n_2) = 0 \\
\text{2.}\quad & \text{If both } n_1 \text{ and } n_2 \text{ are pre-terminals (POS tags),} \\
& \Delta(n_1, n_2) = 1 \times \lambda \\
\text{3.}\quad & \text{Otherwise,} \\
& \Delta(n_1, n_2) = \lambda \prod_{j=1}^{nc(n_1)} \Big( 1 + \Delta\big( ch(n_1, j),\, ch(n_2, j) \big) \Big) \\
& \qquad nc(n_1) \rightarrow \text{the number of children of } n_1 \\
& \qquad ch(n, j) \rightarrow \text{the } j\text{-th child of node } n \\
& \qquad \lambda \rightarrow \text{decay factor}
\end{aligned}$$

**Figure 18.** Algorithm for calculating Δ(*n*<sub>1</sub>, *n*<sub>2</sub>).


The function Δ(*n*<sub>1</sub>, *n*<sub>2</sub>) defined in step (3) of Figure 18 recursively compares the child nodes of the input nodes, until the termination conditions defined in (1) and (2) are satisfied, thereby computing the frequencies of the subtrees contained in both parse trees and their product. Here the decay factor λ, a variable introduced to limit the influence of large subtrees (larger subtrees of a parse tree contain other subtrees within them), is applied repeatedly while calculating the inner product of the subtree vectors.
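The recursion of Figure 18 is short to implement. The sketch below represents a parse tree node as a nested tuple `('LABEL', child, ...)` with words as plain strings; this encoding, and the default decay value, are illustrative choices, not those of the original implementation.

```python
def delta(n1, n2, lam):
    """Delta(n1, n2) of Figure 18, for trees written as nested tuples
    ('LABEL', child, ...) where leaf words are plain strings."""
    def production(n):  # node label plus the sequence of child labels
        return (n[0], tuple(c[0] if isinstance(c, tuple) else c
                            for c in n[1:]))
    if production(n1) != production(n2):        # rule (1)
        return 0.0
    if all(not isinstance(c, tuple) for c in n1[1:]):
        return lam                              # rule (2): pre-terminal
    result = lam                                # rule (3)
    for c1, c2 in zip(n1[1:], n2[1:]):          # same child count by rule (1)
        result *= 1.0 + delta(c1, c2, lam)
    return result

def tree_kernel(t1, t2, lam=0.4):
    """K(T1, T2) of Eq. (15): sum Delta over all pairs of nodes."""
    def nodes(t):
        found = [t]
        for c in t[1:]:
            if isinstance(c, tuple):
                found.extend(nodes(c))
        return found
    return sum(delta(a, b, lam) for a in nodes(t1) for b in nodes(t2))
```

For the two-word tree `("NP", ("DT", "the"), ("NN", "dog"))` compared with itself at λ = 1, the kernel returns 6.0, matching a hand count of its six valid subtrees.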

Two kernels built as described above, that is, the entity kernel and the convolution parse tree kernel, are combined in the following two manners.

$$K_1(R_1, R_2) = \alpha \cdot \frac{K_L(R_1, R_2)}{\sqrt{K_L(R_1, R_1) \cdot K_L(R_2, R_2)}} + (1-\alpha) \cdot \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \cdot K(T_2, T_2)}} \tag{16-1}$$


$$K_2(R_1, R_2) = \alpha \cdot \left( \frac{K_L(R_1, R_2)}{\sqrt{K_L(R_1, R_1) \cdot K_L(R_2, R_2)}} + 1 \right)^2 + (1-\alpha) \cdot \frac{K(T_1, T_2)}{\sqrt{K(T_1, T_1) \cdot K(T_2, T_2)}} \tag{16-2}$$

In Equations 16-1 and 16-2, *K*<sub>L</sub> represents the entity kernel and *K* stands for the convolution parse tree kernel. Equation 16-1 defines the composite kernel as a linear combination of the two normalized kernels, and Equation 16-2 defines the composite kernel constructed using a quadratic polynomial combination.
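Given any two base kernel callables, the two combinations are only a few lines. The α value below is an arbitrary illustration (the original work treats it as a tunable weight), and both base kernels are normalized exactly as in the equations.

```python
import math

def normalized(k, a, b):
    """k(a, b) / sqrt(k(a, a) * k(b, b)) — the normalization in Eqs. 16-1/16-2."""
    return k(a, b) / math.sqrt(k(a, a) * k(b, b))

def composite_linear(k_l, k_tree, r1, r2, t1, t2, alpha=0.4):
    """Eq. (16-1): linear combination of entity kernel and tree kernel."""
    return (alpha * normalized(k_l, r1, r2)
            + (1.0 - alpha) * normalized(k_tree, t1, t2))

def composite_poly(k_l, k_tree, r1, r2, t1, t2, alpha=0.4):
    """Eq. (16-2): quadratic polynomial combination of the entity kernel."""
    return (alpha * (normalized(k_l, r1, r2) + 1.0) ** 2
            + (1.0 - alpha) * normalized(k_tree, t1, t2))
```

Normalizing each kernel before combining keeps the two terms on a comparable scale, so α alone controls the balance between flat entity features and syntactic structure.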

Furthermore, Zhang, Zhang, Su, et al. (2006) proposed methods for pruning relation instances by keeping one part of the parse tree and removing the rest, so as to improve the similarity measurement performance of the kernel function and to exclude unnecessary contextual information from learning.


| Tree Pruning Method | Details |
|---|---|
| Minimum Complete Tree (MCT) | Minimum complete sub-tree encompassing the two entities |
| Path-enclosed Tree (PT) | Sub-tree belonging to the shortest path between the two entities |
| Chunking Tree (CT) | Sub-tree generated by discarding from PT all internal nodes except the nodes for base phrases and POS |
| Context-sensitive PT (CPT) | Sub-tree generated by adding two additional terminal nodes outside PT |
| Context-sensitive CT (CCT) | Sub-tree generated by adding two additional terminal nodes outside CT |
| Flattened PT (FPT) | Sub-tree generated by discarding from PT all nodes having only one parent and one child node |
| Flattened CPT (FCPT) | Sub-tree generated by discarding from CPT all nodes having only one parent and one child node |

**Table 3.** Relation instance pruning (Zhang, Zhang, Su, et al., 2006; Zhang et al., 2008).

For the evaluation, Zhang, Zhang, Su, et al. (2006) used both ACE 2003 and ACE 2004. They parsed all available relation instances with Charniak's parser (Charniak, 2001) and, on the basis of the parsing results, carried out instance conversion using the methods described in Table 3. The kernel tool developed by Moschitti (2004) and SVMLight (Joachims, 1998) were used for learning and classification.

The test shows that the composite kernel achieves better performance than the single syntactic kernel, and that, of the two combinations, the quadratic polynomial type performs better. This means that flat features (entity type features) and structural features (syntactic features) can be organically combined in a single kernel function. Considering that the Path-enclosed Tree (PT) method shows the best performance among all relation instance pruning methods, core relation-related syntactic information alone is enough to estimate the relation between two entities in a specific sentence.

#### **4.6. Other recent studies**

Choi et al. (2009) constructed and tested a composite kernel in which various lexical and contextual features are added by expanding the existing composite kernel. In addition to the syntactic feature, they extended the combination range of lexical (so-called flat) features from the entity features to the contextual features in order to achieve higher performance. Mintz, Bills, Snow, and Jurafsky (2009) proposed a new method of using Freebase, a semantic database containing thousands of relation types, to gather exemplary sentences for a specific relation and performing relation extraction on the basis of the obtained sentences. In their test, a collection of 10,000 instances covering 102 relation types showed an accuracy of 67.6%. In addition, T.-V. T. Nguyen, Moschitti, and Riccardi (2009) designed a new kernel that extends the existing convolution parse tree kernel, and Reichartz et al. (2009) proposed a method that extends the dependency tree kernel. As described above, most studies published so far are based on the kernels described in Sections 4.1 to 4.5.

## **5. Comparison and analysis**

In the previous section, five types of kernel-based relation extraction were analyzed in detail. Here, we discuss the results of comparing and analyzing these methods. Section 5.1 briefly describes the criteria for comparison and analysis. Section 5.2 compares the characteristics of the methods. Section 5.3 covers performance results in detail. Section 5.4 sums up the advantages and disadvantages of each method.

#### **5.1. Criteria for comparison and analysis**

Generally, a large variety of criteria can be used for comparing kernel-based relation extraction methods. The following six criteria, however, have been selected for this study. First, (1) the linguistic analysis and pre-processing method refers to the pre-processing and analysis methods applied to the individual instances that compose the learning and evaluation collections, e.g., the type of parsing method or the parsing system used. (2) The level of linguistic analysis, a criterion related to (1), refers to the level to which linguistic analysis is carried out when pre-processing and analyzing instances; exemplary levels include part-of-speech tagging, base phrase analysis, dependency parsing, and full parsing. (3) The method of selecting a feature space determines whether the substantial input of the kernel function is an entire sentence or a part thereof. (4) The applied lexical and supplementary feature information refers to the various supplementary features used for addressing the issue of sparse data. (5) The relation extraction method is the practical extraction procedure based on the learned models: either multi-class classification carried out in a single step, or a cascade method that first separates instances with relations from those without relations by binary classification and then carries out relation classification only for the instances with relations. (6) The manual work requirement indicates whether the entire process is carried out fully automatically or manual work is required at some step. These six criteria were used to analyze the kernel-based relation extraction methods, and the result of the analysis is shown in Table 6.
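Criterion (5) distinguishes a single-phase multi-class classifier, in which "no relation" is simply one more class, from a cascade. A minimal sketch of the cascade mode, assuming hypothetical scikit-learn-style classifier objects:

```python
def cascade_extract(detector, classifier, instance):
    """Phase 1: binary relation detection; Phase 2: relation classification.

    `detector` and `classifier` are assumed to expose a predict() method
    returning one label per instance, as scikit-learn estimators do.
    """
    if detector.predict([instance])[0] == "no_relation":
        return "no_relation"
    return classifier.predict([instance])[0]
```

The single-phase alternative simply calls one multi-class `classifier.predict` whose label set includes `"no_relation"`.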

In addition, for the purpose of describing the characteristics of each kernel function, the description in Section 4 will be summarized, and the structure of the factors that serve as inputs to each kernel function will be described. Modifications of the kernel functions for optimized speed are included in the analysis criteria as well. For the performance comparison of the individual methods, the types and scale of the test collections and the tested relations are analyzed and described in detail. Table 4 describes the ACE collection, which is the most widely used among the test collections developed for relation extraction so far.

| Items | ACE-2002 | ACE-2003 | ACE-2004 |
| --- | --- | --- | --- |
| # training documents | 422 | 674 | 451 |
| # training relation instances | 6,156 | 9,683 | 5,702 |
| # test documents | 97 | 97 | N/A |
| # test relation instances | 1,490 | 1,386 | N/A |
| # entity types | 5 | 5 | 7 |
| # major relation types | 5 | 5 | 7 |
| # relation sub-types | 24 | 24 | 23 |

**Table 4.** Description of ACE Collection.

As shown in Table 4, the ACE collection is generally used and can be divided into three versions. ACE-2002, however, is not widely used because of consistency and quality problems. There are 5 to 7 entity types, e.g., Person, Organization, Facility, Location, Geo-Political Entity, etc. For relations, all collections are structured at two levels, with 23 to 24 particular relation sub-types corresponding to 5 major types such as Role, Part, Located, Near, and Social. As the method of constructing these collections has advanced and their quality has improved, the scale of the training instances has tended to decrease. Although subsequent collections have already been constructed, they are not made public, according to the non-disclosure policy of the ACE Workshop; this should be improved to support active research.

#### **5.2. Comparison of characteristics**

Table 5 summarizes the concept of each kernel function, before the kernel-based relation extraction methods are analyzed according to the six comparison and analysis criteria described in Section 5.1.

| Kernel types | Description of concept |
| --- | --- |
| **Tree Kernel (TK)** | Compares each node of the two trees being compared. |
| **Dependency Tree Kernel (DTK)** | Based on BFS, applies a subsequence kernel to the child nodes located at the same level. The decay factor is adjusted and applied to the similarity depending on the length of the subsequences at the same level or on the level itself. |
| **Shortest Path Dependency Kernel (SPDK)** | Compares each element of the two paths, cumulatively calculating the values common to the elements, and computes the similarity by multiplying all values. Similarity is 0 if the lengths of the two paths differ. |
| **Subsequence Kernel (SK)** | For the example of measuring the similarity of two words: among all subsequences of length n belonging to the two words, extracts only the subsequences that exist in both words and expresses the two words as vectors (Φ(x)) using these subsequences. Afterwards, obtains the inner product of the two vectors to calculate the similarity. Generalizes SSK and uses it to compare the planar information of the tree kernel (sibling nodes). |
| **Composite Kernel (CK)** | Finds all subtrees in the typical CFG-type syntactic tree and establishes them as coordinate axes to represent the parse tree as a vector (Φ(x)). The following constraints hold: (1) the number of nodes must be at least 2, and (2) the subtree must comply with the CFG production rules. Since a subtree can occur multiple times, each coordinate value can be 1 or greater, and the similarity is calculated by obtaining the inner product of the two vectors created in this way. |

**Table 5.** Summary of kernel-based relation extraction methods.

Table 6 shows the comparison and analysis of the kernel-based relation extraction methods; the closer a method is to the right side of the table, the more recent it is. A characteristic found in all of the methods is that various feature information is used in addition to the syntactic information. Such heterogeneous information was at first combined within a single kernel function, but later came to be applied as a separate kernel within the composite kernel.

With respect to the selection of the feature space, most methods other than the tree kernel apply whole sentences or a part of the parse tree. Manual work was initially required for extracting relation instances and building the parse tree; the recently developed methods, however, offer full automation. Among the relation extraction methods, multi-class classification is used, in which the case with no relation is included as one more relation class.
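The SPDK concept summarized in Table 5 translates almost directly into code: each path node is represented by a set of features, the shared features at each position are counted, and the counts are multiplied. The feature sets below echo Bunescu and Mooney's "his actions in Brcko" / "his arrival in Beijing" example, but are hand-made for illustration:

```python
def path_kernel(path1, path2):
    """Multiply per-position counts of shared features; 0 if lengths differ."""
    if len(path1) != len(path2):
        return 0
    score = 1
    for features1, features2 in zip(path1, path2):
        score *= len(features1 & features2)
    return score

# Each element: a set of features (word, coarse POS, entity type) of one
# node on the shortest dependency path between the two entities.
p1 = [{"his", "PRP", "PERSON"}, {"actions", "Noun"}, {"in", "IN"}, {"Brcko", "NNP", "LOCATION"}]
p2 = [{"his", "PRP", "PERSON"}, {"arrival", "Noun"}, {"in", "IN"}, {"Beijing", "NNP", "LOCATION"}]
```

Here `path_kernel(p1, p2)` multiplies 3 · 1 · 2 · 2 = 12, while paths of different lengths score 0, exactly the two behaviours listed for SPDK in Table 5.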

| Criteria | TK | DTK | SPDK | SK | CK |
| --- | --- | --- | --- | --- | --- |
| **Language Processor** | Shallow Parser (REES) | Statistical Parser (MXPOST) | Statistical Parser (Collins' Parser) | Chunker (OpenNLP) | Statistical Parser (Charniak's Parser) |
| **Level of Language Processing** | PLO Tagging, Parsing | Parsing | Parsing | Chunking | Parsing |
| **Feature Selection Methods** | Extracts features from parse trees manually | Selects a small sub-tree including the two entities from the entire dependency tree | Selects the shortest dependency path which starts with one entity and ends with the other | Before Entities / Between Entities / After Entities | MCT / PT / CPT / FPT / FCPT |
| **Features used in Kernel Computation** | Entity Headword, Entity Role, Entity Text | Word, POS, Chunking Info., Entity Type, Entity Level, WordNet Hypernym, Relation Parameter | Word, POS, Chunking Info., Entity Type | Word, POS, Chunking Info., Entity Type, Chunk Headword | Entity Headword, Entity Type, Mention Type, LDC Mention Type, Chunking Info. |
| **Relation Extraction Methods** | Single Phase (Multiclass SVM) | Cascade Phase (Relation Detection and Classification) | Single Phase / Cascade Phase | Single Phase (Multiclass SVM) | Single Phase (Multiclass SVM) |
| **Manual Process** | Necessary (Instance Extraction) | Necessary (Dependency Tree) | Necessary (Dependency Path) | N/A | N/A |

**Table 6.** Comparison of characteristics of kernel-based relation extraction.

#### **5.3. Comparison of performance**

Table 7 shows the parameter type of each kernel function and the computational complexity of calculating the similarity of two inputs. Most of them show complexity of O(N<sup>2</sup>), but SPDK exceptionally demonstrates complexity on the order of O(N) and can be considered the most efficient kernel.

| Kernels | Parameter Structure | Time Complexity |
| --- | --- | --- |
| **TK** | Shallow parse trees | CSTK: O(N<sub>i,1</sub> · N<sub>i,2</sub>), SSTK: O(N<sub>i,1</sub> · N<sub>i,2</sub><sup>3</sup>) |
| **DTK** | Dependency trees | |
| **SPDK** | Dependency paths | O(N<sub>1</sub>) |
| **SK** | Chunking results | O(n · N<sub>1</sub> · N<sub>2</sub>) |
| **CK** | Full parse trees | O(N<sub>1</sub> · N<sub>2</sub>) |

(n: subsequence length; N<sub>1</sub>, N<sub>2</sub>: the number of nodes of the first and second inputs; N<sub>i,1</sub>, N<sub>i,2</sub>: the number of nodes of the first and second inputs in level i)

**Table 7.** Parameter structure and calculation complexity of each kernel.
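The O(n · N<sub>1</sub> · N<sub>2</sub>) bound given for SK in Table 7 comes from a dynamic program that fills one prefix table per subsequence length. The sketch below counts pairs of matching length-n subsequences with the gap-decay factor fixed to 1, a simplification of the decay-weighted kernel actually used:

```python
def subsequence_kernel(s, t, n):
    """Count pairs of matching length-n subsequence occurrences of s and t."""
    # k = 0: the empty subsequence matches once for every prefix pair.
    prev = [[1] * (len(t) + 1) for _ in range(len(s) + 1)]
    for _ in range(n):
        cur = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                # Inclusion-exclusion over shorter prefixes, plus an extension
                # of every length-(k-1) match when the current symbols agree.
                cur[i][j] = cur[i - 1][j] + cur[i][j - 1] - cur[i - 1][j - 1]
                if s[i - 1] == t[j - 1]:
                    cur[i][j] += prev[i - 1][j - 1]
        prev = cur
    return prev[len(s)][len(t)]
```

For example, `subsequence_kernel("cat", "cart", 2)` returns 3, one match each for the common subsequences "ca", "ct", and "at". The three nested loops over k, i, and j give exactly the O(n · N<sub>1</sub> · N<sub>2</sub>) behaviour of Table 7.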

It should be noted that the complexity shown in Table 7 is just the kernel calculation complexity. The overall complexity of relation extraction can be much higher when the processing time for parsing and learning is also considered.


| Articles | Year | Methods | Test Collection | F1 |
| --- | --- | --- | --- | --- |
| (Zelenco et al., 2003) | 2003 | TK | 200 News Articles (2-relations) | **85.0** |
| (Culotta & Sorensen, 2004) | 2004 | DTK | ACE-2002 (5-major relations) | **45.8** |
| (Kambhatla, 2004) | 2004 | ME | ACE-2003 (24-relation sub-types) | **52.8** |
| (Bunescu & Mooney, 2005) | 2005 | SPDK | ACE-2002 (5-major relations) | **52.5** |
| (Zhou et al., 2005) | 2005 | SVM | ACE-2003 (5-major relations) | **68.0** |
| (Zhou et al., 2005) | 2005 | SVM | ACE-2003 (24-relation sub-types) | **55.5** |
| (Zhao & Grishman, 2005) | 2005 | CK | ACE-2004 (7-major relations) | **70.4** |
| (Bunescu & Mooney, 2006) | 2006 | SK | ACE-2002 (5-major relations) | **47.7** |
| (Zhang, Zhang, Su, et al., 2006) | 2006 | CK | ACE-2003 (5-major relations) | **70.9** |
| (Zhang, Zhang, Su, et al., 2006) | 2006 | CK | ACE-2003 (24-relation sub-types) | **57.2** |
| (Zhang, Zhang, Su, et al., 2006) | 2006 | CK | ACE-2004 (7-major relations) | **72.1** |
| (Zhang, Zhang, Su, et al., 2006) | 2006 | CK | ACE-2004 (23-relation sub-types) | **63.6** |
| (Zhou et al., 2007) | 2007 | CK | ACE-2003 (5-major relations) | **74.1** |
| (Zhou et al., 2007) | 2007 | CK | ACE-2003 (24-relation sub-types) | **59.6** |
| (Zhou et al., 2007) | 2007 | CK | ACE-2004 (7-major relations) | **75.8** |
| (Zhou et al., 2007) | 2007 | CK | ACE-2004 (23-relation sub-types) | **66.0** |
| (Jiang & Zhai, 2007) | 2007 | ME/SVM | ACE-2004 (7-major relations) | **72.9** |

**Table 8.** Comparison of performance of each model of kernel-based relation extraction.
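The F1 scores in Tables 8 and 9 are the harmonic mean of precision and recall; for instance, Zhou et al. (2007)'s precision of 80.8 and recall of 68.4 on the ACE-2003 major relations yield roughly 74.1:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (values in percent)."""
    return 2 * precision * recall / (precision + recall)
```

Small discrepancies against the published tables can arise because the reported P and R values are themselves rounded.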


| Articles and Approaches | Test Collection | Relation Sets | P | R | F |
| --- | --- | --- | --- | --- | --- |
| (Zelenco et al., 2003) TK | 200 News | PLO-2 | 91.6 | 79.5 | 85.0 |
| (Culotta & Sorensen, 2004) DTK | ACE-2002 | Main-5 | 67.1 | 35.0 | 45.8 |
| (Kambhatla, 2004) ME | ACE-2003 | Sub-24 | 63.5 | 45.2 | 52.8 |
| (Bunescu & Mooney, 2005) SPDK | ACE-2002 | Main-5 | 65.5 | 43.8 | 52.5 |
| (Zhou et al., 2005) SVM | ACE-2003 | Main-5 | 77.2 | 60.7 | 68.0 |
| (Zhou et al., 2005) SVM | ACE-2003 | Sub-24 | 63.1 | 49.5 | 55.5 |
| (Zhao & Grishman, 2005) CK | ACE-2004 | Main-7 | 69.2 | 70.5 | 70.4 |
| (Bunescu & Mooney, 2006) SK | ACE-2002 | Main-5 | 73.9 | 35.2 | 47.7 |
| (Zhang, Zhang, Su, et al., 2006) CK | ACE-2003 | Main-5 | 77.3 | 65.6 | 70.9 |
| (Zhang, Zhang, Su, et al., 2006) CK | ACE-2003 | Sub-24 | 64.9 | 51.2 | 57.2 |
| (Zhang, Zhang, Su, et al., 2006) CK | ACE-2004 | Main-7 | 76.1 | 68.4 | 72.1 |
| (Zhang, Zhang, Su, et al., 2006) CK | ACE-2004 | Sub-23 | 68.6 | 59.3 | 63.6 |
| (Zhou et al., 2007) CK | ACE-2003 | Main-5 | 80.8 | 68.4 | 74.1 |
| (Zhou et al., 2007) CK | ACE-2003 | Sub-24 | 65.2 | 54.9 | 59.6 |
| (Zhou et al., 2007) CK | ACE-2004 | Main-7 | 82.2 | 70.2 | 75.8 |
| (Zhou et al., 2007) CK | ACE-2004 | Sub-23 | 70.3 | 62.2 | 66.0 |
| (Jiang & Zhai, 2007) ME/SVM | ACE-2004 | Main-7 | 74.6 | 71.3 | 72.9 |

**Table 9.** Comparison of performance of each kernel-based relation extraction method.

ACE-2002, the first version of the ACE collection, had issues with data consistency. In the subsequent versions this problem was continuously addressed, and it was finally resolved in ACE-2003. Starting from the 52.8% achieved by Kambhatla (2004) on the ACE-2003 collection with respect to the 24 relation sub-types, the performance was improved up to the 59.6% recently announced by Zhou et al. (2007). Similarly, the maximum relation extraction performance for the 23 particular relations of the ACE-2004 collection is currently 66%.

Although each model performs differently on the differently sized relation collections, the composite kernel generally shows better results. In particular, Zhou et al. (2007) demonstrated high performance for all collections and relation sets with models extending those initially proposed by Zhang, Zhang, Su, et al. (2006). As described above, the various features for relation extraction, that is, the syntactic structure and the vocabulary, can evidently be combined efficiently in a composite kernel for better performance.

Although the described results do not represent all studies on relation extraction, many parts have not been evaluated yet, even though a lot of results have been derived so far, as seen in Table 9. Evaluation on the basis of various collections is necessary for a comprehensive performance assessment of a specific relation extraction model, and this remains a challenge requiring further study. In particular, the key issue is to check whether such high relation extraction performance can be achieved without the characteristics of the ACE collections, in that they provide supplementary information (entity type information) of a considerable scale for relation extraction.

#### **5.4. Comparison and analysis of advantages and disadvantages of each method**

In this section, the advantages and disadvantages of the five kernel-based relation extraction methods are discussed and outlined in Table 10.

| Method | Advantage | Disadvantage |
| --- | --- | --- |
| Feature-based SVM/ME | Applies the typical automatic sentence classification without modification. Performance can be further improved by applying various feature information. | A lot of effort is required for feature extraction and selection. |
| Tree Kernel (TK) | Calculates particular similarity between shallow parse trees. Uses both structural (parenthood) and planar (brotherhood) information. Optimization for speed improvement. | Performance can be improved only through feature combination. Very limited use of structural information (syntactic relations). |
| Dependency TK (DK) | Addressed the issue of insufficient use of structural information, which is a disadvantage of TK. Uses key words, the core of relation expression in a sentence, as feature information, on the basis of the structure in which the predicate node is raised to a higher position, a structural characteristic of the dependency tree. Relatively high speed. | Slow similarity calculation speed in spite of optimized speed. Predicates and key words in a dependency tree are emphasized only by means of decay factors (low emphasis capability). |
| Shortest Path DTK (SPTK) | Creates a path between two named entities by means of dependency relations to reduce noise not related to the relation expression. Shows very fast computation speed because the kernel input is paths rather than trees. Adds various types of supplementary feature information to improve the performance of similarity measurement, thanks to the simple structure of paths. | Too simple structure of the kernel function. Too strong constraints, because the similarity is 0 if the lengths of the two input paths differ. |
| Subsequence Kernel (SK) | Very efficient because syntactic analysis information is not used. Adds various supplementary feature information to improve the performance. | Can include many unnecessary features. |
| Composite Kernel (CK) | Makes all constituent subtrees of a parse tree features, to fully use structural information in calculating similarity. Optimized for improved speed. | Slow similarity calculation speed. Comparison is carried out only on the basis of the sentence-component information of each node (phrase info.); kernel calculation based on composite feature information with reference to word classes, semantic info, etc. is required. |

**Table 10.** Analysis of advantages and disadvantages of kernel-based relation extraction.

As one can see in Table 10, each method has advantages and disadvantages. A lot of effort is required for the feature selection process in general feature-based relation extraction. The kernel-based methods do not have this disadvantage, but have various limitations instead. For example, although the shortest path dependency kernel has considerable potential, it showed low performance due to the overly simple structure of the kernel function. Since the composite kernel constitutes and compares subtree features only on the basis of the part-of-speech and vocabulary information of each node, the generality of its similarity measurement is not high. A way to overcome this is to use word classes or semantic information.

Schemes for designing a new kernel can be suggested to overcome the above shortcomings. For example, various supplementary feature information (WordNet synsets, thesauri, ontologies, part-of-speech tags, thematic role information, etc.) may be incorporated so as to ensure general comparison between subtrees in the composite kernel. The performance of the shortest path dependency kernel can be improved by replacing its current simple linear kernel with a subsequence or another composite kernel and applying all sorts of supplementary feature information, in order to address its shortcomings.

## **6. Conclusion**

In this chapter, we analyzed kernel-based relation extraction methods, which are considered the most efficient approach so far. Previous surveys did not fully cover the specific operating principles of the kernel-based relation extraction models; they merely cited the contents of individual studies or made analyses of limited scope. This chapter, however, closely examined the operating principles and individual characteristics of five kernel-based relation extraction methods, from the original kernel-based relation extraction study (Zelenco et al., 2003) to the composite kernel (Choi et al., 2009; Zhang, Zhang, Su, et al., 2006), which is considered the most advanced kernel-based method. The overall performance of each method was compared using the ACE collections, and the particular advantages and disadvantages of each method were summarized. This study should contribute to research on kernels for higher-performance relation extraction and, more generally, to kernel studies for linguistic processing and text mining.

## **Author details**

Hanmin Jung\*, Sung-Pil Choi, Seungwoo Lee and Sa-Kwang Song

\*Address all correspondence to: jhm@kisti.re.kr

Korea Institute of Science and Technology Information, Korea

## **References**


[1] ACE. (2009). *Automatic Content Extraction*. Retrieved from http://www.itl.nist.gov/iad/mig//tests/ace/.

[2] Agichtein, E., & Gravano, L. (2000). Snowball: extracting relations from large plain-text collections. *Proceedings of the Fifth ACM Conference on Digital Libraries*, 85-94, New York, NY, USA: ACM. doi:10.1145/336597.336644.

[3] Bach, N., & Badaskar, S. (2007). A survey on relation extraction. *Literature review for Language and Statistics II*.

[4] Brin, S. (1999). Extracting patterns and relations from the World Wide Web. *Lecture Notes in Computer Science*, 1590, 172-183.

[5] Bunescu, R., & Mooney, R. J. (2005). A shortest path dependency kernel for relation extraction. *Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing*, 724-731.

[6] Bunescu, R., & Mooney, R. J. (2006). Subsequence kernels for relation extraction. *Proceedings of the Ninth Conference on Natural Language Learning (CoNLL-2005)*, Ann Arbor, MI. Retrieved from http://www.cs.utexas.edu/users/ai-lab/pub-view.php?Pu

[7] Charniak, E. (2001). Immediate-head parsing for language models. *Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics*.

[8] Choi, S.-P., Jeong, C.-H., Choi, Y.-S., & Myaeng, S.-H. (2009). Relation extraction based on extended composite kernel using flat lexical features. *Journal of KIISE*.

[9] Collins, M. (1997). Three generative, lexicalised models for statistical parsing. *Proceedings of the 35th Annual Meeting of the ACL (jointly with the 8th Conference of the EACL)*, Madrid.

[10] Collins, M., & Duffy, N. (2001). Convolution kernels for natural language. *Advances in Neural Information Processing Systems (NIPS)*.

[11] Cortes, C., & Vapnik, V. (1995). Support-vector networks. *Machine Learning*, 20(3), 273-297, Hingham, MA, USA: Kluwer Academic Publishers. doi:10.1023/A:

[12] Cristianini, N., & Shawe-Taylor, J. (2000). *An Introduction to Support Vector Machines*. Cambridge University Press.

[13] Culotta, A., & Sorensen, J. (2004). Dependency tree kernels for relation extraction. *Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics*.

[14] Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., et al. (2005). Unsupervised named-entity extraction from the Web: An experimental study. *Artificial Intelligence*, 165(1), 91-134. Retrieved from http://

[19] Hockenmaier, J., & Steedman, M. (2002). Generative models for statistical parsing with Combinatory Categorial Grammar. *Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics*, Philadelphia, PA.

[20] Jiang, J., & Zhai, C. (2007). A systematic exploration of the feature space for relation extraction. *NAACL HLT*.

[21] Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. *ECML-1998*.

[22] Kambhatla, N. (2004). Combining lexical, syntactic and semantic features with Maximum Entropy models for extracting relations. *ACL-2004*.

[23] Li, J., Zhang, Z., Li, X., & Chen, H. (2008). Kernel-based learning for biomedical relation extraction. *Journal of the American Society for Information Science and Technology*, 59(5), 756-769, Wiley Online Library.

[24] Li, W., Zhang, P., Wei, F., Hou, Y., & Lu, Q. (2008). A novel feature-based approach to Chinese entity relation extraction. *Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers*, 89-92.

[25] Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. *Journal of Machine Learning Research*, 2, 419-444.

[26] MUC. (2001). *The NIST MUC Website*. Retrieved from http://www.itl.nist.gov/iaui/894.02/related_projects/muc/.

[27] Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. *Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP*, 2, 1003-1011, Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=1690219.1690287.

[28] Moncecchi, G., Minel, J. L., & Wonsever, D. (2010). A survey of kernel methods for relation extraction. *Workshop on NLP and Web-based technologies (IBERAMIA 2010)*.

[29] Moschitti, A. (2004). A study on convolution kernels for shallow semantic parsing. *ACL-2004*.

[30] Nguyen, D. P. T., Matsuo, Y., & Ishizuka, M. (2007). Exploiting syntactic and semantic information for relation extraction from Wikipedia. *IJCAI Workshop on Text-Mining & Link-Analysis (TextLink 2007)*.

[31] Nguyen, T.-V. T., Moschitti, A., & Riccardi, G. (2009). Convolution kernels on constituent, dependency and sequential structures for relation extraction. *Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing*, 3, 1378-1387.

[32] Ratnaparkhi, A. (1996). A Maximum Entropy part-of-speech tagger. *Proceedings of the Empirical Methods in Natural Language Processing Conference*. Retrieved from http://onlinelibrary.wiley.com/doi/10.1002/cbdv.200490137/abstract.

[15] Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press Cam‐

[16] Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. *Machine learning*, 37(3), 277, 296, Retrieved from, http://www.springer‐

[17] Fundel, K., Küffner, R., & Zimmer, R. (2007). RelEx-Relation extraction using de‐

[18] Gartner, T., Flach, P., & Wrobel, S. (2003). On graph kernels: Hardness results and ef‐

*and Other Kernel-based Learning Methods*, Cambridge University Press.

www.sciencedirect.com/science/article/pii/S0004370205000366.

ficient alternatives. *Learning Theory and Kernel Machines*, 129-143.

link.com/index/q3003163876k7h81.pdf.

pendency parse trees. *Bioinformatics*, 23(3), 365-371.

*39th Annual Meeting of the Association for Computational Linguistics*.

*Empirical Methods in Natural Language Processing*, 724-731.

bID=51413.

34 Theory and Applications for Advanced Text Mining Text Mining

*the EACL)*.

*NIPS-2001*.

1022627411411.

bridge, MA

*Software and Applications*, 36(8).


[33] Reichartz, F., Korte, H., & Paass, G. (2009). Dependency tree kernels for relation ex‐ traction from natural language text. *Machine Learning and Knowledge Discovery in Da‐ tabases*, 270-285, Springer.

**Chapter 2**

## **Analysis for Finding Innovative Concepts Based on Temporal Patterns of Terms in Documents**

Hidenao Abe

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/52210

© 2012 Abe; licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


## **1. Introduction**

In recent years, information systems in every field have developed rapidly, and the amount of electronically stored data has grown day by day. Electronic document data are stored in such systems mainly for recording and preserving facts. In the medical field, documents accumulate not only in clinical settings but also in worldwide repositories built up by various medical studies. Such data now provide valuable information to medical researchers, doctors, engineers, and related workers, who retrieve the documents relevant to their expertise. They want up-to-date knowledge in order to provide better care to their patients. Hence, detecting novel, important, and remarkable phrases and words has become very important for noticing valuable evidence in the documents. However, this detection depends greatly on each reader's skill in finding good evidence.

With respect to biomedical research documents, the MeSH [1] vocabulary provides overall concepts and terms for describing them in a simple and accurate way. This structured vocabulary is maintained by the NIH to reflect novel findings and interests in each specific field, considering the number of published documents and other factors based on the studies. Through such consideration, new concepts are added to the vocabulary every year if they are judged useful. One criterion for adding a new concept is whether the attention paid to it by researchers appears as an emergent pattern in published documents. In fact, a few hundred new concepts are added every year, and the maintenance of the concepts and their related structure is done manually. Thus, MeSH also serves as an important knowledge base for the biomedical research field. However, the relationships between particular data-driven trends and the newly added concepts have not been clarified.

By clarifying the relationship between such a maintained vocabulary and the trends of term usage, readers in the field can more clearly detect the terms that matter for understanding up-to-date trends in their field. Under the above-mentioned motivation, I developed a method for analyzing the similarity of terms on the structured taxonomy and the trend of a data-driven index of the terms [2]. In this chapter, I describe a result of applying the method to identify similar terms based on the temporal behavior of the usage of each term. The temporal pattern extraction method based on a term usage index, consisting of automatic term extraction, term importance indices, and temporal clustering, is described in the next section. Then, in Section 4, a case study shows the differences between similar terms detected by the temporal patterns of medical terms related to migraine drug therapy in MEDLINE documents. Finally, I conclude with the analysis results in Section 6.


## **2. The method for analyzing distances on taxonomy and temporal patterns of term usage**

In this section, I describe a method for detecting various trends of words and phrases in temporally published corpora. In order to analyze the relationships between the usage of words and phrases in temporally published documents and their differences on a taxonomy, this trend detection method uses a temporal pattern extraction method based on data-driven indices [2]. Using the similar terms identified on the basis of the temporal patterns of the indices, the method measures the similarity between terms on the taxonomy, which can be regarded as a set of tree structures of concepts in a particular domain. One reason the method uses temporal patterns is to detect various trends: the important aim is not only to detect a particular trend specified by a user, but also to detect various trends arising from the nature of the given corpora. The other reason is to find representative terms, called 'keywords', in each specific field on the basis of those trends. Considering these two aims, the method uses a temporal pattern extraction process based on the temporal behavior of terms as measured by an importance index.

Then, on the basis of the temporal behavioral similarity under each index, the distances between the similar terms, i.e., the members of each temporal pattern, are measured. Using the distance on the structured vocabulary, the averages of the distances between the terms included in the temporal patterns are compared in order to analyze the relationship between the trends of the temporal patterns and the similarities of the terms on the vocabulary.

In the following sections, the method for detecting temporally similar terms based on each importance index is described first. Subsequently, the distance measure on the structured vocabulary is explained.

## **2.1. Obtaining temporal patterns of data-driven indices related to term usages**

In order to discover various trends related to the usage of terms in a temporally published corpus, the framework [2] was developed as a method for obtaining temporal patterns of an importance index. This framework obtains temporal patterns based on the importance index from the given temporally published sets of documents. It consists of the following processes.


• Automatic term extraction in overall documents

• Calculation of importance indices

• Obtaining temporal clusters for each importance index

• Assignment of some meanings for the obtained temporal patterns

#### *2.1.1. Automatic term extraction in a given corpus*

First, a system determines the terms in a given corpus. Considering the difficulty of constructing a dedicated dictionary for each domain, term extraction without any dictionary is required. As a representative method for extracting terms automatically, a term extraction method [3] based on the adjacent frequencies of compound nouns is selected. This method detects technical terms by using the following score for a candidate compound noun *CN*:

$$FLR(\text{CN}) = f(\text{CN}) \times \left(\prod\_{i=1}^{L} (FL(N\_i) + 1)(FR(N\_i) + 1)\right)^{\frac{1}{2L}}$$

where *f*(*CN*) is the frequency of the candidate compound noun *CN* on its own, and *FL*(*Ni*) and *FR*(*Ni*) are the numbers of distinct words appearing to the left and to the right, respectively, of each noun *Ni* in the *bi*-grams included in *CN*. Each compound noun *CN* is constructed from *L* (*L* ≥ 1) nouns.

For example, suppose a set of compound nouns *S* = {*data mining*, *text mining*, *mining method*} is obtained from a corpus, and each appears just once in the corpus. We want to know the FLR score of *data mining*, *FLR*(*data mining*). The left frequency of 'data' is 0, i.e., *FL*(*data*) = 0. The right frequency of 'data' is 1, because 'mining' is the only word that appears to the right of 'data'; so *FR*(*data*) = 1. In the same way, the frequencies of 'mining' are *FL*(*mining*) = 2 and *FR*(*mining*) = 1. Then *FLR*(*data mining*) is calculated as follows.

$$\text{FLR}(data\text{ mining}) = 1 \times \left((0+1)(1+1) \times (2+1)(1+1)\right)^{\frac{1}{4}} = 12^{\frac{1}{4}} \approx 1.861$$
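As a minimal sketch of this scoring (function and variable names are mine, not the chapter's; *FL* and *FR* are taken as counts of distinct adjacent nouns, which matches the worked example), note that with the formula's 1/(2*L*) exponent the toy example evaluates to 12^(1/4) ≈ 1.861:

```python
from collections import defaultdict
from math import prod

def flr_scores(candidates, freq):
    """Nakagawa-style FLR scores for candidate compound nouns.
    candidates: tuples of nouns; freq: candidate -> corpus frequency."""
    left, right = defaultdict(set), defaultdict(set)
    # Collect distinct left/right neighbours of every noun from the
    # bigrams inside each candidate compound noun.
    for cn in candidates:
        for a, b in zip(cn, cn[1:]):
            right[a].add(b)   # b occurs to the right of a
            left[b].add(a)    # a occurs to the left of b
    scores = {}
    for cn in candidates:
        L = len(cn)
        geo = prod((len(left[n]) + 1) * (len(right[n]) + 1) for n in cn)
        scores[cn] = freq[cn] * geo ** (1 / (2 * L))
    return scores

cands = [("data", "mining"), ("text", "mining"), ("mining", "method")]
scores = flr_scores(cands, {c: 1 for c in cands})
print(round(scores[("data", "mining")], 3))  # → 1.861
```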

#### *2.1.2. Calculation of data-driven indices for each term in each set of documents*

After determining the terms in the given corpus, the system calculates importance indices of these terms in the documents of each time period, representing the usage of the terms as values. For temporally published corpora, users can set the period as they choose. In most cases the period is yearly, monthly, or daily, because published documents carry timestamps. In this framework, the set of documents published in a given period is denoted as *Dperiod*.

Several importance indices for words and phrases in a corpus are well known. Term frequency times inverse document frequency (tf-idf) is one of the popular indices for measuring the importance of terms [4]. The tf-idf value of each term *termi* can be defined for the documents in each period, *Dperiod*, as follows:

$$\text{TFIDF}(term\_i, D\_{period}) = tf(term\_i, D\_{period}) \times \log \frac{|D\_{period}|}{df(term\_i, D\_{period})}$$

where *tf*(*termi*, *Dperiod*) is the frequency of the term *termi* in the corpus of |*Dperiod*| documents. Here, |*Dperiod*| is the number of documents in the period, and *df*(*termi*, *Dperiod*) is the number of documents containing the term.
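A minimal sketch of this per-period computation (function and variable names are mine; each document is a plain token list):

```python
import math

def tfidf(term, period_docs):
    """tf-idf of a term over one period's document set D_period,
    following the definition above."""
    tf = sum(doc.count(term) for doc in period_docs)   # term frequency in D_period
    df = sum(1 for doc in period_docs if term in doc)  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(period_docs) / df)

# toy period: three documents
docs_2005 = [["migraine", "drug", "drug"], ["drug", "trial"], ["headache"]]
print(round(tfidf("drug", docs_2005), 3))  # → 1.216
```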


In the proposed framework, these indices are treated explicitly as a temporal dataset. This dataset consists of the values of the terms at each time point, using each index *Index*(·, *Dperiod*) as the features. Figure 1 shows an example of such a dataset, consisting of an importance index for each period. The value of the term *termi* is described as *Index*(*termi*, *Dperiod*) in Figure 1.


**Figure 1.** Example of dataset consisting of an importance index.
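The Figure 1 layout amounts to a term × period matrix of index values; a minimal sketch (`build_index_dataset`, `tf_index`, and the toy corpus are my own illustrative names, with raw term frequency standing in for a real importance index):

```python
def build_index_dataset(terms, periods, index_fn):
    """Rows = terms, columns = periods, cells = Index(term_i, D_period),
    mirroring the layout of Figure 1."""
    return {t: [index_fn(t, p) for p in periods] for t in terms}

# toy corpus: token lists grouped into yearly periods
corpus = {
    2004: [["data", "mining"], ["text"]],
    2005: [["data", "mining"], ["data", "mining", "method"]],
}

def tf_index(term, period):
    # stand-in importance index: raw term frequency in the period
    return sum(doc.count(term) for doc in corpus[period])

dataset = build_index_dataset(["data", "mining"], [2004, 2005], tf_index)
print(dataset)  # → {'data': [1, 2], 'mining': [1, 2]}
```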

#### *2.1.3. Generating temporal patterns by using temporal clustering*

After obtaining the dataset, the framework lets the user choose an adequate trend extraction method for it. A survey of the literature shows that many conventional methods for extracting useful time-series patterns have been developed [5, 6]. Users can apply an adequate time-series analysis method and identify important patterns by processing the values in the rows of Figure 1. By considering these patterns together with temporal information, users can understand trends related to the terms, such as the transition of technological development reflected in technical terms. The temporal patterns, as clusters, also provide information about the similarity between terms at the same time. The system denotes the similar terms, based on the temporal cluster assignments, as *termi* ∈ *ck*.
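As one deliberately simple choice of temporal clustering, a plain k-means over the rows of the dataset groups terms with similar temporal behavior; this is only an illustrative stand-in for whichever clustering method a user actually applies:

```python
import random

def kmeans_rows(rows, k, iters=50, seed=0):
    """Plain k-means over term time series (the rows of the
    Figure 1 dataset); each cluster is one temporal pattern c_k."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    assign = [0] * len(rows)
    for _ in range(iters):
        # assign each time series to its nearest centroid
        for i, r in enumerate(rows):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])),
            )
        # recompute centroids as per-period means of their members
        for c in range(k):
            members = [r for i, r in enumerate(rows) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

# four term time series: two rising, two falling
rows = [[1, 2, 3], [1, 2, 4], [9, 8, 7], [9, 9, 6]]
assign, centroids = kmeans_rows(rows, k=2)
```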

#### *2.1.4. Assigning meanings of the trends of the obtained temporal patterns*

After obtaining the temporal patterns *ck*, in order to identify the meaning of each pattern from the trends of the extracted terms under each importance index, the system applies linear regression analysis. The degree (slope) of the centroid of a temporal pattern *c* is calculated as follows:

$$\text{Deg}(c) = \frac{\sum\_{j=1}^{M} (c\_j - \bar{c})(x\_j - \bar{x})}{\sum\_{j=1}^{M} (x\_j - \bar{x})^2}$$

where *x*¯ is the average of *xj* = *tj* − *t*<sup>1</sup> over the M time points and *y*¯ (equal to *c*¯) is the average of the centroid values *cj*. Each value of the centroid, *cj*, is a representative value of the importance index values of the terms assigned to the pattern, *Index*(*termi* ∈ *ck*, *Dperiod*). Each time point *tj* corresponds to a period, and the first period is assigned to the first time point, *t*1.

Simultaneously, the system calculates the intercept *Int*(*c*) of each pattern *ck* as follows:

$$\operatorname{Int}(c) = \bar{y} - \operatorname{Deg}(c)\bar{x}$$

Then, by using these two linear trend criteria, users assign meanings to the temporal patterns related to the usage of the terms.
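The two criteria are the ordinary least-squares slope and intercept over the centroid values; a small sketch (names mine):

```python
def linear_trend(centroid):
    """Least-squares Deg(c) and Int(c) for a pattern centroid,
    taking x_j = t_j - t_1 = 0, 1, 2, ... for the M time points."""
    m = len(centroid)
    xs = range(m)                     # x_j = t_j - t_1
    xbar = sum(xs) / m
    ybar = sum(centroid) / m          # \bar{y} = \bar{c}
    deg = sum((c - ybar) * (x - xbar) for c, x in zip(centroid, xs)) \
        / sum((x - xbar) ** 2 for x in xs)
    return deg, ybar - deg * xbar     # Deg(c), Int(c)

# a steadily rising pattern: slope ≈ 0.2, intercept ≈ 0.1
deg, intercept = linear_trend([0.1, 0.3, 0.5, 0.7])
```

A positive `deg` marks an emergent (rising) pattern, a negative one a declining pattern, and `intercept` indicates the pattern's initial level.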

#### **2.2. Defining similarity of terms on a structured taxonomy**


In this chapter, a tree structure of concepts defined with a relation such as is-a is used as a 'structured taxonomy'. In the biomedical domain, MeSH (Medical Subject Headings) [1] is one of the most important structured taxonomies for representing the key concepts of biomedical research articles. MeSH consists of 16 categories, including not only categories proper to biomedicine but also general categories such as information science. It contains 25,588 concepts as 'Descriptors' and 464,282 terms as 'Entry Terms' in the 2010 version. Each concept has one or more entry terms, as well as tree numbers that serve as its identifiers in the hierarchical structure.

Over this structure, the similarity of each pair of terms is defined by using the distance in the MeSH tree, as shown in Figure 2.

For example, when the distance between two terms *termi*1 and *termi*2 is denoted as *Dist*(*termi*1, *termi*2), the distance between 'migraine' and 'sharp headache', *Dist*(*migraine*, *sharp headache*), is calculated as 8 or 9. By using this distance, the similarity between each pair of terms is defined as follows:

$$Sim(term\_{i1}, term\_{i2}) = \frac{1}{1 + Dist(term\_{i1}, term\_{i2})}$$

where the similarity can be calculated only when both terms have tree numbers in MeSH.

For the overall terms belonging to some group *g*, a representative value is also defined as their averaged similarity within the group, as follows:

$$AvgSim(g) = \frac{1}{numPair} \sum\_{term\_{i1}, term\_{i2} \in g} Sim(term\_{i1}, term\_{i2})$$

where *numPair* is the number of matched pairs of terms included in the group *g*, defined as *numPair* = *<sub>m</sub>C*<sub>2</sub>, where *m* is the number of terms in *g* that have tree numbers, i.e., *m* = |{*termi* ∈ *g* : *hasTreeNumber*(*termi*)}|.
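As a sketch, the distance between two MeSH tree numbers can be computed as the number of edges through their deepest common ancestor; when a term carries several tree numbers, taking the minimum distance over them is one plausible choice (the chapter reports a range such as "8 or 9", so the exact aggregation is an assumption here, and all names are illustrative):

```python
from itertools import combinations

def tree_distance(tn1, tn2):
    """Number of edges between two MeSH tree numbers via their common prefix."""
    p1, p2 = tn1.split('.'), tn2.split('.')
    common = 0
    for a, b in zip(p1, p2):
        if a != b:
            break
        common += 1
    return (len(p1) - common) + (len(p2) - common)

def sim(tns1, tns2):
    """Sim(term_i1, term_i2) = 1 / (1 + Dist), minimized over the tree numbers."""
    dist = min(tree_distance(a, b) for a in tns1 for b in tns2)
    return 1 / (1 + dist)

def avg_sim(term_tree_numbers):
    """AvgSim(g): average similarity over the mC2 pairs of terms in a group."""
    pairs = list(combinations(term_tree_numbers, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

# Sibling tree numbers are two edges apart, so their similarity is 1/3:
print(sim(['C10.228.140'], ['C10.228.614']))  # → 0.3333...
```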

Analysis for Finding Innovative Concepts Based on Temporal Patterns of Terms in Documents

http://dx.doi.org/10.5772/52210

**Figure 2.** Example of MeSH hierarchy structure for migraine disorders and headache.

## **3. A case study for detecting trends of terms by obtaining temporal patterns**

In this section, I describe a case study analyzing the similarity of terms for which temporal patterns were detected in medical research documents. For obtaining the temporal patterns, I used an importance index of the terms in each set of documents that were published year by year. The medical research documents are retrieved from MEDLINE by using a search scenario over time. The scenario is related to migraine drug therapy, similar to the first scenario in a previous paper on MeSHmap [7].

In this case study, I consider the search scenario and three meanings of trends as temporal clusters, using the degrees and intercepts of the trend lines for each term. The following meanings are assigned: "emergent" to ascending trend lines with negative intercepts, "subsiding" to descending trend lines with positive intercepts, and "popular" to ascending trend lines with positive intercepts.
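These assignment rules follow directly from a least-squares fit of each term's yearly index values. A minimal sketch (function names are mine, not the chapter's):

```python
def fit_line(values):
    """Ordinary least-squares slope and intercept for yearly index values."""
    xs = list(range(len(values)))
    n = len(values)
    x_bar = sum(xs) / n
    y_bar = sum(values) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

def trend_label(values):
    """Assign the chapter's trend meanings from the fitted trend line."""
    slope, intercept = fit_line(values)
    if slope > 0 and intercept < 0:
        return "emergent"
    if slope < 0 and intercept > 0:
        return "subsiding"
    if slope > 0 and intercept > 0:
        return "popular"
    return "other"

print(trend_label([0.0, 0.1, 0.4, 0.9]))  # rising from near zero → "emergent"
```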

## **3.1. Analysis of a disease over time**

In this scenario, a user may be interested in exploring the progression of ideas in a particular domain, say, corresponding to a particular disease. By performing the search such that the disease is represented according to the year, one may obtain a temporal assessment of the changes in the field.

Let us assume that the user wants to explore the evolution of ideas about drugs used to treat chronic hepatitis. The user performs a search for abstracts of articles matching "chronic hepatitis/drug therapy [MH:NOEXP] AND YYYY [DP] AND clinical trial [PT] AND english [LA]" through PubMed. The string "YYYY" is replaced with the four digits of each publication year to be retrieved. The retrieval from PubMed can be performed through their WebAPI with a script written in Perl [8]. By iterating the query while updating the year, the script retrieves the published research documents in the field specified by the query string. In this example, the temporal sets of documents are gathered year by year for the field related to drug therapy for chronic hepatitis.
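The year-substitution step described above can be sketched as follows (the chapter uses a Perl script against PubMed's web API [8]; this Python version is only illustrative, and the actual API calls are omitted):

```python
QUERY_TEMPLATE = ('chronic hepatitis/drug therapy [MH:NOEXP] AND '
                  'YYYY [DP] AND clinical trial [PT] AND english [LA]')

def yearly_queries(template, first_year, last_year):
    """Replace the YYYY placeholder to get one PubMed query per year."""
    return [template.replace('YYYY', str(year))
            for year in range(first_year, last_year + 1)]

queries = yearly_queries(QUERY_TEMPLATE, 1982, 2009)
print(len(queries))  # → 28 yearly queries
print(queries[0])
```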

With this search query, we obtain articles published between 1982 and 2009 in the abstract mode of PubMed<sup>1</sup>. Figure 3 shows the numbers of article titles and abstracts retrieved by the query. In this study, each abstract is treated as one document.

**Figure 3.** Numbers of documents with titles and abstracts related to hepatitis drug therapy published from 1982 to 2009.

From all of the retrieved abstracts, the automatic term extraction method identifies 12,194 terms. As for the titles, the method extracted 1,428 terms.

## **3.2. Obtaining temporal patterns of medical terms about chronic hepatitis drug therapy studies**

The document frequency and the tf-idf values are calculated as the importance indices for each year, on the titles and the abstracts respectively. By using the document sets and these indices, the system obtained the datasets for the temporal clusters, which consist of the temporal behavior of each index, year by year, for each term.

As for the clustering algorithm, the k-means implementation in Weka [9] (Weka-3-6-2) is applied. Since the implementation searches for better cluster assignments by minimizing the sum of squared errors (SSE), the upper limit of the number of clusters is set to 1% of the number of terms, and the maximum number of iterations to search for a better assignment is set to 500.

<sup>1</sup> The current MeSH heading for chronic hepatitis was introduced in 1982.
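The chapter relies on Weka's k-means; a self-contained sketch of the same clustering step over toy yearly index vectors (pure Python, with illustrative data and names) looks like this:

```python
import random

def kmeans(series, k, max_iter=500, seed=0):
    """Plain k-means over yearly index vectors, minimizing the SSE.

    series: dict mapping a term to its list of yearly index values.
    Returns {cluster_id: [terms]}.
    """
    rng = random.Random(seed)
    terms = list(series)
    centroids = [list(series[t]) for t in rng.sample(terms, k)]
    assign = {}
    for _ in range(max_iter):
        new_assign = {
            t: min(range(k),
                   key=lambda c: sum((v - centroids[c][i]) ** 2
                                     for i, v in enumerate(series[t])))
            for t in terms
        }
        if new_assign == assign:   # converged: assignments stopped changing
            break
        assign = new_assign
        for c in range(k):         # recompute centroids as member averages
            members = [series[t] for t in terms if assign[t] == c]
            if members:
                centroids[c] = [sum(vals) / len(members)
                                for vals in zip(*members)]
    clusters = {}
    for t, c in assign.items():
        clusters.setdefault(c, []).append(t)
    return clusters

data = {
    "interferon":  [0.9, 0.8, 0.7, 0.6],   # declining yearly values
    "ribavirin":   [0.8, 0.7, 0.6, 0.5],
    "triptan":     [0.0, 0.2, 0.5, 0.9],   # rising yearly values
    "sumatriptan": [0.1, 0.3, 0.6, 0.8],
}
clusters = kmeans(data, k=2)
print(sorted(sorted(ts) for ts in clusters.values()))
```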



Table 1 shows the result of the k-means clustering on the sets of documents.

**Table 1.** Overall result of temporal clustering on titles and abstracts about chronic hepatitis drug therapy by using the three importance indices.

Figure 4 shows the centroid values of the temporal clusters and the representative terms of each temporal pattern on the title corpus, based on the temporal tf-idf. The centroid values are the averages of the yearly values of the terms in each cluster. The representative terms are those with the highest FLR scores in each temporal cluster. The displayed cluster is selected under the following conditions: it includes phrases and has the highest linear degree with the minimum y-intercept, found by sorting the average degrees and the average intercepts of the 14 clusters.

**Figure 4.** The representative terms and values of tf-idf temporal patterns on the titles of the chronic hepatitis articles.

As shown in Figure 4, the method can detect trends based on the temporal behaviors of terms. Although the temporal patterns and the similar terms belonging to the clusters show the trends and the similar groups at the same time, the meaning indicated by each group should be evaluated by medical experts.

## **4. Analyzing temporal trends of terms and the similarities on MeSH structure**


In this example, the sets of documents published year by year in the field related to drug therapy for migraine are gathered. Let us assume that the user wants to explore the evolution of ideas about drugs used to treat migraine<sup>2</sup>. The user performs a search for abstracts of articles matching "migraine/drug therapy [MH:NOEXP] AND YYYY [DP] AND clinical trial [PT] AND english [LA]" through PubMed, in the same way as the retrieval for chronic hepatitis in Section 3.

With this search query, articles published between 1980 and 2009 are retrieved in the abstract mode of PubMed. Figure 5 shows the numbers of article titles and abstracts retrieved by the query.

**Figure 5.** Numbers of documents with titles and abstracts related to migraine drug therapy published from 1980 to 2009.

By assuming the abstracts and the titles as each corpus, the automatic term extraction is applied. From all of the retrieved abstracts, the automatic term extraction method identifies 61,936 terms. Similarly, from all of the titles, the system extracts 6,470 terms.

<sup>2</sup> Migraine causes sharp and severe headaches. People who have migraine are common worldwide. The severe headaches cause economic disadvantages not only to the patients but sometimes also to society.

## **4.1. Obtaining temporal patterns of medical terms about migraine drug therapy studies**


The document frequency and the tf-idf values are calculated as the importance indices for each year, on the titles and the abstracts respectively. Then, the temporal clusters that consist of the temporal behavior of each index, year by year, for each term are obtained. The clustering algorithm used in the following experiment is the same as the setting in Section 3.


Table 2 shows the result of the k-means clustering on the sets of documents.

**Table 2.** Overall result of temporal clustering on titles and abstracts about migraine drug therapy by using the three importance indices.

Figure 6 shows the emergent cluster centroid and the top ten emergent terms on the abstracts, on the basis of tf-idf. The cluster is selected under the following conditions: it includes phrases and has the highest linear degree with the minimum y-intercept, found by sorting the average degrees and the average intercepts of the 14 clusters.

**Figure 6.** The detailed tf-idf temporal values included in the emergent temporal pattern (Cluster #14).

As shown in Figure 6, the method detected the emergent terms included in the emergent pattern related to triptan drug therapy. The cluster also includes some terms related to the timing of the therapy. The drugs including triptans, which appear in this pattern, were approved in the late 1990s in the US and European countries, and in the early 2000s in Japan. Based on this result, the method obtained the temporal patterns related to the topics that attract the interest of researchers in this field. In addition, the degree of increase and the shapes of the temporal patterns of each index show some aspects of the movements of the research issue.

## **4.2. Similarity of the terms in obtained temporal patterns on MeSH**


By using the similarity measure described in Section 2, the averaged similarity of the medical terms included in each temporal pattern is calculated. In order to analyze the relationship between the trends and the similarities, a comparison is performed on the representative values of the averaged similarities of the terms.

As shown in Table 3, the similarities for each temporal pattern are calculated. A smaller similarity value means that the terms included in the temporal pattern are defined in separate places on the MeSH structure; a greater similarity value means that the terms that are similar in the temporal pattern are also defined similarly on the MeSH structure.


**Table 3.** Temporal patterns obtained for the tf-idf dataset on the titles and the similarities of the terms in each temporal pattern.

Then, to clarify the relationships between the temporal patterns and the similarity on the taxonomy, the similarity values are compared between the two meanings of the linear trends: emergent or not emergent. First, the averages of the two groups of similarity values are tested by using a t-test. Then, the medians of the two groups are compared by using the Wilcoxon rank sum test. Table 4 shows the averages and the medians of the similarity values.
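The rank-sum comparison can be sketched with the usual normal approximation (no tie correction; in practice one would call a library routine such as scipy.stats.ranksums — this minimal version is only illustrative):

```python
import math

def rank_sum_p(xs, ys):
    """Two-sided Wilcoxon rank-sum test, normal approximation, assuming no ties."""
    n1, n2 = len(xs), len(ys)
    ranked = sorted([(v, 'x') for v in xs] + [(v, 'y') for v in ys])
    # Rank sum W of the first sample within the pooled, sorted values:
    w = sum(rank for rank, (_, grp) in enumerate(ranked, start=1) if grp == 'x')
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2))

# Clearly separated similarity values give a small p-value:
emergent = [0.10, 0.11, 0.12, 0.13, 0.14]
popular = [0.15, 0.16, 0.17, 0.18, 0.19]
print(rank_sum_p(emergent, popular) < 0.05)  # → True
```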

The similarity values around 0.13 mean that the pairs of terms are defined in places six to seven paths apart. By testing the difference between the two groups based on the linear trends, for the abstracts, the similarities of the terms included in the emergent temporal patterns are significantly smaller than those of the terms included in the popular patterns, based on the tf-idf values. This result indicates that the tf-idf index detects new combinations of concepts as its emergent trend. Besides, based on the temporal patterns using the document frequency, the terms included in the emergent patterns are defined more similarly: more frequently used terms in recently published documents are defined nearer to each other than the other popular terms. This is understandable when one considers that the structure of the concepts is maintained manually.

(a) Averages

| | Abstracts: Emergent | Abstracts: Not Emergent | Titles: Emergent | Titles: Not Emergent |
|--------|--------|--------|-------|-------|
| tf-idf | 0.126∗ | 0.130∗ | 0.134 | 0.139 |
| df | 0.129∗ | 0.125∗ | 0.134 | 0.141 |

(b) Medians

| | Abstracts: Emergent | Abstracts: Not Emergent | Titles: Emergent | Titles: Not Emergent |
|--------|--------|--------|-------|-------|
| tf-idf | 0.126∗ | 0.130∗ | 0.132 | 0.138 |
| df | 0.131∗ | 0.125∗ | 0.133 | 0.139 |

**Table 4.** Comparison of the representative values: (a) averages, (b) medians. ∗ indicates a significant difference at *α* = 0.05.

## **5. Related work**

Related to the method described here, there are two separate research topics. One is detecting emergent trends in a given temporal corpus. The other is learning a structured taxonomy, or ontology, from a given corpus. They have not been combined into a method for analyzing the relationship between emergent terms and the places in the structure where those terms should appear. This work therefore provides a novel idea not only as a text mining approach, but also for the two separate fields of study.

## **5.1. Emergent Trend Detection (ETD) methods**

In order to detect emergent trends in temporally published corpora, methods for detecting emergent trends have been developed [10, 11]. Most of these methods concentrate on finding just one trend in each setting. Moreover, they find terms that represent emergent trends rather than the emergent trend itself. Thus, the user of these methods must interpret the meaning of the terms detected by the ETD method.

Conventional ETD methods are mostly based on the probabilistic transition of term appearances, as shown in works such as [12, 13]. These methods succeed in detecting an emergent trend, which is actually a set of terms. However, they do not detect various trends at the same time, as described in Section 3 and Section 4. In addition to this difference, the proposed trend detection method can visualize both the representative values of each temporal pattern and the detailed values for each term, using simple time-series charts.

## **5.2. Ontology Learning (OL) methods from domain corpora**

In order to construct a taxonomy for each domain from a given corpus, methods for learning ontologies have been proposed [14, 15]. However, they do not consider how the structured taxonomy differs over time. The maintenance of the structure depends largely on the manual work of domain experts. Support methods for maintaining the structured taxonomy are needed to keep it useful and up-to-date. For this issue, an advanced version of the method proposed in this chapter will provide such support with more objective evidence based on the temporal corpora of each particular domain.

## **6. Conclusion**

In this chapter, I described a method for detecting trends of terms in articles published in MEDLINE, with a case study on chronic hepatitis studies. The results of this case study show that the method can find various trends of terms and similar terms at the same time. In this case study, the similar terms were detected by using their temporal behavior under two importance indices: document frequency and the tf-idf index. Then, the temporal patterns of the biomedical terms under the two importance indices were obtained. The patterns indicate similar usages of the terms in the biomedical research documents taken as a temporal corpus.

Subsequently, using migraine drug therapy studies, a comparison is shown between the similarity of the terms grouped by the trend detection method and that of the terms in the structured vocabulary. Using MeSH as the structured taxonomic definition of the medical terms, we compared the averaged similarity, based on the distances in the tree structure, between the terms included in each temporal pattern. By separating the trends of the temporal patterns with the linear regression technique, the averaged similarities of the terms in each pattern show significant differences on the larger structured vocabulary. In the temporal patterns with an emergent tf-idf trend, the included terms are less similar than the terms included in the popular patterns. This indicates that novel concepts are obtained from new combinations of widely separated existing concepts. Besides, the similarity based on the other index shows the opposite relationship between its trend and the similarity on the taxonomic definition.

In the future, more indices representing various aspects of term usage in a corpus will be introduced and compared. Then, based on the similarities in the temporal behavior of each index, expressed as temporal patterns, predictive models such as numerical prediction models will be introduced for predicting adequate places for new concepts in a structured taxonomy.

## **Author details**

Hidenao Abe<sup>⋆</sup>

Department of Information Systems, Faculty of Information and Communications, Bunkyo University, Japan

<sup>⋆</sup> Address all correspondence to: hidenao@shonan.bunkyo.ac.jp

**Chapter 3**

**Text Clumping for Technical Intelligence**

Alan L. Porter and Yi Zhang

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/50973

© 2012 Porter and Zhang; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.





## **1. Introduction: Concepts, Purposes, and Approaches**

This development responds to a challenge. Text mining software can conveniently generate very large sets of terms or phrases. Our examples draw from use of VantagePoint (or equivalently, Thomson Data Analyzer – TDA) software [1] to analyze abstract record sets. A typical search on an ST&I topic of interest might yield, say, 5,000 records. One approach is to apply VantagePoint's Natural Language Processing (NLP) to the titles, and also to the abstracts and/or claims. We also take advantage of available topic-rich fields such as keywords and index terms. Merging these fields could well offer on the order of 100,000 terms and phrases in one field (list). That list, unfortunately, will surely contain much noise and redundancy. The text clumping aim is to clean and consolidate such a list to provide rich, usable content information.

As described, the text field of interest can contain terms (i.e., single words or unigrams) and/or phrases (i.e., multi-word noun + modifiers term sets). Herein, we focus on such NLP phrases, typically including many single words also. Some of the algorithms pertain especially to multi-word phrases, but, in general, many steps can usefully be applied to single-word term sets. Here we focus on analyzing NLP English noun-phrases – to be called simply "phrases."

Our larger mission is to generate effective Competitive Technical Intelligence (CTI). We want to answer basic questions of "Who is doing What, Where and When?" In turn, that information can be used to build "innovation indicators" that address users' CTI needs [2]. Typically, those users might be:


**•** Science, Technology and Innovation (ST&I) policy-makers (striving to advance their country's competitiveness)

**•** R&D managers (wanting to invest in the most promising opportunities)

**•** Researchers (seeking to learn about the nearby "research landscape")

**•** Information professionals (compiling most relevant information resources)


**Figure 1.** Term Clumping for Technical Intelligence

We focus on ST&I information sets, typically in the form of field-structured abstract records retrieved from topical database searches [e.g., Web of Science (WoS), Derwent World Patent Index, Factiva]. These records usually contain a mix of free text portions (e.g., abstracts) and structured text fields (e.g., keywords, publication years). The software uses an import filter to recognize fields (i.e., to know where and how to find the authors and parse their names properly) for particular source sets, such as WoS. VantagePoint can merge multiple datasets from a given source database or from different sources (with guidance on field matching and care in interpreting).
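As a rough illustration of such field-structured records, here is a minimal parser for a tagged export format. The two-letter field tags and `ER` record terminator follow WoS conventions, but this sketch is our own simplification, not VantagePoint's import filter:

```python
def parse_records(text):
    """Split a tagged, field-structured export into one dict per record.
    Assumes lines like 'TI Some title', continuation lines indented with
    three spaces, and 'ER' marking the end of each record."""
    records, current, field = [], {}, None
    for line in text.splitlines():
        if line.startswith("ER"):                  # end of record
            records.append(current)
            current, field = {}, None
        elif len(line) > 3 and line[:2].isalpha() and line[2] == " ":
            field = line[:2]                       # new field tag
            current[field] = line[3:].strip()
        elif field and line.startswith("   "):     # continuation line
            current[field] += " " + line.strip()
    return records
```

Each resulting dict maps field tags (title, abstract, keywords, etc.) to text, which is the form the downstream term clumping steps assume.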

Figure 1 presents our framework for "term clumping." We combine established and relatively novel bibliometric and text mining techniques within this framework. It includes a number of steps to process a large phrase list. The top portion of the figure indicates choices to be made concerning which data resources to mine and selection criteria for the records to be analyzed. The next tier notes additional choices regarding which content-laden fields to process. The following two blocks contain the major foci of this chapter. "Text Cleanup" includes stopword and common term handling, through several steps to consolidate related terms. "Consolidation of terms into informative topical factors" follows. Here we treat basic "inductive methods." The elements of the Figure flagged with an asterisk (\*) are addressed in depth herein.

Figure 1 also points toward interests for future work. These include "purposive methods," wherein our attention focuses on particular terms based on external criteria – e.g., semantic TRIZ (Theory of Inventive Problem Solving) suggests vital functions and actions indicative of technological innovative potential [3, 4]. The idea is to search the target text fields for occurrences of theory-guided terms and adjacent content.

We are also keenly interested in pursuing single word analyses via Topic Modeling (TM) methods to get at themes of the record set under study. These hold appeal in providing tools that will work well in multiple languages and character sets (e.g., Chinese). The main language dependency that we confront is the use of NLP to extract noun phrases (e.g., VantagePoint's NLP is developed for English text).

The bottom portion of Figure 1 indicates interest in how best to engage experts in such topic identification processes. We distinguish three roles:



**•** Analyst: Professionals in data retrieval and analysis, who have analytical skills in handling text, but usually don't have domain knowledge;

**•** Expert: Professional researchers in the specific domain, knowledgeable over the domain, and able to describe the current status of the domain at both macro and micro levels;

**•** Information & Computer Scientist: Covering a range of skills from in-depth programming, through preparation of macros, to operating software to accomplish particular text manipulations.

So defined, engagement of experts presents challenges in terms of motivation, time required, and communication of issues so that the domain experts can readily understand and respond to the analyst's needs. Simple, intermediate stage outputs could have value in this regard.

In summary, this chapter addresses how best to clean and consolidate ST&I phrase lists from abstract record sets. The target is to semi-automate this "inductive" process (i.e., letting the data speak without predetermined identification of target terms). We aim toward semi-automation because the process should be tailorable to study needs. We are exploring a series of text manipulations to consolidate phrase lists. We are undertaking a series of experiments that vary how technical the content is, which steps are performed, in what sequence, and what statistical approaches are then used to further cluster the phrases or terms. In particular, we also vary and assess the degree of human intervention in the term clumping. That ranges from almost none, to analyst tuning, to active domain expert participation [5-7].


## **2. Review of Related Literature**

Given the scope of Figure 1, several research areas contribute. This chapter does not address the purposive analyses, so we won't treat the literature on importing index terms, or on TRIZ and Technology RoadMapping (TRM) -- of great interest in suggesting high value terms for CTI analyses.

Several of the steps to be elaborated are basic. Removal of "stopwords" needs little theoretical framing. It does pose some interesting analytical possibilities, however. For instance, Cunningham found that the most common modifiers provided analytical value in classifying British science [8]. He conceives of an inverted U shape that emphasizes analyzing moderately high frequency terms -- excluding both the very high frequency words (stopwords and commonly used scientific words, which provide high recall of records, but low precision) and low frequency words (suffering from low recall due to weak coverage, but high precision). Pursuing this notion of culling common scientific words, we remove "common words." In our analyses we apply several stopword lists of several hundred terms (including some stemming), and a thesaurus of some 48,000 common words in academic/scientific writing [9]. We are interested in whether removal of these enhances or, possibly, degrades further analytical steps' performance (e.g., Topic Modeling).
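Cunningham's inverted-U notion can be sketched as a frequency-band filter. This is a minimal illustration: the thresholds and toy records are ours, and the actual stopword lists and common-word thesaurus run to thousands of entries:

```python
from collections import Counter

def band_filter(records, stopwords, min_df=2, max_df_ratio=0.8):
    """Keep moderately frequent terms: drop stopwords, terms in fewer than
    min_df records (low recall), and terms appearing in more than
    max_df_ratio of the records (low precision)."""
    n = len(records)
    df = Counter()
    for terms in records:
        df.update(set(terms))
    keep = {t for t, c in df.items()
            if t not in stopwords and c >= min_df and c / n <= max_df_ratio}
    return [[t for t in terms if t in keep] for terms in records]
```

In practice the band boundaries would be tuned per study, since what counts as a "common" scientific word varies by domain.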

To state the obvious -- not all texts behave the same. Language and the venue for the discourse, with its norms, affect usage and text mining. In particular, we focus on ST&I literature and patent abstracts, with outreach to business and attendant popular press coverage of topics (e.g., the Factiva database). English ST&I writing differs somewhat from "normal" English in structure and content. For instance, scientific discourse tends to include many technical phrases that should be retained, not parsed into separate terms or part-phrases by NLP. VantagePoint's NLP routine [1] strives to do that. It also seeks to retain chemical formulas.

A research community has built up around bibliometric analyses of ST&I records over the past 60 or so years, see for instance [10-12]. DeBellis nicely summarizes many facets of the data and their analyses [13]. Our group at Georgia Tech has pursued ST&I analyses aimed especially at generating Competitive Technical Intelligence (CTI) since the 1970s, with software development to facilitate mining of abstract records since 1993 [1, 2, 14]. We have explored ways to expedite such text analyses, cf. [15, 16], as have others [17]. We increasingly turn toward extending such "research profiling" to aid in Forecasting Innovation Pathways (FIP), see for example [18].

Over the years many techniques have been used to model content retrieved from ST&I text databases. Latent Semantic Indexing (LSI) [19], Principal Components Analysis (PCA), Support Vector Machines (SVM), and Topic Modeling (TM) are among the key methods that have come forth [20].


PCA is closely related to LSI. Both use Singular Value Decomposition (SVD) to transform the basic terms-by-documents matrix to reduce ranks (i.e., to replace a large number of terms by a relatively small number of factors, capturing as much of the information value as possible). PCA eigen-decomposes a covariance matrix, whereas LSI does so on the term-document matrix. [See Wikipedia for the basic statistical manipulations.]
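The shared SVD machinery can be illustrated in a few lines (a toy term-by-document matrix of our own, NumPy assumed; this is a sketch of the general technique, not VantagePoint's routine). Terms used in the same documents land close together in the reduced LSI space:

```python
import numpy as np

# Toy term-by-document matrix: rows = terms, columns = documents.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],   # "nanotube"  (documents 1-2)
    [1.0, 2.0, 0.0, 0.0],   # "graphene"  (documents 1-2)
    [0.0, 0.0, 1.0, 1.0],   # "polymer"   (documents 3-4)
    [0.0, 0.0, 1.0, 1.0],   # "composite" (documents 3-4)
])

# LSI: SVD on the term-document matrix itself, keeping the k largest factors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_coords = U[:, :k] * s[:k]      # term coordinates in the reduced space

# PCA instead eigen-decomposes a covariance matrix built from the
# mean-centered data before the same kind of rank reduction.
eigvals, eigvecs = np.linalg.eigh(np.cov(X - X.mean(axis=0), rowvar=False))

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Here `cos(term_coords[0], term_coords[1])` is 1 (terms co-used in the same documents), while `cos(term_coords[0], term_coords[2])` is 0 (terms from disjoint document sets).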

VantagePoint uses a special variant of PCA developed to facilitate ST&I text analyses (used in the analyses reported here). This PCA routine generates a more balanced factor set than LSI (which extracts a largest variance explaining factor first; then a second that best explains remaining variance, etc.). The VantagePoint factor map routine applies a small-increment Kaiser Varimax Rotation (yielding more attractive results, but running slower, than SPSS PCA in developmental tests). Our colleague, Bob Watts of the U.S. Army, has led development of a more automated version of PCA, with an optimization routine to determine a best solution (maximizing inclusion of records with fewest factors) based on selected parameter settings (Principal Components Decomposition – PCD) [21]. He has also empirically compared PCD (inductive) results with a deductive approach based on use of class codes [22].

We apply PCA to term sets to generate co-occurrence based principal components. Because of the familiar use of "clusters," we also use that terminology, although other clustering approaches can yield different forms (e.g., K-means, hierarchical clustering). This PCA approach allows terms to appear in multiple factors.

We use the concept, "term clumping," as quite general – entailing various means of text consolidation (e.g., application of thesauri, fuzzy matching, stemming) with noise removal. Bookstein, Raita, and colleagues offer a somewhat more specialized, but related, interpretation pointing toward the aim of condensing terminology to better identify content-bearing words [23-25]. Term clumping addresses text (not document) "clustering." Any type of text clustering is based on co-occurrence of words in records (documents). Clustering, in turn, includes many variations plus additional statistical analyses with considerable commonality -- in particular, factor analysis. PCA can be considered as a basic factoring approach; indeed, we call its output principal components "factors." Similarity among these term grouping approaches arises in that they generally aim to maximize association within clusters and minimize association among clusters. Features to keep in mind include whether terms or phrases being clustered are allowed to be included in multiple clusters or not; whether algorithms yield the same results on rerun or may change (probabilistic methods); and whether useful visualizations are generated. Many further variations are available – e.g., hierarchical or non-hierarchical; building up or partitioning down; neural network based approaches (e.g., Kohonen Self-Organizing Maps), and so forth [26]. Research is actively pursuing many refinements, for many objectives, for instance [27]. Our focus is on grouping terms, but we note much complementary activity on grouping documents (based on co-occurrence with particular terms) [26], with special interest in grouping web sites, for instance [28].

Latent Semantic Indexing (LSI), or Latent Semantic Analysis, is a classical indexing method based on a Vector Space Model that introduces Singular-Value Decomposition (SVD) to uncover the underlying semantic structure in the text set. The key feature of LSI is to map those terms that occur in similar contexts into a smaller "semantic space" and to help determine the relationships among terms (synonymy and polysemy) [17, 29, 30]. When applied on co-occurrence information for large text sources, there is no need for LSI to import domain literatures or thesauri (what we call "purposive" or aided text clumping). There are also various extended LSI methods [31]. Researchers are combining LSI with term clumping variations in order to relate synonymous terms from massive content. For example, Maletic and Marcus combine semantic and structural information [32] and Xu et al. seek to associate genes based on text mining of abstracts [30].

and eliminates them so that further author analyses can focus on the senior authors or author team. The macro [available at www.theVantagePoint.com] adds major co-authors into the

Text Clumping for Technical Intelligence http://dx.doi.org/10.5772/50973 57

Lastly, we consider various quality assessment approaches. Given that one generates clus‐ tered text in various forms, which are best? We look toward three approaches. First, we want to ask the target users. While appealing, this also confronts issues – e.g., our PCA out‐ put "names" the resulting factors, whereas topic modeling does not. How can we compare these even-handedly? Second are statistical approaches that measure some form of the de‐ gree of coherence within clusters vs. among clusters [52]. Third are record assignment tests – to what extent do alternative text clumping and clustering sequences correctly distinguish

Figure 1 arrays a wide range of possible term clumping actions. As introduced in the previ‐ ous sections, we are interested in many of those, but within the scope of this chapter we fo‐

term name. We incorporate these two routines in the present exercises.

mixed dataset components? Here we seek both high recall and precision.

**3. Empirical Investigation:Two Case Analyses**

cus on many of the following steps and comparisons:

Term Clumping STEPS:

**j.** Topic Modeling

gy) from 1997 through 2012.

datasets:

**a.** Fuzzy matching routines

**b.** Thesauri to reduce common terms

**c.** Human-aided and topic tailored cleaning

**d.** Phrase consolidation macro (different lengths)

**f.** Combine term networks (parent-child) macro

**h.** Term normalization vs. parent database samples

**e.** Pruning of extremely high and low frequency terms

**g.** g.TFIDF (Term Frequency Inverse Document Frequency)

**i.** PCA variations to generate high, medium, and low frequency factors

**k.** Quality assessment of the resulting factors – comparing expert and statistical means

We are running multiple empirical comparisons. Here we compare results on two topical

"MOT" (for Management of Technology) – 5169 records covering abstract records of the PICMET (Portland International Conference on Management of Engineering and Technolo‐

Latent Semantic Indexing (LSI), or Latent Semantic Analysis, is a classical indexing method based on a Vector Space Model that introduces Singular-Value Decomposition (SVD) to uncover the underlying semantic structure in the text set. The key feature of LSI is to map terms that occur in similar contexts into a smaller "semantic space" and to help determine the relationships among terms (synonymy and polysemy) [17, 29, 30]. When applied to co-occurrence information for large text sources, LSI has no need to import domain literatures or thesauri (what we call "purposive" or aided text clumping). There are also various extended LSI methods [31]. Researchers are combining LSI with term clumping variations in order to relate synonymous terms from massive content. For example, Maletic and Marcus combine semantic and structural information [32], and Xu et al. seek to associate genes based on text mining of abstracts [30].

Topic modeling is a suite of algorithms that automatically uncovers topical themes from a collection of documents [33, 34]. This stream of research begins with Latent Dirichlet Allocation (LDA), which remains the basic algorithm. Topic modeling is an extended LSI method that treats association probabilistically. Various topic modeling algorithms extend the basic approach, for example [35-44]. Topic modeling is being applied in many contexts – e.g., NLP extension, sentiment analysis, and topic detection.

We are pursuing topic modeling in conjunction with our text clumping development in several ways. We are experimenting to assess whether and which term clumping steps can refine term or phrase sets as input into topic modeling to enhance generation of meaningful topics. We also compare topic modeling outputs to alternative processes, especially PCA performed on clumped phrases. We additionally want to assess whether some form of text clumping can be applied after topic modeling to enhance topic interpretability.

We have also tried, but are not actively pursuing, Key Graph, a kind of visualization technique that treats the documents as a building constructed by a series of ideas, then retrieves these ideas and posts them as a summary of the original points based on the segmentation of a graph [45-47]. Usually, Key Graph has 3 major components: (1) Foundations, which are subgraphs of highly associated and frequent terms; (2) Roofs, which are terms highly related to the foundations; and (3) Columns, which are keywords representing the relationships between foundations and roofs.

We are especially interested in term grouping algorithms to refine large phrase sets through a sequence of steps. These typically begin with noise removal and basic cleaning, and end with some form of clustering of the resulting phrases (e.g., PCA). "In between," we are applying several intermediate-stage term consolidation tools. Kongthon has pursued an object-oriented association rule mining approach [48], with a "concept grouping" routine [49] and a tree-structured network algorithm that associates text parent-child and sibling relationships [50].

Courseault-Trumbach devised a routine to consolidate related phrases, particularly of different term lengths, based on term commonality [51]. Webb Myers developed another routine to combine authors. The notion was that, say, we have three papers authored by X. Perhaps two of those are co-authored with Y, and one with Z; and Y and Z never appear as authors on another paper without X. In that case, the operation surmises that Y and Z are likely junior authors, and eliminates them so that further author analyses can focus on the senior authors or author team. The macro [available at www.theVantagePoint.com] adds major co-authors into the term name. We incorporate these two routines in the present exercises.
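The co-author consolidation logic can be sketched as follows. The VantagePoint macro's actual rules are not published, so this strict-subset heuristic and the toy paper set are only illustrative.

```python
# Hedged sketch of the co-author consolidation idea: fold each author whose
# papers are a strict subset of another author's papers into that
# (presumably senior) author's name.
from collections import defaultdict

papers = [{"X", "Y"}, {"X", "Z"}, {"X"}, {"W"}]   # hypothetical author lists

author_papers = defaultdict(set)
for i, authors in enumerate(papers):
    for a in authors:
        author_papers[a].add(i)

def consolidate_authors(author_papers):
    """Merge likely junior authors into their senior author's term name."""
    juniors_of = {}
    for a, pa in author_papers.items():
        senior = None
        for b, pb in author_papers.items():
            if b != a and pa < pb:        # a never appears without b
                senior = b
        if senior is None:
            juniors_of.setdefault(a, set())
        else:
            juniors_of.setdefault(senior, set()).add(a)
    return {
        (s + " & " + " & ".join(sorted(js)) if js else s): author_papers[s]
        for s, js in juniors_of.items()
    }

print(consolidate_authors(author_papers))
```

Here Y and Z, whose papers are strict subsets of X's, are folded into the name "X & Y & Z", mirroring the "&"-joining behavior of the macro described above.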

Lastly, we consider various quality assessment approaches. Given that one generates clustered text in various forms, which are best? We look toward three approaches. First, we want to ask the target users. While appealing, this also confronts issues – e.g., our PCA output "names" the resulting factors, whereas topic modeling does not. How can we compare these even-handedly? Second are statistical approaches that measure some form of the degree of coherence within clusters vs. among clusters [52]. Third are record assignment tests – to what extent do alternative text clumping and clustering sequences correctly distinguish mixed dataset components? Here we seek both high recall and precision.
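The second, statistical option can be sketched as a within-cluster vs. between-cluster cosine comparison; the two toy clusters and this particular coherence measure are assumptions for illustration, not the measures of [52].

```python
# Compare average cosine similarity within clusters vs. between clusters.
import numpy as np

rng = np.random.default_rng(0)
# two toy "clusters" of term vectors, separated along different axes
cluster_a = rng.normal(0.0, 0.1, size=(20, 5)) + np.array([1, 0, 0, 0, 0])
cluster_b = rng.normal(0.0, 0.1, size=(20, 5)) + np.array([0, 1, 0, 0, 0])

def mean_cosine(u, v):
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return float((u @ v.T).mean())

within = (mean_cosine(cluster_a, cluster_a) + mean_cosine(cluster_b, cluster_b)) / 2
between = mean_cosine(cluster_a, cluster_b)
print(f"within={within:.2f} between={between:.2f}")  # coherent: within >> between
```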

## **3. Empirical Investigation: Two Case Analyses**

Figure 1 arrays a wide range of possible term clumping actions. As introduced in the previous sections, we are interested in many of those, but within the scope of this chapter we focus on many of the following steps and comparisons:

Term Clumping STEPS:

**a.** Fuzzy matching routines

**b.** Thesauri to reduce common terms

**c.** Human-aided and topic-tailored cleaning

**d.** Phrase consolidation macro (different lengths)

**e.** Pruning of extremely high and low frequency terms

**f.** Combine term networks (parent-child) macro

**g.** TFIDF (Term Frequency Inverse Document Frequency)

**h.** Term normalization vs. parent database samples

**i.** PCA variations to generate high, medium, and low frequency factors

**j.** Topic Modeling

**k.** Quality assessment of the resulting factors – comparing expert and statistical means
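In code, the front end of such a pipeline might look like the following sketch; the stopword list and matching rules are toy stand-ins for the thesauri and macros discussed in this chapter, not the actual routines.

```python
# Skeleton of the early term clumping steps (a, b, e); each stage is a
# placeholder for the real routines, kept deliberately crude.
from collections import Counter

STOPWORDS = {"the", "of", "and"}            # stand-in for the stopwords thesaurus

def fuzzy_match(phrases):                   # step a: fold trivial variants
    return [p.rstrip("s") for p in phrases] # crude singular/plural merge

def remove_common(phrases):                 # step b: thesauri of common terms
    return [p for p in phrases if p not in STOPWORDS]

def prune(counts, min_records=2):           # step e: drop single-record phrases
    return {p: n for p, n in counts.items() if n >= min_records}

phrases = ["solar cells", "solar cell", "the", "electrode", "electrode"]
counts = Counter(remove_common(fuzzy_match(phrases)))
print(prune(counts))  # {'solar cell': 2, 'electrode': 2}
```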


56 Theory and Applications for Advanced Text Mining



We are running multiple empirical comparisons. Here we compare results on two topical datasets:

"MOT" (for Management of Technology) – 5169 records covering abstract records of the PICMET (Portland International Conference on Management of Engineering and Technology) from 1997 through 2012.

"DSSCs" (for Dye-Sensitized Solar Cells) – 5784 abstract records compiled from searches for 2001-2010 in WoS and in EI Compendex, merged in VantagePoint.

| Term Clumping Steps | MOT | DSSCs |
|---|---|---|
| Field selection | Title&Abstract NLP phrases | Title&Abstract NLP phrases + keywords |
| Records | 5169 PICMET records | 5784 records (WoS+Compendex), 2001-2010 |
| Phrases with which we begin | 86014 | 90980 |
| a-1) Apply general.fuz routine | 76398 | Applied 10th, reducing 82701 to 74263 |
| b-1) Apply stopwords thesaurus | 76105 | Applied 1st, reducing 90980 to 89576; and applied 7th, reducing 85960 to 84511 |
| b-2) Apply common academic/scientific terms thesaurus | 73232 | Applied 2d, reducing 89576 to 89403; and applied 8th, reducing 84511 to 82739 |
| b-3) Multiple tailored cleaning routines – trash term remover.the; topic variations consolidator.the; DSSC data fuzzy matcher routine | | Applied such actions as 3d-6th steps, reducing 89403 to 85960; applied 9th, reducing 82739 to 82701 |
| a-2) Apply general-85cutoff-95fuzzywordmatch-1exact.fuz | 69677 | Applied 11th, reducing 74263 to 65379 |
| d) Apply phrase consolidation macro (different lengths) | 68062 | Applied 4th, reducing 89355 to 86410 |
| e) Prune (remove phrases appearing in only 1 record) | 13089 | Applied 12th, reducing 65379 to 23311 |
| c-1) Apply human-aided and general.fuz results.the\* | | Applied 13th, reducing 23311 to 21645 |
| c-2) Manual noise screens (e.g., copyrights, stand-alone numbers) | | Applied 14th, reducing 21645 to 20172 |
| f) Apply combine term networks (parent-child) macro | 10513 | Applied 15th, reducing 20172 to 8181 |
| g) Apply TFIDF | 1999 | Applied 16th, reducing 8181 to 2008 |
| i) Auto-PCA: highest frequency; 2d highest; 3d highest | 201, 256, 299 | 203; 214; 230 |
| PCA factors | 9 factors (only top tier) | 12 (only top tier) |
| c-3) Tuned phrases to 7164; reviewed 15 factors from 204 top phrases; reran to get final PCA | | |

**Table 1.** Term Clumping Stepwise Results

Elsewhere, we elaborate on these analyses in various ways. Substantive interpretations of the topical MOT thrusts based on the human-selected MOT terms are examined over time and regions [55]. Comparisons of three MOT analyses – 1) 3-tier, semi-automatic PCA extraction, 2) PCA based on human-selected MOT terms, and 3) Topic Modeling of unigrams – found notably different factors extracted. Human quality assessment did not yield a clear favorite, but the Topic Modeling results edged ahead of the different PCAs [7]. Additional explorations of the WoS DSSC data appear in [6], comparing Topic Modeling and term clumping-to-PCA – finding quite different emphases in the extracted factors. Zhang et al. [54] track through a similar sequence of term clumping steps on the combined WoS-Compendex DSSC dataset.

Here, we focus on stepping through most of the term clumping operations for these two cases. To avoid undue complexity, we set aside data variations (e.g., stepping through for the WoS DSSC set alone), Topic Modeling comparisons, and quality assessment. As noted, we have done one version of human assessment for the MOT data [7]. We are pursuing additional quality assessments via statistical measures [52] and by comparing how well the alternative analytics are able to separate out record sets from a combination of 7 searches. We also intend to pursue Step h – term normalization based on external (e.g., entire database) frequencies. So, here we treat Steps a-g and i, not Steps h, j, or k.

Table 1 provides the stepwise tally of phrases in the merged topical fields undergoing term clumping. It is difficult to balance precision with clarity, so we hope this succeeds. The first column indicates which text analysis action was taken, corresponding to the list of steps just above. The second column shows the results of those actions applied in sequence on the MOT data. Blank cells indicate that particular action was not performed on the MOT (or DSSC) dataset. The last row notes additional human-informed analyses done on the MOT data, but not treated here (to recognize that this is a selective presentation). The third column relates the results of application of the steps to the DSSC data, but here we indicate sequence within the column, also showing the resulting term reduction. [So, the Table shows the Term Clumping Steps in the order performed on MOT; this was arbitrary. It could as well have been ordered by the list (above) or in the order done for DSSC data.]





\* A compilation of phrase variations that VantagePoint's "List Cleanup" routine suggested combining [e.g., various singular and plural variations; hyphenation variations; and similar phrases such as "nanostructured TiO2 films" with "nanostructured TiO2 thin films"].

Some steps are broken out in more detail – e.g., Step a – fuzzy matching routines – is split into use of VantagePoint's general matching routine (a-1) and application of a variant tuned for this term clumping (a-2). Note also that some steps appear more than once, especially for the DSSC clumping.
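A rough analogue of such fuzzy matching using Python's standard difflib; VantagePoint's general.fuz routine is proprietary, so the 0.85 cutoff and the greedy grouping rule here are assumptions for illustration.

```python
# Greedy fuzzy phrase grouping with difflib's similarity ratio.
from difflib import SequenceMatcher

def fuzzy_group(phrases, cutoff=0.85):
    """Fold each phrase into the first earlier phrase it matches closely."""
    groups = {}
    for p in phrases:
        for rep in groups:
            if SequenceMatcher(None, p, rep).ratio() >= cutoff:
                groups[rep].append(p)
                break
        else:
            groups[p] = [p]
    return groups

phrases = ["dye-sensitized solar cell", "dye sensitized solar cells", "electrode"]
print(fuzzy_group(phrases))
```

The two "dye(-)sensitized" variants fold into one group, while "electrode" stays separate.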

For Step b – application of thesauri to remove common terms – we distinguish the use of a modest-size stopwords thesaurus (fewer than 300 words) as Step b-1 and the application of the 48,000-term thesaurus of common academic/scientific terms as Step b-2.
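Applying the two thesauri can be sketched as a filter over the phrase list; the word lists below are tiny hypothetical stand-ins for the ~300-word stopword and 48,000-term academic thesauri.

```python
# Two-tier common-term removal: strip stopwords inside phrases, then drop
# phrases consisting entirely of generic academic terms.
STOPWORDS = {"the", "of", "a", "in"}
ACADEMIC = {"method", "results", "paper", "study", "approach"}

def apply_thesauri(phrases):
    kept = []
    for p in phrases:
        words = [w for w in p.lower().split() if w not in STOPWORDS]
        if words and not all(w in ACADEMIC for w in words):
            kept.append(" ".join(words))
    return kept

print(apply_thesauri(["the results", "study of solar cells", "a method"]))
# → ['study solar cells']
```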


Step c – human-aided and topic-tailored cleaning (Steps c-1, c-2 & c-3) – groups a variety of "obvious" cleaning routines. Our dilemma is whether to eliminate these, to facilitate development of semi-automated routines, or to include them, for easy improvement of the term consolidation. In the MOT term clumping reported in Table 1, we essentially avoid such cleaning. In the DSSC step-through, we include limited iterations of human-aided cleaning to see whether this makes a qualitative difference by the time the progression of steps is completed. [It does not seem to do so.]

Step d -- Phrase consolidation macro – consolidates only a modest percentage of the phrases (as applied here, reducing the phrase count by 2.3% for MOT and by 3.3% for DSSCs), but the improvements appear worthwhile. For instance, combining "Dye-Sensitized Solar Cells" with "Sensitized Solar Cells" can provide important conceptual concentration.
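A containment-based sketch of this consolidation idea; the real macro's matching rules, which handle different term lengths more carefully, are not reproduced here.

```python
# Fold longer phrases into shorter phrases they contain, tallying records.
def consolidate(phrases):
    counts = {}
    # shortest phrases first, so longer variants fold into them
    for p in sorted(phrases, key=len):
        for rep in counts:
            if rep in p:          # e.g. "sensitized solar cells" within the longer phrase
                counts[rep] += 1
                break
        else:
            counts[p] = 1
    return counts

phrases = ["sensitized solar cells", "dye-sensitized solar cells", "electrode"]
print(consolidate(phrases))  # {'electrode': 1, 'sensitized solar cells': 2}
```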

Step e – pruning – is simply discarding the phrases that appear in only one record. Those would not add to co-occurrence based analyses. The challenge is to sequence pruning after consolidation so that potentially useful topical information is not discarded. Pruning is the overwhelmingly potent step in reducing the term or phrase counts. For MOT, it effects a reduction of 81%; for DSSCs, 64%.
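These percentages can be checked against the pruning row's counts in Table 1:

```python
# Verifying the quoted pruning reductions from Table 1's phrase counts
mot_before, mot_after = 68062, 13089
dssc_before, dssc_after = 65379, 23311
mot_cut = round(100 * (1 - mot_after / mot_before))
dssc_cut = round(100 * (1 - dssc_after / dssc_before))
print(mot_cut, dssc_cut)  # 81 64
```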

Step f – combine term networks (parent-child) – appears a powerful reducer. As discussed, Webb Myers devised this macro to consolidate author sets. We apply the macro to the phrases field, showing sizable reductions for MOT (19.7%) and DSSCs (59.4%). The macro will combine major co-occurring terms in the new phrase name with a "&" between them. It also results in terms that appear in a single record being combined into a single phrase [hence, we perform the Pruning step prior to applying this macro].

Step g – TFIDF – strives to distinguish terms that provide specificity within the sample set. For example, if some form of "DSSC" appears in nearly every DSSC record, this would not be a high-value term in distinguishing patterns within the dataset. VantagePoint offers three TFIDF routines – A) un-normalized, B) log, and C) square root. We compared these and proceed with the square root term set for DSSCs, whose 2008 terms are all included in sets A or B. Of the 2008 phrases, 1915 are in both A and B (so differences in this regard are small), with 42 in set A and 51 in set B. For the MOT data, B and C yield the same 1999 terms, whereas A yields 2052. Inspection of the distinct terms finds the 78 terms only in sets B & C to appear more substantive than the 131 terms only in set A, so we opt for the 1999-term result.
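In common textbook form, the three weightings might look as follows; VantagePoint's exact formulas are not documented here, so these are assumptions for illustration.

```python
# Three TFIDF variants: un-normalized, log, and square-root term frequency.
import math

def tfidf(tf, df, n_docs, variant="sqrt"):
    idf = math.log(n_docs / df)
    if variant == "unnormalized":
        return tf * idf
    if variant == "log":
        return (1 + math.log(tf)) * idf
    return math.sqrt(tf) * idf           # square-root variant

# a term in nearly every record (like "DSSC") carries almost no weight...
ubiquitous = tfidf(tf=3, df=5160, n_docs=5169)
# ...while a more specific term scores much higher
specific = tfidf(tf=3, df=200, n_docs=5169)
print(round(ubiquitous, 3), round(specific, 3))
```

The variants differ only in how they damp raw term frequency; all share the same inverse document frequency factor, which is what pushes ubiquitous terms toward zero.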

Step h is included as a placeholder. On the one hand, Step b aims to remove generally common terms. On another, Step g favors more specific terms within the document set being analyzed. With access to full databases or general samples from sources such as WoS, one could sort toward terms or phrases that are relatively unique to the search set. We have not done that here.

At this stage, we have very large, but clumped, phrase sets. In our two cases, these consist of about 2000 phrases. Consider the illustrative "post-TFIDF" tabulations in Table 2. We believe these offer rich analytical possibilities. For instance, we could scan their introduction and frequency over time to identify "hot" topics in the field. Or, we could compare organizational emphases across these phrases to advance CTI interests. We might ask technical and/or business experts in the field to scan those 2000 phrases to identify particularly important or novel ones for in-depth analyses.
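The "hot topic" scan suggested above can be sketched as a per-year tally; the records and phrases below are hypothetical.

```python
# Tally how many records mention a phrase in each publication year.
from collections import Counter

records = [
    (2008, {"solid-state electrolyte"}),
    (2009, {"solid-state electrolyte", "counter electrode"}),
    (2010, {"solid-state electrolyte"}),
    (2010, {"counter electrode"}),
]

def yearly_counts(records, phrase):
    return Counter(year for year, phrases in records if phrase in phrases)

print(yearly_counts(records, "solid-state electrolyte"))
```

A phrase whose yearly counts climb steeply after its first appearance is a candidate "hot" topic.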

Steps i and j represent a major "last step" for these sorts of term analyses. Here we explore select PCA steps; elsewhere, as noted, we pursue Topic Modeling [6, 7]. This factoring (~clustering) step reduces thousands of phrases to tens of phrases. If done accurately, this can be game-changing in terms of opening conceptual insights into topical emphases in the field under study.

VantagePoint's PCA routine is now applied as Step i. In these cases we have tried to minimize human-aiding, but we explore that elsewhere [6, 7]. We select three top tiers of terms to be subjected to separate Principal Components Analysis. Such selection can be handled by various coverage rules – e.g., terms appearing in at least 1% of the records. In the present exercises, we set thresholds to provide approximately 200 phrases as input to each of three PCA analyses. We run the default requested number of factors to extract – this is the square root of the number of terms submitted. We review the resulting three sets of factors in terms of recall (record coverage) and determine to focus on just the top tier PCA results here. For DSSCs, the top-tier PCA yields 12 factors that cover 98% of the records, whereas the 2d tier factors cover 47% and the 3d tier only 18%. For the MOT analyses, results are comparable – the 9 top-tier factors cover 90% of the records; 2d tier, 36%; 3d tier, 17%. [We have performed additional analyses of these data, exploring various PCA factor sets, including ones in which we perform post-PCA term cleaning based on inspection of initial results, then rerun PCA. For instance, a very high frequency term might be removed, or simple relations handled by refining a thesaurus (e.g., in one result "Data Envelopment Analysis" and its acronym, DEA, constituted a factor).]
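A sketch of this step with scikit-learn's PCA: a random records-by-phrases matrix stands in for the real data, and the "coverage" threshold used below is an arbitrary assumption, not VantagePoint's recall computation.

```python
# PCA over a records-by-phrases occurrence matrix, extracting sqrt(#terms)
# factors as the text describes (toy data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_records, n_terms = 300, 64
X = (rng.random((n_records, n_terms)) < 0.1).astype(float)   # phrase occurrence

n_factors = int(np.sqrt(n_terms))       # default: square root of terms submitted
pca = PCA(n_components=n_factors)
scores = pca.fit_transform(X)

# "recall" here: share of records loading noticeably on at least one factor
covered = (np.abs(scores) > 0.5).any(axis=1).mean()
print(f"{n_factors} factors, coverage {covered:.0%}")
```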

Step j is of high interest, and we are exploring several alternative approaches, as mentioned. Here, we just present the high tier set of PCA factors for face validity checks.

## **4. Term Clumping Case Results**

Having stepped through multiple term clumping steps, what do we get? One has many reasonable choices as to which term clumping steps to apply, in what sequence. To get a feel for the gains, let's compare sample results at four Stages:

**1.** Initial phrase set

**2.** After the term clumping steps up to TFIDF

**3.** After TFIDF

**4.** After PCA



Referring to Figure 1, the Text Cleaning stage, in general, would be carried out in preparation for nearly all further analyses. We would not anticipate aborting that processing partway, except in special cases (e.g., as mentioned in Cunningham's analysis of British science titles). The next stage of consolidating the cleaned and, therefore, partly consolidated phrases, is where interesting choices arrive. Based on the analyses of the MOT and DSSC data, we note the significant effect of selecting the high TFIDF terms. We thus compare the phrase sets at Stage 1 (before cleaning and clumping), Stage 2 (before filtering to the top TFIDF terms), Stage 3 (after TFIDF), and Stage 4 (after applying one of the clustering family of techniques – PCA).

Table 3 shows the "Top 10" post-TFIDF terms and phrases, based on TFIDF scores. Recall that the 1999 terms and phrases at this Stage 3 are based on an arbitrary threshold – we sought about 2000. Note that term counts are unchanged for terms present in both Stages 2 & 3. TFIDF is not clumping, but rather screening based on occurrence patterns across the 5169 records.

Text Clumping for Technical Intelligence http://dx.doi.org/10.5772/50973 63

Top 10 # Records # Instances SqRt TFIDF value

Table 4 presents another 10-term sample pair for Stages 1 and 2. Here, we alphabetically sort the phrase lists and arbitrarily take the ten phrases beginning with "knowledge" or "knowl‐ edg" --i.e., a stem version of the term. Notice that the big consolidation is for the stemmed version of "knowledg," for which the record count has gone up a tiny amount (2), whereas the instance count has increased by 272. In general, the term clumping increases term fre‐

Table 5 presents the top-tier PCA analysis results. The phrases appearing here tend to be more topically specific than those seen as most frequent at Stages 2 and 3. Only two terms -- "competition" and "knowledge" -- happen to be approximately in common. These nine fac‐ tors pass a face validity check – they seem quite coherent and potentially meaningful to study of the MOT research arena. Naming of the factors is done by VantagePoint, using an algorithm that takes into account relative term loading on the factor and term commonalities

quencies and consolidates related terms pretty well (but by no means completely).

Stage 1 Sample #R #I Stage 2 Sample #R #I knowledge 412 750 knowledge 414 1022

Knowledge 414 1022 35.05 technology 475 1113 34.59 applicable 444 998 33.68 relationship 356 801 32.89 competition 303 699 32.57 innovation technology 200 527 32.42 case study 472 931 31.72 technology manager 241 526 30.54 R&D 191 446 30.25 Governance 248 517 29.99 developed country 179 406 29.43

Stage 3 - post-TFIDF

**Table 3.** Stage 3 – Top 10 MOT Phrases based on TFIDF

among phrases.


**Table 2.** Stages 1 & 2 – Top 10MOT Phrases

Considering the MOT data first, Table 2 compares the ten most frequent terms or phrases as of Stages 1 and 2.As per Table 1, the clumping and, especially single-record term pruning, has reduced from 86014 to 10513 phrases – an 88% reduction. Table 2 lists the highest fre‐ quency terms and phrases based on record coverage. For instance, study appears in 1177 of the 5169 records (23%). The Table also shows instances, and we see that study appears more than once in some records to give a total of 1874 instances. MOT is Management of Technol‐ ogy. That said, the terms and phrases after clumping are somewhat more substantive. As one scans down the Stage 2 set of 10513 phrases, this is even more the case. Our sense is that a topical expert reviewing these to tag a set of themes to be analyzed (e.g., to track trends, or institutional emphases) would definitely prefer the clumped to the raw phrases.

In Tables 2-5, we show in how many of the full sample of MOT and DSSC records the particular terms appear. We also show instances (i.e., some terms appear more than once in a record). These just convey the changes in coverage resulting from the various clumping operations applied.
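The two counts can be reproduced directly from tokenized records; a minimal sketch (the corpus here is an invented toy, not the MOT/DSSC data):

```python
def coverage(records, term):
    """Return (#Records, #Instances) for a term.

    #Records   = number of records in which the term appears at all;
    #Instances = total number of occurrences across all records.
    """
    n_records = sum(1 for rec in records if term in rec)
    n_instances = sum(rec.count(term) for rec in records)
    return n_records, n_instances

# Toy corpus: three tokenized abstracts.
records = [["study", "results", "study"],
           ["study", "analysis"],
           ["innovation", "process"]]

print(coverage(records, "study"))  # → (2, 3): 2 records, 3 instances
```

The gap between the two numbers is what the text points out for "study" (1177 records vs. 1874 instances).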

Table 3 shows the "Top 10" post-TFIDF terms and phrases, based on TFIDF scores. Recall that the 1999 terms and phrases at this Stage 3 are based on an arbitrary threshold – we sought about 2000. Note that term counts are unchanged for terms present in both Stages 2 & 3. TFIDF is not clumping, but rather screening based on occurrence patterns across the 5169 records.
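Such a screen can be sketched as follows. This is one common TFIDF variant; the authors' exact scoring formula and cut-off are not given here, so the details below are assumptions:

```python
import math
from collections import Counter

def tfidf_screen(records, keep):
    """Rank terms by total-TF * IDF over the record set; keep the top `keep`.

    Terms occurring in every record get zero IDF weight and are dropped,
    which is how TFIDF screens out ubiquitous, uninformative terms.
    """
    n = len(records)
    tf = Counter()   # total instances per term
    df = Counter()   # number of records containing the term
    for rec in records:
        tf.update(rec)
        df.update(set(rec))
    scores = {t: tf[t] * math.log(n / df[t]) for t in tf if df[t] < n}
    return sorted(scores, key=scores.get, reverse=True)[:keep]

records = [["solar", "cell", "dye"], ["solar", "electrolyte"],
           ["dye", "electrolyte", "dye"], ["solar", "cell"]]
print(tfidf_screen(records, keep=2))
```

In the real run the threshold was chosen to retain roughly 2000 of the Stage 2 phrases.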



Referring to Figure 1, the Text Cleaning stage, in general, would be carried out in preparation for nearly all further analyses. We would not anticipate aborting that processing partway, except in special cases (e.g., as mentioned in Cunningham's analysis of British science titles). The next stage, consolidating the cleaned and, therefore, partly consolidated phrases, is where interesting choices arrive. Based on the analyses of the MOT and DSSC data, we note the significant effect of selecting the high TFIDF terms. We thus compare the phrase sets at Stage 1 (before cleaning and clumping), Stage 2 (before filtering to the top TFIDF terms), Stage 3 (after TFIDF), and Stage 4 (after applying one of the clustering family of techniques – PCA).

| Stage 1 – Initial | # Records | # Instances | Stage 2 – Clumped | # Records | # Instances |
|---|---|---|---|---|---|
| study | 1177 | 1874 | technology | 475 | 1113 |
| results | 894 | 1177 | case study | 472 | 931 |
| research | 792 | 1050 | applicable | 444 | 998 |
| development | 603 | 829 | knowledge | 414 | 1022 |
| analysis | 518 | 690 | relationship | 356 | 801 |
| One | 494 | 574 | competition | 303 | 699 |
| innovation | 465 | 800 | governance | 248 | 517 |
| knowledge | 412 | 750 | technology manager | 241 | 526 |
| process | 400 | 506 | literature | 227 | 344 |
| industry | 399 | 637 | implication | 221 | 327 |

62 Theory and Applications for Advanced Text Mining Text Mining

**Table 2.** Stages 1 & 2 – Top 10 MOT Phrases



Table 4 presents another 10-term sample pair for Stages 1 and 2. Here, we alphabetically sort the phrase lists and arbitrarily take the ten phrases beginning with "knowledge" or "knowledg" – i.e., a stem version of the term. Notice that the big consolidation is for the stemmed version of "knowledg," for which the record count has gone up a tiny amount (2), whereas the instance count has increased by 272. In general, the term clumping increases term frequencies and consolidates related terms pretty well (but by no means completely).
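The kind of stem-based consolidation behind Table 4 can be imitated with a crude suffix-stripping rule. The production work relied on VantagePoint's list cleanup rather than this toy stemmer, so treat it purely as an illustration:

```python
from collections import defaultdict

def crude_stem(term):
    # Toy rule: drop a plural "s", then a trailing "e",
    # so "knowledges" -> "knowledge" -> "knowledg".
    for suffix in ("s", "e"):
        if term.endswith(suffix):
            term = term[:-1]
    return term

def clump_instances(instance_counts):
    """Merge term variants that share a stem, summing instance counts."""
    merged = defaultdict(int)
    for term, n in instance_counts.items():
        merged[crude_stem(term)] += n
    return dict(merged)

print(clump_instances({"knowledge": 750, "knowledges": 3,
                       "managers": 4, "manager": 6}))
```

Real stemmers (Porter, Snowball) handle far more suffix patterns; the point is only that variants collapse onto a shared stem, raising its instance count.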

Table 5 presents the top-tier PCA analysis results. The phrases appearing here tend to be more topically specific than those seen as most frequent at Stages 2 and 3. Only two terms – "competition" and "knowledge" – happen to be approximately in common. These nine factors pass a face validity check – they seem quite coherent and potentially meaningful for study of the MOT research arena. Naming of the factors is done by VantagePoint, using an algorithm that takes into account relative term loading on the factor and term commonalities among phrases.



Text Clumping for Technical Intelligence – http://dx.doi.org/10.5772/50973

| Principal Component (Factor) | High Loading Phrases |
|---|---|
| Competing Technologies | Competition; Technology Capability; Global Competition; Competing Technologies |
| Technology Roadmap | Roadmap; Technology Roadmap |
| Innovation Process | Innovation Process; Innovation Activity; Open Innovation |
| Knowledge | Knowledge; Knowledge Manager; Knowledge Creation; New Knowledge; Share Knowledge |
| Project Success | Project Manager; Project Success |
| Make Decisions | Make Decisions; Decision Making Process |
| Communication Technology | ICT; Communication Technology; Individual |
| Managing Supply Chain | Managing Supply Chain; Supply Chain |
| Nanotechnology | Nanotechnology; Commercial; Capability |

**Table 5.** Stage 4 – Top Tier MOT PCA Factors and Constituent Phrases

As mentioned, we have done additional analyses of these data. In another PCA, starting with the 10513 terms (pre-TFIDF), we extracted a high frequency term set (112 terms or phrases appearing in 50-475 records). In addition, we extracted a second-tier PCA based on 185 terms appearing in 25-49 records, and a third-tier PCA from 763 terms in 10-24 records. Each set was run using VantagePoint default settings for number of factors, yielding, respectively, 7, 9, and 16 factors. Of the present 9 top-tier factors, 3 show clear correspondence to either top or second-tier factors in the 10513-term based PCA; one shows partial correspondence; 5 are quite distinct. Which factor sets are better? Impressionistically, the 9 post-TFIDF factors seem reasonable and somewhat superior, but lacking some of the specificity of the (7 + 9 + 16 = 32) factors. As noted, we don't pursue the corresponding post-TFIDF PCA second and third tier factors because their record coverage is low.
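The tiered selection feeding each PCA – slicing the term list by record counts before factoring – can be sketched as follows (the tier bounds come from the text above; the PCA itself is left to the analyst's toolkit):

```python
def frequency_tiers(record_counts, tiers=((50, 475), (25, 49), (10, 24))):
    """Split terms into frequency tiers by #Records, one list per tier.

    record_counts: {term: number of records containing the term}.
    The default bounds mirror the top, second, and third tiers above.
    """
    return [
        sorted(t for t, n in record_counts.items() if lo <= n <= hi)
        for lo, hi in tiers
    ]

# Illustrative counts only, not the actual MOT data.
counts = {"technology": 475, "knowledge": 414, "governance": 248,
          "literature": 30, "roadmap": 12, "rare term": 2}
top, second, third = frequency_tiers(counts)
```

Terms below the third-tier floor (here, "rare term") fall out entirely, which is why the lower tiers cover few records.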

Examination of DSSC phrase sets shows generally similar progressions as term clumping proceeds. In some respects, results are even more satisfactory with that more technical terminology. In the interest of space, we don't present tables like Tables 2-4 here. But here's a synopsis of one fruitful topical concentration within the DSSC phrase list:





| Stage 1 Sample | #R | #I | Stage 2 Sample | #R | #I |
|---|---|---|---|---|---|
| knowledge | 412 | 750 | knowledge | 414 | 1022 |
| knowledge absorption ability KAA | 1 | 1 | knowledge acquisition | 6 | 11 |
| knowledge access | 1 | 1 | knowledge age | 4 | 10 |
| knowledge accumulated | 1 | 1 | knowledge asset | 4 | 5 |
| knowledge accumulation | 4 | 8 | knowledge base | 14 | 17 |
| knowledge accumulation model | 1 | 2 | knowledge based competencies | 2 | 3 |
| knowledge acquisition | 6 | 11 | knowledge based economy | 21 | 28 |
| knowledge acquisition KA | 1 | 1 | knowledge based organizational strategy | 2 | 4 |
| knowledge acquisition strategies | 1 | 4 | knowledge based perspective | 3 | 4 |
| knowledge across different sectors | 1 | 1 | Knowledge Based Product Models | 2 | 2 |



**Table 4.** Stages 1 & 2 – 10 Sample MOT Phrases


Note: #R = # of Records; #I = # of Instances

**•** In the initial 90980 term list, there are 807 terms on "electron/electrons/electronic"

**•** In the 8181 term list, there are 119 terms on this

**•** In the 2008 term list, there are 40 terms remaining, such as "electron acceptor," "electron diffusion," "electron injection," "electron transfer," "electronic structure," etc.

Table 6 shows the twelve top-tier DSSC PCA factors and the phrases that load highly on those factors. These results pass a face validity test in that the grouping of terms seems generally sensible. These factors appear to be reasonable candidates for thematic analysis of this solar cell research & development activity.



| Principal Component (Factor) | High Loading Phrases |
|---|---|
| Open Circuit Voltage | Open Circuit Voltage; Fill Factor |
| Nanotube | Nanotube; Nanotube TiO2 |
| Electrochemical Impedance Spectroscopy | Electrochemical Impedance Spectroscopy; Anode; Electrochemical Corrosion; Ion Exchange |
| Polymer Electrolyte | Electrolyte; Polym Ionic Liquid; Polymer Electrolyte; Gel Electrolyte; Liquid Ionic Conduction; Gel Polymer Electrolyte; Electrolysis; Gelator; Poly electrolyte; Temperature Molten-Salt |
| Conduction Band | Electron Injection; Conduction Band; Mobile Electrons; Density Functional Theory; Back Reaction |
| Coumarin Dye | Organic Dye; Coumarin Dye |
| Solar Equipment | Photo Electrochemical cell; Efficient Conversion Solar Energy; Solar Equipment; Redox Reaction |
| Material Nanostructure | Material Nanostructure |
| Scanning Electron Microscopy | Scanning Electron Microscopy; Electron Microscopy; X-ray Diffraction; X-ray Diffraction Analysis; Transmission Electron Microscopy; X-ray Photoelectron Spectroscopy |
| Electron Transport | Electron Transport |
| ZnO | ZnO; Semiconducting zinc compounds; Nanowire; Nanorod |
| Sol Gel Process | Sol Gel; Gel-Sol Method |

**Table 6.** Stage 4 – Top Tier DSSC PCA Factors and Constituent Phrases

## **5. Discussion**

Recent attention to themes like "Big Data" and "MoneyBall" draws attention to the potential in deriving usable intelligence from information resources. We have noted the potential for transformative gains, and some potential unintended consequences, of exploiting information resources [53]. Term clumping, as presented here, offers an important tool set to help move toward real improvements in identifying, tracking, and forecasting emerging technologies and their potential applications.

Desirable features in such text analytics include:

**•** Transparency of actions – not black box

**•** Evaluation opportunities – we see value in comparing routines on datasets to ascertain what works better; we recognize that no one sequence of operations will be ideal for all







Phrase consolidation advantages stand out in one DSSC example. Starting with some 2000 terms relating to variations of titanium dioxide (e.g., TiO2, TiO2 film), we reduce to 4 such terms, with the "combine term networks" (Step f) particularly helpful.
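The "combine term networks" step is a VantagePoint macro whose internals we can only sketch by analogy: given pairwise equivalence judgments between term variants, a union-find pass yields the consolidated groups. The pairs below are invented examples, not output from the macro:

```python
def merge_term_networks(equivalent_pairs):
    """Group terms linked by pairwise equivalence edges (union-find)."""
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in equivalent_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for t in list(parent):
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())

pairs = [("TiO2", "titanium dioxide"), ("TiO2 film", "TiO2")]
print(merge_term_networks(pairs))  # one group of three linked variants
```

Because equivalence is transitive here, two overlapping pairs are enough to merge all three variants into a single consolidated term group.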

We are pointing toward generation of a macro that would present the analyst with options as to which cleaning and clumping steps to run, in what order; however, we also hope to come up with a default routine that works well to consolidate topical terms and phrases for further analyses.

Some future research interests have been noted in conjunction with the list of steps, of which we are actively working on Steps h, j, and k. We are particularly interested in processing unigrams, because of the potential in such approaches to work with multiple languages. On the other hand, we appreciate the value of phrases to convey thematic structure. Possibilities include processing single words, through a sequence of steps to Topic Modeling, and then trying to associate related phrases to help capture the thrust of each topic. We see potential use of clumped terms and phrases in various text analyses. To mention two relating to competitive technical intelligence (CTI) and Future-oriented Technology Analyses (FTA):

## **References**

[1] VantagePoint. www.theVantagePoint.com, (accessed 20 May 2012).

[2] Porter, A.L., & Cunningham, S.W. (2005). *Tech Mining: Exploiting New Technologies for Competitive Advantage*, New York, Wiley.

[3] Kim, Y., Tian, Y., Jeong, Y., Ryu, J., & Myaeng, S. (2009). Automatic Discovery of Technology Trends from Patent Text. In *Proceedings of the 2009 ACM Symposium on Applied Computing, ACMSAC2009, 9-12 March 2009, Hawaii, USA*.

[4] Verbitsky, M. (2004). Semantic TRIZ. *The TRIZ Journal*, Feb., http://www.triz-journal.com/archives/2004/, (accessed 20 May 2012).

[5] Porter, A. L., Zhang, Y., & Newman, N. C. (2012). Tech Mining to Identify Topical Emergence in Management of Technology. *The International Conference on Innovative Methods for Innovation Management and Policy, IM2012, 23-26 May 2012, Beijing, China*.

[6] Newman, N. C., Porter, A. L., Newman, D., Courseault-Trumbach, C., & Bolan, S. D. (2012). Comparing Methods to Extract Technical Content for Technological Intelligence. *Portland International Conference on Management of Engineering and Technology, PICMET2012, 29 July-2 August, Vancouver, Canada*.

[7] Porter, A. L., Newman, D., & Newman, N. C. (2012). Text Mining to identify topical emergence: Case study on 'Management of Technology'. *The 17th International Conference on Science and Technology Indicators, STI2012, 5-8 September, Montreal, Canada*.

[8] Cunningham, S.W. (1996). The Content Evaluation of British Scientific Research. D.Phil. Thesis, Science Policy Research Unit, University of Sussex, Brighton, United Kingdom.

[9] Haywood, S. Academic Vocabulary. Nottingham University, http://www.nottingham.ac.uk/~alzsh3/acvocab/wordlists.htm, (accessed 26 May 2012).

[10] Price, D.S. (1986). *Little science, big science and beyond*, New York, Columbia University Press.

**•** Combining empirical with expert analyses is highly desirable in CTI and FTA – clumped phrases can be further screened to provide digestible input for expert review to point out key topics and technologies for further scrutiny.

**•** Clumped phrases and/or PCA factors can provide appropriate level content for Technology RoadMapping (TRM) – for instance, to be located on a temporal plot.

We recognize considerable interplay among text content types as well. This poses various cleaning issues in conjunction with co-occurrence of topical terms with time periods, authors, organizations, and class codes. We look forward to exploring ways to use clumped terms and phrases to generate valuable CTI.

## **Key Acronyms:**

CTI - Competitive Technical Intelligence

DSSCs - Dye-Sensitized Solar Cells [one of two topical test sets]

LSI - Latent Semantic Indexing

MOT - Management of Technology [the second of two topical test sets]

NLP - Natural Language Processing

PCA - Principal Components Analysis

ST&I - Science, Technology & Innovation

TM - Topic Modeling

WoS - Web of Science (including Science Citation Index)

## **6. Acknowledgements**

We acknowledge support from the US National Science Foundation (Award #1064146 – "Revealing Innovation Pathways: Hybrid Science Maps for Technology Assessment and Foresight"). The findings and observations contained in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation.

We thank David J. Schoeneck for devising groundrules for a semi-automated, 3-tier PCA and Webb Myers for the macro to combine term networks. Nils Newman has contributed pivotal ideas as we build our term clumping capabilities and determine how to deploy them.

## **Author details**



Alan L. Porter1\* and Yi Zhang2

\*Address all correspondence to: alan.porter@isye.gatech.edu

1 Search Technology, Inc., Norcross, Georgia, USA, and Technology Policy & Assessment Center, Georgia Tech, Atlanta, Georgia, USA

2 School of Management and Economics, Beijing Institute of Technology, Beijing, China

## **References**


[11] Garfield, E., Malin, M., & Small, H. (1978). Citation Data as Science Indicators. In Y. Elkana, et al. (Eds.), *The Metric of Science: The Advent of Science Indicators*, New York, Wiley.

[12] Van Raan, A. F. J. (1992). Advanced Bibliometric Methods to Assess Research Performance and Scientific Development: Basic Principles and Recent Practical Applications. *Research Evaluation*, 3(3), 151-166.

[13] De Bellis, N. (2009). *Bibliometrics and Citation Analysis*, Lanham, MD, The Scarecrow Press.

[14] Porter, A.L., & Detampel, M.J. (1995). Technology opportunity analysis. *Technol. Forecast. Soc. Change*, 49, 237-255.

[15] Watts, R.J., Porter, A.L., Cunningham, S.W., & Zhu, D. (1997). TOAS intelligence mining, an analysis of NLP and computational linguistics. *Lecture Notes in Computer Science*, 1263, 323-334.

[16] Zhu, D., & Porter, A.L. (2002). Automated extraction and visualization of information for technological intelligence and forecasting. *Technol. Forecast. Soc. Change*, 69, 495-506.

[17] Losiewicz, P., Oard, D.W., & Kostoff, R.N. (2000). Textual data mining to support science and technology management. *Journal of Intelligent Information Systems*, 15(2), 99-119.

[18] Robinson, D.K.R., Huang, L., Guo, Y., & Porter, A.L. Forecasting Innovation Pathways for New and Emerging Science & Technologies. *Technological Forecasting & Social Change*.

[19] Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R. (1990). Indexing by latent semantic analysis. *Journal of the American Society for Information Science*, 41, 391-407.

[20] Fodor, I.K. (2002). A survey of dimension reduction techniques. U.S. Department of Energy, Lawrence Livermore National Lab., 9 May 2002, https://e-reports-ext.llnl.gov/pdf/240921.pdf, (accessed 22 May 2012).

[21] Watts, R. J., & Porter, A. L. (1999). Mining Foreign Language Information Resources. *Proceedings, Portland International Conference on Management of Engineering and Technology, PICMET1999, July 1999, Portland, OR, USA*.

[22] Watts, R. J., Porter, A. L., & Minsk, B. (2004). Automated text mining comparison of Japanese and USA multi-robot research. *Fifth International Conference on Data Mining, Text Mining and their Business Applications, 15-17 Sep. 2004, Malaga, Spain*.

[23] Bookstein, A., Klein, T., & Raita, T. (1998). Clumping properties of content-bearing words. *Journal of the American Society for Information Science*, 49(2), 102-114.

[24] Bookstein, A., & Raita, T. (2000). Discovering term occurrence structure in text. *Journal of the American Society for Information Science and Technology*, 52(6), 476-486.

[25] Bookstein, A., Vladimir, K., Raita, T., & John, N. (2003). Adapting measures of clumping strength to assess term-term similarity. *Journal of the American Society for Information Science and Technology*, 54(7), 611-620.

[26] Berry, M.W., & Castellanos, M. (2008). *Survey of text mining II: clustering, classification, and retrieval*, New York, Springer.

[27] Beil, F., Ester, M., & Xu, X. (2002). Frequent term-based text clustering. *Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining, KDD2002*, http://dl.acm.org/citation.cfm?id=775110, (accessed 21 May 2012).

[28] Scime, A. (2005). *Web mining: applications and techniques*, Hershey, PA, Idea Group Pub.

[29] Homayouni, R., Heinrich, K., Wei, L., & Berry, M.W. (2005). Gene clustering by latent semantic indexing of MEDLINE abstracts. *Bioinformatics*, 21, 104-115.

[30] Xu, L., Furlotte, N., Lin, Y., Heinrich, K., & Berry, M.W. (2011). Functional Cohesion of Gene Sets Determined by Latent Semantic Indexing of PubMed Abstracts. *PLoS ONE*, 6(4), e18851.

[31] Landauer, T.K., McNamara, D.S., Denis, S., & Kintsch, W. (2007). *Handbook of Latent Semantic Analysis*, Mahwah, NJ, Erlbaum Associates.

[32] Maletic, J. I., & Marcus, A. (2001). Supporting program comprehension using semantic and structural information. *Proceedings of the 23rd International Conference on Software Engineering, ICSE2001*.

[33] Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. *Journal of Machine Learning Research*, 3, 993-1022.

[34] Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. *Proceedings of the National Academy of Sciences*, 101 (suppl. 1), 5228-5235.

[35] Hofmann, T. (1999). Probabilistic latent semantic indexing. *Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR1999*.

[36] Ando, R.K. (2000). Latent semantic space: iterative scaling improves precision of inter-document similarity measurement. *Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR2000*.

[37] Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. *Proceedings of the 23rd international conference on Machine learning, ICML2006*.

[38] Mimno, D., Li, W., & McCallum, A. (2007). Mixtures of hierarchical topics with Pachinko allocation. *Proceedings of the 24th international conference on Machine learning, ICML 2007*.

[39] Wang, X., & McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends. *Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD2006*.

[40] Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. *Proceedings of the 23rd international conference on Machine learning, ICML 2006*.

[41] Gruber, A., Rosen-Zvi, M., & Weiss, Y. *Hidden topic Markov models*, http://www.cs.huji.ac.il/~amitg/aistats07.pdf, (accessed March 20, 2012).

[42] Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. *Proceedings of the 20th conference on Uncertainty in artificial intelligence, UAI2004*.

[43] McCallum, A., Corrada-Emmanuel, A., & Wang, X. The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. http://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1024&context=cs_faculty_pubs, (accessed March 20, 2012).

[44] Mei, Q., Xu, L., Wondra, M., Su, H., & Zhai, C. (2007). Topic sentiment mixture: modeling facets and opinions in weblogs. *Proceedings of the 16th international conference on World Wide Web, WWW2007*.

[45] Ohsawa, Y., Benson, N.E., & Yachida, M. (1998). KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor. *Proceedings of the Advances in Digital Libraries Conference, ADL1998*.

[46] Tsuda, K., & Thawonmas, R. (2005). KeyGraph for Visualization of Discussions in Comments of a Blog Entry with Comment Scores. *World Scientific and Engineering Academy and Society (WSEAS) Trans. Computers*, 12(4), 1794-1801.

[47] Sayyadi, H., Hurst, M., & Maykov, A. (2009). Event Detection and Story Tracking in Social Streams. *Proceedings of the 3rd Int'l AAAI Conference on Weblogs and Social Media, ICWSM09, May 17-20, 2009, San Jose, California, USA*.

[48] Kongthon, A. (2004). A Text Mining Framework for Discovering Technological Intelligence to Support Science and Technology Management. Doctoral Dissertation, Georgia Institute of Technology, http://smartech.gatech.edu/bitstream/handle/1853/5151/kongthon_alisa_200405_phd.pdf.txt?sequence=2, (accessed 20 May 2012).

[49] Kongthon, A., Haruechaiyasak, C., & Thaiprayoon, S. (2008). Constructing Term Thesaurus using Text Association Rule Mining. *Proceedings of the 2008 Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology International Conference, ECTI2008*.

[50] Kongthon, A., & Angkawattanawit, N. (2007). Deriving Tree-Structured Network Relations in Bibliographic Databases. *Proceedings of the 10th International Conference on Asian Digital Libraries, ICADL 2007, December 10-13, Hanoi, Vietnam*.

[51] Courseault-Trumbach, C., & Payne, D. (2007). Identifying Synonymous Concepts in Preparation for Technology Mining. *Journal of Information Science*, 33(6).

[52] Watts, R.J., & Porter, A.L. (2003). R&D cluster quality measures and technology maturity. *Technological Forecasting and Social Change*, 70(8), 735-758.

[53] Porter, A.L., & Read, W. (1998). *The Information Revolution: Current and Future Consequences*, Westport, CT, JAI/Ablex.

[54] Zhang, Y., Porter, A. L., & Hu, Z. (2012). An Inductive Method for "Term Clumping": A Case Study on Dye-Sensitized Solar Cells. *The International Conference on Innovative Methods for Innovation Management and Policy, IM2012, 23-26 May 2012, Beijing, China*.

[55] Porter, A.L., Schoeneck, D.J., & Anderson, T.R. (2012). PICMET Empirically: Tracking 14 Management of Technology Topics. *Portland International Conference on Management of Engineering and Technology, PICMET2012, 29 July-2 August, Vancouver, Canada*.


**Chapter 4**

**A Semantic-Based Framework for Summarization and Page Segmentation in Web Mining**

Alessio Leoncini, Fabio Sangiacomo, Paolo Gastaldo and Rodolfo Zunino

http://dx.doi.org/10.5772/51178

Additional information is available at the end of the chapter

> © 2012 Leoncini et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

**1. Introduction**

The World Wide Web has become a fundamental resource of information for an increasing number of activities, and a huge information flow is exchanged today through the Internet for the widest range of purposes. Although large-bandwidth communications yield fast access to virtually any kind of contents by both human users and machines, the unstructured nature of most available information may pose a crucial issue. In principle, humans can best extract relevant information from posted documents and texts; on the other hand, the overwhelming amount of raw data to be processed calls for computer-supported approaches. Thus, in recent years, *Web mining* research tackled this issue by applying data mining techniques to Web resources [1].

This chapter deals with the predominant portion of the web-based information, i.e., documents embedding natural-language text. The huge amount of textual digital data [2, 3] and the dynamicity of natural language actually can make it difficult for an Internet user (either human or automated) to extract the desired information effectively: thus people every day face the problem of information overloading [4], whereas search engines often return too many results or biased/inadequate entries [5]. This in turn proves that: 1) treating web-based textual data effectively is a challenging task, and 2) further improvements are needed in the area of Web mining. In other words, algorithms are required to speed up human browsing or to support the actual crawling process [4]. Application areas that can benefit from the use of these algorithms include marketing, CV retrieval, laws and regulations exploration, competitive intelligence [6], web reputation, business intelligence [7], news articles search [1], topic tracking [8], and innovative technologies search. Focused crawlers represent another potential, crucial area of application of these technologies in the security domain [7, 9].
