**Abstract**

In this research paper, we present a system for named entity recognition (NER) and automatic document classification in an innovative knowledge management system for applied gaming. The objective of this project is to facilitate the management of machine learning-based named entity recognition models that can be used both for extracting different types of named entities and for classifying text documents from different sources on the Web. We present real-world use case scenarios and derive features for training and managing NER models with the Stanford NLP machine learning API. Then, the integration of our NER system with an expert rule-based system is presented, which allows an automatic classification of text documents into the different taxonomy categories available in the knowledge management system. Finally, we present the results of two evaluations. First, a functional evaluation demonstrates the portability of our NER system using a standard text corpus from the medical domain. Second, a qualitative evaluation was conducted to optimize the overall user interface of our system and enable a suitable integration into the target environment.

**Keywords:** named entity recognition, document classification, rule-based expert system, social network, applied gaming, knowledge management system

### **1. Introduction**

The European research project Realizing an Applied Gaming Ecosystem (RAGE) provides an innovative online portal and service-oriented platform for accessing and retrieving reusable software components and other related textual documents from the Web, such as research publications, source code repositories, issues, and online discussions. RAGE is used to support software reuse in the domain of applied gaming. Applied games (AG) or serious games (SG) aim at training, educating, and motivating players, instead of pure entertainment [1]. RAGE supports the integration with various social networks like Stack Exchange ("Hot questions") or GitHub ("Build software better"). For instance, RAGE includes facilities to connect with the Stack Exchange REST API, which enables an easy import of online discussions into its ecosystem. RAGE users can easily import multiple discussions from, for instance, the Stack Overflow social site, describe them with further meta-information, classify them using an integrated taxonomy management system, and finally retrieve useful information with faceted search, which enables drilling down large sets of documents.

Currently, the classification of text documents into existing taxonomies in RAGE is done manually. The user first has to analyze the content of each document to understand the context in which it is used. This is done by consulting the title and description of each imported document, as well as analyzing all related meta-information (like keywords and tags) associated with it. Once done, the user has to search for taxonomies that may be used to classify the imported document based on its content and metadata. This process can be very hard and requires the full attention of the user, because he or she needs to consult the document and taxonomy manually each time. With a large number of documents and multiple hierarchical taxonomies, it can be very time-consuming to classify documents in RAGE.

*Supporting Named Entity Recognition and Document Classification for Effective Text Retrieval*

*DOI: http://dx.doi.org/10.5772/intechopen.95076*

To solve this problem, *Named Entity Recognition (NER)* is generally applied, because it can extract various knowledge contents (like named entities) from natural language texts [2]. The extracted knowledge can then be used to automate the classification of text documents from various domains on the Web, using, for instance, an expert rule-based system. NER has been widely used to recognize named entities in medical reports [3], news articles [4], and software web documents [5, 6]. Techniques for NER range from rule-based, over machine learning (ML)-based, to hybrid methods. ML-based NER methods are more efficient on Web contents, because they include statistical models that can automatically recognize and classify named entities in very large and heterogeneous contents on the Web. Training a machine learning-based NER model is, however, very challenging. It requires, besides very good programming knowledge, dealing with different technologies and pipelines for text analysis, natural language processing (NLP), machine learning, and rule-based operations [7]. Errors in the initial stages of the pipeline can have snowballing effects on the pipeline's end performance. Therefore, facilitating the development, management, and execution of all necessary NER-related tasks and pipelines will not only reduce the effort to train new NER models, but also contribute to optimizing the performance of the whole system.

The goal of this research project is to develop and integrate a named entity recognition system into the RAGE ecosystem. The efficient integration of a NER system into the RAGE ecosystem will not only facilitate knowledge discovery (efficient extraction and analysis of named entities and their interrelationships), but also, enable an automatic classification of text documents into the existing taxonomies of the RAGE ecosystem.

After reviewing and comparing common systems and tools for named entity recognition and document classification, we present real-world use case scenarios and derive features for training and managing NER models with the Stanford NLP machine learning API. Then, the integration of our NER system with the Drools expert rule-based system is presented, which allows an automatic classification of text documents into the different taxonomy categories available in the knowledge management system. Finally, the results of a cognitive walkthrough are shown, which served as a qualitative evaluation to optimize the user interface and enable a suitable integration into the target system.

### **2. State of the art and related work**

#### **2.1 RAGE**

As stated earlier, the RAGE social platform can be used to import questions from the Stack Exchange platform and other text documents from the Web, which generally consist of a title, a description, and other metadata. RAGE includes a *taxonomy management system* that serves to organize and categorize these documents into existing, hierarchical taxonomies found in its ecosystem. *Taxonomy* is the practice and science of classifying things and concepts, including the principles underlying such classification [8]. It is used in RAGE to support faceted browsing, a technique that allows users to drill down a large number of search results, enabling faster information retrieval. However, it is hard to classify documents with multiple taxonomies. The user can easily mix up one with another while analyzing and classifying a document into multiple hierarchical taxonomies. Each individual document (including its metadata like title, description, and tags) has to be analyzed manually each time in order to understand the context in which the document is used, before making a proper classification into the existing taxonomies. This process can be very challenging and time-consuming, especially with many documents and various taxonomies having complex hierarchical structures. To fulfill the requirements of the project, a very desirable goal is to develop and integrate a named entity recognition system into RAGE that can automatically recognize and classify various kinds of named entities from the multiple social networks connected with the ecosystem, and then to apply an expert rule-based system that enables an automatic document classification by reasoning about the extracted named entities, the hierarchical taxonomies, and other textual features found in RAGE textual documents.

#### **2.2 Named entity recognition techniques**

NER techniques generally include handcrafted rules or statistical methods that rely on machine learning (ML) [2], or a combination of both. A NER technique is denoted as *rule-based* or *handcrafted* if all the parameters (including rules) that are used to identify and categorize named entities are defined manually by a human. *Machine learning*-based techniques use a computer to estimate those parameters automatically [7]. Existing ML techniques include *supervised learning* (parameter estimation is based on already annotated data), *semi-supervised learning* (parameter estimation uses only a small set of annotated data), and *unsupervised learning* (does not use annotated data for estimation). The most popular machine learning systems rely on Conditional Random Fields (CRF), the state-of-the-art statistical modeling method for sequential text labeling [9]. CRF has been widely used to support different NLP tasks, such as *part-of-speech tagging* [10], *sentence splitting* [11], and *NER* [12]. Developing a machine learning-based NER system is, however, very challenging and requires a lot of data for model training. Often, *gazetteers* (dictionaries of specific named entities) are introduced as additional features to recognize unknown named entities, i.e., words that were not used in the training process. Likewise, *regular expressions* can be applied to optimize ML models, because they detect more complex named entities like compound words [13].
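To illustrate the idea, the following is a toy sketch of both techniques (the entity classes, gazetteer entries, and the version pattern are invented for this example and are not part of RAGE):

```python
import re

# Hypothetical gazetteer: dictionaries of known named entities per class.
GAZETTEER = {
    "ENGINE": {"Unity", "Unreal", "Godot"},
    "LANGUAGE": {"Java", "Python", "C#"},
}

# A regular expression for a compound named entity that a statistical
# model may miss, e.g. an engine name followed by a version number.
VERSION_PATTERN = re.compile(r"\b(Unity|Godot)\s+\d{4}(\.\d+)*\b")

def tag_with_gazetteer(tokens):
    """Label each token with its gazetteer class, or 'O' if unknown."""
    labels = []
    for token in tokens:
        label = "O"
        for entity_class, entries in GAZETTEER.items():
            if token in entries:
                label = entity_class
                break
        labels.append(label)
    return labels
```

In a real pipeline, such lists and patterns would be passed to the ML toolkit as additional features rather than applied as a stand-alone tagger.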

*The Role of Gamification in Software Development Lifecycle*

Many factors can influence the performance of a NER system: a) The *language*: some NER systems were developed for one specific language like English. b) The *named entity type*: for instance, the class of a datetime can easily be found if it only contains absolute dates (2003; 6.2.2005; April 5, 2011), but it can be difficult to detect relative dates (next Saturday, in December). c) The *domain* of the processed texts (corpora): if a classifier was trained on juristic texts, it will be difficult for the same classifier to deal with material originating from bioinformatics. The standard measures for evaluating NER systems are *precision*, *recall*, and *F1*. Recall is the ratio of correctly annotated NEs to the total number of correct NEs. Precision is the ratio of correctly annotated NEs to the total number (correct and incorrect) of annotated NEs. The F1 score is calculated from precision and recall and describes the balance between both measures. Most NER tools include functions to calculate precision, recall, and F1 from a set of training and testing data.
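Using these definitions, the three measures can be computed directly; a minimal sketch (the function and variable names are ours, not taken from any particular NER tool):

```python
def ner_metrics(correct_annotated, total_annotated, total_correct):
    """Precision, recall and F1 as defined above.

    correct_annotated -- NEs the system annotated correctly
    total_annotated   -- all NEs the system annotated (correct and incorrect)
    total_correct     -- all NEs present in the gold-standard annotation
    """
    precision = correct_annotated / total_annotated
    recall = correct_annotated / total_correct
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a model that annotates 100 entities, 80 of them correctly, against a gold standard of 160 entities has a precision of 0.8 and a recall of 0.5.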


#### *2.2.1 Comparison of NER tools*

Many tools have been proposed in the literature for named entity recognition. We need to review and compare them to enable a suitable integration into our target system. Therefore, we introduce the following **selection criteria**: a) the selected tool should not be limited to a specific type of text or knowledge domain; b) it should include a rich set of NLP features (including *NER, POS tagging, Tokenization, Dependency Parsing, and Sentiment Analysis*); c) it must be stable, extendable, and distributed as open source, and it should have an active community of developers. Our solution is designed to classify a relatively small amount of data: the RAGE contents have a limited size and do not comprise many gigabytes of data. Therefore, we prefer to achieve good results with a high level of accuracy and do not need a very fast classification process, which often comes at the cost of accuracy.

Our tool comparison is based on the work of Pinto [14]. According to our selection criteria, we exclude from our comparison non-open-source tools, tools without NER support, and those focusing only on specific data. To compare state-of-the-art tools, we added SpaCy, Spark NLP, and Stanza to our list, because these tools have emerged in the last few years and may be relevant to our work.

**GATE ANNIE**<sup>1</sup> is a more general solution for various NLP tasks. It was first developed to help software engineers and researchers working in NLP, but has evolved into a more powerful system with an integrated user interface that supports different data preprocessing tasks and pipeline executions. GATE is distributed with an integrated information extraction system called ANNIE that supports NER and many other NLP tasks. ANNIE relies on the JAPE<sup>2</sup> specification language, which provides finite state transduction over annotations based on regular expressions. Using the GATE interface, users can capture the provenance of machine- and human-generated annotated data to create new metrics for NLP tasks like named entity recognition. Additional metrics for more specific scenarios can be added, but this requires an existing implementation in the RAGE architecture, which introduces the overhead of familiarization with the entire GATE architecture.

The **Natural Language Toolkit (NLTK)**<sup>3</sup> is a Python library that supports most of the common NLP tasks. It was launched in 2001 under the Apache license. Each NLP task is performed by an independent module, and it is possible to train a custom model for NER. The main disadvantage is that it lacks support for dependency parsing, and an interface for the standard Universal Dependencies<sup>4</sup> dataset is missing.

**Apache OpenNLP**<sup>5</sup> is written in Java and based on machine learning. Launched in 2004 and licensed under the Apache License, the software supports NER and many other NLP tasks, but it lacks support for dependency parsing.

The **Stanford CoreNLP**<sup>6</sup> is a Java-based tool suite from Stanford University that was launched in 2010. It supports all relevant NLP tasks, including NER and dependency parsing. CoreNLP can train new NER models independently of data types, languages, or domains. Its API includes more than 24 different annotators for text annotation, regular expressions, and language processing tasks. These annotators can easily be combined and executed sequentially in different pipelines. A REST service interface is also available, which can be used by external systems to execute different NLP tasks. Thus, CoreNLP can easily be integrated with a rule-based expert system to support automatic document classification in RAGE. Finally, the training of NER models is very flexible and customizable. CoreNLP includes nearly 100 parameters for CRF-based model training and performance fine-tuning, including options for adding gazetteer lists that can recognize unknown named entities. CoreNLP is licensed under the GPLv3 and has a very large and active community. Thus, state-of-the-art NLP methods and algorithms are permanently developed and integrated into the software.

<sup>1</sup> https://gate.ac.uk/ie/annie.html

<sup>2</sup> https://gate.ac.uk/sale/tao/splitch8.html

<sup>3</sup> https://www.nltk.org/

<sup>4</sup> https://universaldependencies.org/

<sup>5</sup> https://opennlp.apache.org/

<sup>6</sup> https://stanfordnlp.github.io/CoreNLP/

**Stanza**<sup>7</sup> is a Python library, developed by Stanford University as a possible successor to CoreNLP. It was launched in 2019 under the Apache license. Even though the system is rather new, it supports almost all features needed in our work; only sentiment analysis is missing. The ML models trained by CoreNLP are not directly supported in Stanza and need to be trained again. Stanza provides a client to connect to the CoreNLP server, so it is possible to use CoreNLP features over this interface, which, however, increases the complexity.

**SpaCy**<sup>8</sup> is one of the newer systems for NLP that was launched in 2015. It is written in Python and was published under the MIT license. It is designed for production use and aims to be easy to use and fast. SpaCy supports most of the common NLP features, including dependency parsing and facilities for training custom NER models, but it lacks support for sentiment analysis. The main disadvantage for our purpose is its focus on fast classification, which leads to a lower accuracy compared to other systems.

**Spark NLP**<sup>9</sup> is one of the most recent NLP tools, released in 2017. It is a library built on top of Apache Spark and TensorFlow. It supports Python, Java, and Scala, and focuses on usage in production systems. Due to the Apache Spark architecture, it has more dependencies to get up and running than the other systems. The supported NLP features include all relevant features, including dependency parsing and the training of custom NER models. Due to its young age, the community is not as large and active as those of the other tools. On Stack Overflow, only a few questions are tagged with "johnsnowlabs-spark-nlp", while the "stanford-nlp" tag has more than 3000 questions.

We decided to use the Stanford CoreNLP suite for our project. CoreNLP is the only NLP software that met all our requirements. The competitors may be better or faster in one subtask or another, but overall CoreNLP offers the best mix of all required features. Especially the rich feature set, in combination with an active and living community, is a huge advantage of Stanford CoreNLP compared to the other solutions.
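As an illustration of CoreNLP's training setup, a minimal properties file for its `CRFClassifier` could look as follows (the feature switches are taken from the publicly documented training example; the file names and column mapping are placeholders for our setting):

```properties
# Tab-separated training data: one token per line, columns word<TAB>label
trainFile = rage-ner-train.tsv
serializeTo = rage-ner-model.ser.gz
map = word=0,answer=1

# A small subset of the ~100 available feature switches
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true
```

Training would then be started with something like `java -cp "*" edu.stanford.nlp.ie.crf.CRFClassifier -prop rage-ner.prop`, and the serialized model can afterwards be loaded by the `ner` annotator.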

#### **2.3 Rule-based expert systems**

Expert systems are a rapidly growing technology of Artificial Intelligence (AI) that uses human expert knowledge for complex problem-solving in fields like health, science, engineering, business, and weather forecasting [15–17]. An expert system represents knowledge solicited from a human expert as data or production rules within a computer program [17]. These rules and data can be used to solve complex problems. For instance, a rule-based classification system can be applied to classify text documents into organized groups by applying a set of linguistic rules.

recall and describes the balance between both measures. Most NER tools have functions to calculate precision, recall and F1 from a set of training and testing data.

classification process which often results in lower accuracy.

*The Role of Gamification in Software Development Lifecycle*

tools arose in the last view years and may be relevant in our work.

overhead of familiarization with the entire GATE architecture.

many NLP tasks. But it lacks support for dependency parsing.

<sup>1</sup> https://gate.ac.uk/ie/annie.html <sup>2</sup> https://gate.ac.uk/sale/tao/splitch8.html

<sup>4</sup> https://universaldependencies.org/ <sup>5</sup> https://opennlp.apache.org/

<sup>6</sup> https://stanfordnlp.github.io/CoreNLP/

<sup>3</sup> https://www.nltk.org/

**66**

Many tools have been proposed in the literature for named entity recognition. We need to review and compare them to enable a suitable integration into our target system. Therefore, we introduce the following **selection criteria**: a) the selected tool should not be limited to a specific type of text or knowledge domain b) should include a rich set of NLP features (including *NER, POS,Tokenization, Dependency Parsing, Sentiment Analysis*), c) must be stable, extendable, distributed as opensource, and should have an active community of developers. Our solution is designed to classify a relatively small amount of data. The RAGE contents have a limited size and do not consist of many gigabytes of data. Therefore, we prefer to achieve good results with a high level of accuracy and do not need a very fast

Our tool comparison is based on the work of Pinto [14]. According to our selection criteria, we exclude from our comparison non-opensource tools, tools without NER support, and those focusing only on specific data. To compare stateof-the-art tools, we added SpaCy, Spark NLP and Stanza to our list, because these

**GATE ANNIE**<sup>1</sup> is a more general solution for various NLP tasks. It was first developed to help software engineers and researchers working in NLP but has evolved into a more powerful system with an integrated user interface, which supports different data preprocessing tasks and pipeline executions. GATE is distributed with an integrated information extraction system called ANNIE that supports NER and many other NLP tasks. ANNIE relies on the JAPE<sup>2</sup> specification language, which provides finite state transduction over annotations based on regular expressions. Using the GATE interface, users can capture the provenance of machine- and human-generated annotated data to create new metrics for NLP tasks like named entity recognition. Additional metrics for more specific scenarios can be added, but this requires an existing implementation in the RAGE architecture, which introduces the overhead of familiarization with the entire GATE architecture.

The **Natural Language Toolkit (NLTK)**<sup>3</sup> is a Python library that supports most of the common NLP tasks. It was launched in 2001 under the Apache license. Each NLP task is performed by an independent module, and it is possible to train one's own model for NER. The main disadvantage is that it lacks support for dependency parsing, and an interface for the standard Universal Dependencies<sup>4</sup> dataset is missing. **Apache OpenNLP**<sup>5</sup> is written in Java and based on machine learning. Launched in 2004 and licensed under the Apache License, the software supports NER and many NLP tasks, but it lacks support for dependency parsing.

The **Stanford CoreNLP**<sup>6</sup> is a Java-based tool suite from Stanford University that was launched in 2010. It supports all relevant NLP tasks, including NER and dependency parsing. CoreNLP can train new NER models independently of data types, languages, or domains. Its API includes more than 24 different


<sup>7</sup> https://stanfordnlp.github.io/stanza/

<sup>8</sup> https://spacy.io/

<sup>9</sup> https://nlp.johnsnowlabs.com/

The rules will instruct the system to use semantically relevant elements of the document and its contents to identify useful categories for automatic classification [18]. Over the last decades, many expert systems have been proposed, but essentially all of them are expressed using IF-THEN-like statements which contain two parts: the conditions and the actions. In the mathematical sense, a rule can be defined as X => Y, where X is the set of conditions (the antecedent) and Y is the set of actions (the consequent). Rules are used to represent and manipulate knowledge in a declarative manner, following first-order logic in an unambiguous, human-readable form while at the same time retaining machine interpretability. Rule-based systems generally include a "production memory" which contains a set of rules that are matched against facts stored in the "working memory" of an "inference engine" [40].


*Supporting Named Entity Recognition and Document Classification for Effective Text Retrieval*

*DOI: http://dx.doi.org/10.5772/intechopen.95076*


The **C Language Integrated Production System (CLIPS)** is a public domain software tool for building expert systems. It was developed by NASA in 1985 [19]. It has become one of the most used rule-based expert systems (RBES) in the market because of its efficiency and portability [20]. CLIPS was written in C and originally targeted C programming, but it now incorporates a complete object-oriented language for writing expert systems, called COOL. COOL combines the programming paradigms of procedural, object-oriented, and logical languages. While CLIPS can separate the knowledge base (the expert rules) from its inference logic, it is less user-friendly in the formulation of rules than many other systems [19].

Ten years after CLIPS, the **Java Expert System Shell (JESS)** was launched by Ernest Friedman-Hill of Sandia National Lab [19] as a Java-based implementation of the CLIPS system. It supports the development of rule-based expert systems that can be tightly coupled to Java code and is often referred to as an expert system shell [21]. JESS is compatible with the CLIPS rule language, but a declarative language (called JessML) is also available for specifying rules in XML. JESS is free to use for educational and governmental purposes, but it is not open-source software. There is no free source code under any available license<sup>10</sup>.

The **Drools** expert system is open-source software that was first developed by Bob McWhirter (in 2001) and later on absorbed by the JBoss organization (in 2005). Drools is based on Java, and its rule definitions rely on IF...THEN statements which are easier to understand than the syntax provided by CLIPS and JESS. Drools rules can also be specified using a native XML format. The rule engine is essentially based on the Rete algorithm [22], extended to support object-oriented programming in the rule formulation. Drools is available under the Apache Software Foundation's open-source license [23]. Because of its easy and far more readable rule syntax, Drools has been widely used as an expert system in various domains [6]. Therefore, we chose Drools to enable an automatic document classification in the RAGE ecosystem.
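For illustration, a Drools rule in its DRL syntax might look like the following sketch. The `Document` fact type, its `text` field, and the `addCategory` method are hypothetical stand-ins, not the actual rules used in our system:

```
rule "Suggest game engine category"
when
    // condition part: match a Document fact in working memory
    // whose text mentions the Unity engine
    $doc : Document( text.contains("Unity") )
then
    // action part: attach a taxonomy category to the matched fact
    $doc.addCategory("TOOL/ENG");
end
```

The engine repeatedly matches such rules in the production memory against the facts in the working memory, which is what makes the classification declarative rather than procedural.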

## **3. System design**

Our system design relies on the user-centered design (UCD) approach by [24], which has proven very successful in optimizing product usefulness and usability [25]. Applying UCD to design a system includes: a) understanding the context in which users may use the system, b) identifying and specifying the users' requirements, c) developing the design solutions, and finally, d) evaluating the design against the users' context and requirements.

<sup>10</sup> https://jess.sandia.gov/jess/FAQ.shtml


Our system allows any user (expert or novice developer) to customize and train a machine learning-based NER model in their domain of expertise. In the target system, the user starts with a named entity recognition definition, which is a set of parameters and configuration steps to train a named entity recognition model using machine learning. With the support of the system, the user can upload a text corpus and define the named entity categories and the named entity names (including their related synonyms) based on the requirements of the target domain. Then, he or she can customize all the conditional random field (CRF) and optimization parameters used to train a model with machine learning. The information about the NE categories, the NE names, and their related synonyms is used for the automatic annotation of the text corpus, using the BIO annotation mechanism which is integrated into our system. This is very useful because machine learning-based NER systems generally require a lot of annotated data for model training. However, while the system is able to suggest a first annotation of the text corpus, which can then be used for training and testing, it is necessary for the user to customize the testing data to avoid overfitting issues, which may lead to a very poor quality of the trained model [7]. Once a NER model is trained, the user can finally use it to construct flexible rules (by referring to the extracted named entities in the text) for automatic document classification in various domains. These rules are business rules and are constructed using a rule-based expert system. They will be used to represent and manipulate knowledge in a declarative manner using a set of WHEN … THEN statements in a human-readable form. The next sections provide an overview of relevant use cases and describe the overall architecture of the system.
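To illustrate the BIO mechanism mentioned above, the following sketch tags each token of a sentence with B- (begin), I- (inside), or O (outside) labels for one entity category. The entity list, category name, and class name are illustrative assumptions, not our system's actual implementation:

```java
import java.util.*;

// Illustrative sketch of BIO annotation for pre-annotating a training corpus:
// tokens beginning an entity get "B-<CATEGORY>", continuation tokens get
// "I-<CATEGORY>", and all other tokens get "O".
public class BioAnnotator {
    static List<String> annotate(List<String> tokens, List<String> entity, String category) {
        List<String> tags = new ArrayList<>(Collections.nCopies(tokens.size(), "O"));
        for (int i = 0; i + entity.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + entity.size()).equals(entity)) {
                tags.set(i, "B-" + category);             // entity start
                for (int j = 1; j < entity.size(); j++) {
                    tags.set(i + j, "I-" + category);     // entity continuation
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("We", "use", "Unreal", "Engine", "here");
        System.out.println(annotate(tokens, List.of("Unreal", "Engine"), "ENGINE"));
        // [O, O, B-ENGINE, I-ENGINE, O]
    }
}
```

A real pre-annotator would additionally match the configured synonyms of each entity name; the principle of emitting B/I/O tags per token stays the same.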

## **3.1 Use case**


Our use case diagram in **Figure 1** describes all tasks for a user to create a NER model definition, train a model, manage it, and finally use the trained model to support automated document classification in RAGE. We call our system Stanford Named Entity Recognition and Classification (SNERC), as it relies on Stanford NLP for NER and on Drools for document classification. Our actor is a registered and logged-in user in KM-EP. There are four main actions that can be executed by the user: **1) "Manage NER model definition"**. This includes uploading a data dump for use in the target domain, defining the corresponding NE categories, names, and synonyms, customizing CRF and performance parameters, adding regular expressions to identify complex named entities (like Java 11.0), and preparing the NER model, which includes features for the automatic annotation of the text corpus and the splitting of the annotated text into training and testing data. Finally, the NER model is trained using CronJobs and the Stanford NLP machine learning API. **2) "Manage NER model"**. This covers the management of the created NER models: reviewing performance indicators like precision, recall, and F1, editing and deleting NER models, and uploading already existing NER models into the system. **3) "Manage classifier parameter definition"**. This action deals with adding, editing, or deleting the business rules that are used for classifying text documents into existing taxonomies. To create new rules, the user can select the taxonomies and NER models that are relevant for his or her specific domain. **4) "Edit content"**. This action describes the steps where a KM-EP content is edited and the automated classification suggestion is retrieved, supervised, and saved.

**Figure 1.** *SNERC use case.*
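The performance indicators reviewed under "Manage NER model" are computed from entity-level counts. A minimal sketch (class name and example counts are our own illustration):

```java
// Sketch of the performance indicators precision, recall, and F1,
// computed from true-positive (tp), false-positive (fp), and
// false-negative (fn) entity counts.
public class NerMetrics {
    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }

    // F1 is the harmonic mean of precision and recall.
    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // e.g. 80 entities found correctly, 20 spurious, 40 missed
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n",
                precision(80, 20), recall(80, 40), f1(80, 20, 40));
    }
}
```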


#### **3.2 Taxonomies in serious games development**

Our system is developed to enable automatic document classification into hierarchical taxonomies. Since our research is applied to the domain of serious games development, we need to review existing taxonomies and find out which ones may be useful to validate our approach. We can refer to our previous study about software search during serious games development [26] to figure out which taxonomies may be relevant for the domain of serious games. In this research [26], we applied LDA statistical topic modeling to automatically discover 30 topics about serious games development, of which the following belong to the most popular ones: *Programming and Scripting Language, 3D-Modeling, Game Design, Rendering, Game Engines, Game Physics, Networking, Platform*, and *Animation*. We can now review the current state of the art in taxonomies for serious games and select a list of taxonomies to be used in our proof-of-concept.

Taxonomies in serious games have many aspects and dimensions. The taxonomies most relevant for our work are related to 1) *game genre*, 2) *programming languages*, 3) *video game tools*, 4) *machine learning algorithms*, and 5) *video game specification* and *implementation bugs*. Many researchers have proposed different hierarchical taxonomies in the domain of serious games. Their main objective was to elucidate the important characteristics of popular serious games and to provide a tool through which future research can examine their impact and ultimately contribute to their development [27]. Our first classification taxonomy reflects the *game genre* [GEN], as it is one of the basic classification schemes proposed by researchers in the classification of serious games [27–30]. A serious game can be classified based on the market [GEN/MAR] (e.g., Education, HealthCare, Military), the game type [GEN/TYPE] (board game, card game, simulation, role-playing game, toys, etc.), or the platform [GEN/PLA] on which the game runs (Browser, Mobile, Console, PC) [27]. Many Stack Overflow discussions are already tagged with specific words like "education", "board-game", "simulation", or "console". Therefore, we want to classify SG-related discussions in the game genre dimension. Second, our analysis of SG-related online discussions in Stack Overflow has revealed that developers of serious games are generally concerned with finding ways to implement new features using a specific programming (or scripting) language [LANG]. So, a taxonomy in the programming language dimension is essential. To classify programming languages, we refer to Roy's work [31] and use the *programming paradigm* as the main


attribute in our work. We focus on serious game development, where existing game engines and tools for classic video game development are used, and we want to classify the Stack Overflow posts in this way. Third, [30] proposed a lightweight taxonomy to standardize the definition of common *tools*, *development environments* [TOOL/IDE], and game engines [TOOL/ENG] that are used for game development. We can use this taxonomy as a classification scheme for the Stack Overflow posts. Fourth, another aspect is *machine learning* [ML], the most trending aspect in serious games development. Machine learning is one of the main techniques used in reusable software components [32] and for creating intelligent learning systems. For instance, pedagogical systems use observational data to improve their adaptive ability, instead of relying on theoretical guidelines [33]. This motivates us to integrate a machine learning-based classification scheme into our work. [34] created such a scheme and gave a brief overview of state-of-the-art machine learning algorithms. We will use it in our work for classifying posts in the machine learning dimension. Our final dimension concerns video game bugs [BUG]. As shown in our study, one of the main concerns of serious games developers (like most software developers) is to find solutions to fix their bugs, whether during the design or the implementation of their games. [35] developed in 2010 a taxonomy for video game bugs, which differentiates between specification bugs [BUG/SPEC] and implementation bugs [BUG/IMP]. A specification bug generally refers to a wrong requirement in the game design document. This may involve missing critical information, conflicting requirements, or incorrectly stated requirements. A bug in an implementation is an error found in any asset (source code, art, level design, etc.) that is created to turn the specification into a playable game [36]. A failure in an implementation is generally a deviation of the game's operation from the original game specification [35].

#### **3.3 Drools extensions for document classification**

This section presents our Drools extensions that are relevant to enabling a flexible classification of text documents into the RAGE taxonomies. Our feature extensions rely on techniques for *Linguistic Analysis*, *Syntactic Pattern Matching*, and *Document Structure Analysis*. Our classification system will be implemented as a standalone RESTful web service so that it can be easily integrated into RAGE and any other external systems that may need to classify documents into predefined taxonomies.

**Linguistic Analysis**. We use the Stanford NLP API to support linguistic analysis in our system. Stanford NLP supports many NLP tasks like part-of-speech (POS) tagging, tokenization, and NER. By analyzing specific parts of speech and recognizing various mentions of named entities in discussion sentences, we can analyze the syntactic structure of each sentence. Then, we can refer to the **sentence components** (subject, predicate, object), the *sentence form* (whether it is *affirmative* [37] or *negative*), and the *sentence mood* (whether it is *interrogative* or *declarative*) to understand the structure of each sentence and derive its meaning. A similar approach was proposed by [37] for the classification of Stack Overflow discussions into software engineering-related facets, but that approach relied on hand-crafted rules for recognizing named entities in discussion posts. Instead of applying hand-crafted rules for NER, we will rely on our NER system to extract SG-related named entities (like game genres, programming languages, or game engines) from the existing text documents. To detect the *sentence form* and determine if a sentence is positive or negative, we will rely on the *StanfordNLP Sentiment Analysis API*<sup>11</sup>, as it


includes a machine learning-based API for this purpose. We will rely on regular expressions to determine the *sentence mood*. We consider a sentence to be *interrogative* if it contains a question mark or if it starts with an interrogative word (what, how, why, etc.), e.g., "what is the best way to record the player's orientation?"; otherwise, the sentence is *declarative*. Using our linguistic analysis features, we can understand the meaning of each individual sentence and use this information to derive the semantics of a document. Then, it becomes easier to group documents with similar semantics into a single taxonomy.

<sup>11</sup> https://nlp.stanford.edu/sentiment/index.html
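The sentence mood heuristic described above can be sketched with a single regular expression. The regex and class name below are our own illustrative choices, not the system's actual implementation:

```java
import java.util.regex.Pattern;

// Sketch of the regex heuristic for sentence mood: a sentence is treated as
// interrogative if it contains a question mark or starts with an
// interrogative word; otherwise it is declarative.
public class SentenceMood {
    private static final Pattern INTERROGATIVE = Pattern.compile(
            "(?i)^\\s*(what|how|why|when|where|who|which)\\b.*|.*\\?\\s*$");

    static boolean isInterrogative(String sentence) {
        return INTERROGATIVE.matcher(sentence).matches();
    }

    public static void main(String[] args) {
        System.out.println(isInterrogative("what is the best way to record the player's orientation?"));
        System.out.println(isInterrogative("The engine renders the scene."));
    }
}
```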

**Syntactic Pattern Matching**. Research on web content mining has demonstrated that certain lexico-syntactic patterns matched in texts convey a specific relation [38]. Liu's study has revealed that many online questions belonging to similar topics have similar syntactic patterns. They found that programming languages usually appear after a preposition, like **with** Java or **in** JavaScript. After carefully analyzing the title and description of some SG-related topics in Stack Overflow, we could easily observe a similar behavior for game genres, game engines, and tools, such as **for** educational games, **in** Unity 3D, **with** GameMaker, etc. Thus, the categories of a question can be derived based on the syntactic patterns of its sentences.
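A minimal sketch of such a preposition-based pattern follows. The capitalized-term regex is a simplified stand-in for the output of a trained NER model, and all names are illustrative:

```java
import java.util.*;
import java.util.regex.*;

// Sketch of lexico-syntactic pattern matching: capitalized terms that follow
// a preposition such as "with", "in", or "for" often indicate the technology
// a question is about (e.g. "in Unity 3D", "with GameMaker").
public class PrepositionPattern {
    private static final Pattern PATTERN =
            Pattern.compile("\\b(with|in|for)\\s+([A-Z][\\w.]*(?:\\s+\\d[\\w.]*)?)");

    static List<String> matchEntities(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            hits.add(m.group(2)); // the capitalized term after the preposition
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(matchEntities("How do I load textures in Unity 3D with GameMaker?"));
    }
}
```

In the full system, the second capture group would be replaced by entity mentions extracted by the NER model, so the pattern generalizes beyond capitalized terms.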

To generalize these patterns, we associate a semantic class with each term. We will rely on the WordNet API<sup>12</sup> to create semantic classes of candidate synonyms using standardized terms. When a new term is added, all its synonyms should be identified using WordNet and then considered for inclusion. By combining different terms and synonyms, we can discover a wide range of expressions, term combinations, and phrases used in the majority of SG-related discussions. For instance, the term combination <*Best*> <*Way*> can be used to identify posts containing the expressions "best way", "best strategy", "proper design", "optimal solution", etc. This will allow us to have a more generic syntactic pattern definition that can easily scale to different domains compared to [37]'s system (**Table 3**).

| Taxonomy Category | Term | Term synonyms |
| --- | --- | --- |
| — | <*howto*> | How to, How do (I, we), How can (I, we), How should (I, we) |
| Programming Language | <*implement*> | implement, develop, code, create, construct, build, set |
| Game Engine | <*configure*> | configure, setup, adjust, adapt, optimize |
| Specification Bug | <*specify*> | design, require, define, determine, redefine |
| … Bug | <*error*> | error, bug, defect, exception, warning, mistake |
| … Bug | <*fix*> | fix, solve, remove, get rid of, eliminate |

**Table 3.**

**Document Structure Analysis**. This feature is used to explore the structure of online text documents. We can refer to specific HTML elements to find out if a document contains code snippets (<*code*> … <*/code*>), bullet points (<*ul*> … <*/ul*>), or even images (<*img*>). Exploring the structure of online discussions can help us to classify documents into specific taxonomies like *Programming Languages* or *Video Game Bugs*. A quality study of Stack Overflow online discussions [39] has revealed that the explanations (generally represented using bullet points in the question bodies) accompanying code snippets are as important as the snippets themselves. Also, existing survey research on document structure analysis has demonstrated that analyzing the hierarchy of the physical components of a web page can be very useful in indexing and retrieving the information contained in the document [40]. For instance, if a Stack Overflow post contains the word "bug" in its title, and one or more code snippets in its body, then it may be assigned to the *Implementation Category* of the *Video Game Bug Taxonomy*. Generally, such a discussion would include sentences like "How to **fix** my bug in … " or "How can I

**Pattern Description**

*Patterns for document structure analysis.*

<sup>12</sup> https://wordnet.princeton.edu/

LS Text contains multiple bullet points as HTML list CS Text contains one or multiple code snippets

IM Text contains one or multiple images followed by a text description

Implementation Bug Specification Bug

**Table 2.** *List of synonyms.*

**Table 1** shows the list of our syntactic patterns that can be used to classify Stack Overflow discussions into taxonomies of the RAGE system. Our syntactic pattern definition is based on a rich set of terms, term combinations, and standardized synonyms (**Table 2**), that we observed in various Stack Overflow discussions. Applying synonyms in our approach is very important to automatically detect name variations in text and enable a classification to perform better. For instance, we can use a pattern that includes the term "implement" and use the same pattern to identify texts that include the term "develop" or "build". To achieve this goal, we will need to create a domain dictionary with a set of semantic classes, each of which includes a standardized term and its synonyms [37].

For each parameter in our defined template shown in **Table 2**, and for each taxonomy and category that the template applies to, we will use a list of popular terms found in Stack Overflow to instantiate our template and created a semantic


**Table 1.** *List of syntactic patterns.* *Supporting Named Entity Recognition and Document Classification for Effective Text Retrieval DOI: http://dx.doi.org/10.5772/intechopen.95076*


**Table 2.**

includes a machine learning-based API for this purpose. We will rely on regular expressions to determine the *sentence mood*. We will consider a sentence to be *interrogative*, if it contains a question mark, or if it starts with an interrogative word (what, how, why, etc.) (e.g. what is the best way to record player's orientation?), otherwise the sentence is *declarative*. Using our linguistic analysis features, we can understand the meaning of each individual sentence, and use this information to derive the semantic of a document. Then, it becomes easier to group documents

**Syntactic Pattern Matching**. Research on web content mining has demonstrated that certain lexico-syntactic patterns matched in texts convey a specific relation [38]. Liu's study has revealed that many online questions belonging to similar topics have similar syntactic patterns. They found that many programming languages usually appear after a preposition, like **with** Java, **in** JavaScript. After carefully analyzing the title and description of some SG-related topics in Stack Overflow, we could easily observe similar behavior for game genres, game engines and tools, such as **for** educational games, **in** Unity 3D, **with** GameMaker, etc. Thus, the categories of a question can be derived based on the syntactic patterns of its

**Table 1** shows the list of our syntactic patterns that can be used to classify Stack Overflow discussions into taxonomies of the RAGE system. Our syntactic pattern definition is based on a rich set of terms, term combinations, and standardized synonyms (**Table 2**), that we observed in various Stack Overflow discussions. Applying synonyms in our approach is very important to automatically detect name variations in text and enable a classification to perform better. For instance, we can use a pattern that includes the term "implement" and use the same pattern to identify texts that include the term "develop" or "build". To achieve this goal, we will need to create a domain dictionary with a set of semantic classes, each of which

For each parameter in our defined template shown in **Table 2**, and for each taxonomy and category that the template applies to, we will use a list of popular terms found in Stack Overflow to instantiate our template and created a semantic

TT Term combination < *term*1 > <*term*2> appears in a sentence TTSG Term combination < *term*1 > <*term*2> appears in the subject group TTOB Term combination < *term*1 > <*term*2> appears in the object group TTPB Term combination < *term*1 > <*term*2> appears before a preposition

having similar semantic into a single taxonomy.

*The Role of Gamification in Software Development Lifecycle*

includes a standardized term and its synonyms [37].

SA The sentence is affirmative SI The sentence is interrogative SP The sentence is positive SN The sentence is negative

PA Entity or Term appears after a preposition PB Entity or Term appears before a preposition SG Entity or Term appears in the subject group PG Term appears in the predicate group OG Entity or Term appears in the object group

**Pattern Description**

sentences.

**Table 1.**

**72**

*List of syntactic patterns.*

*List of synonyms.*

class with each term. We will rely on the WordNet API<sup>12</sup> to create semantic classes of candidate synonyms using standardized terms. When a new term is added, all its synonyms should be identified using WordNet and then considered for inclusion. By combining different terms and synonyms, we can discover a wide range of expressions and term combinations and phrases used in the majority of SG-related discussions. For instance, the term combination <*Best* > <*Way*> can be used to identify posts containing the expressions: "best way", "best strategy", "proper design", "optimal solution", etc. This will allow us to have a more generic syntactic pattern definition that can easily scale in different domains compared to [37]'s system (**Table 3**).

**Document Structure Analysis**. This feature is used to explore the structure of online text documents. We can refer to specific HTML elements to find out if a document contains a code snippets (< *code*> … <*=code* >), bullet points (<*ul* > … <*=ul* >), or even images (<*img=*> ). Exploring the structure of online discussion can help us to classify documents into specific taxonomies like *Programming Languages* or *Video Game Bugs*. A quality study of Stack Overflow online discussion [39] has revealed that explanations (generally represented using bullet points in the question bodys) accompanying code snippets are as important as the snippets themselves. Also, existing survey research on document structure analysis has demonstrated that analyzing the hierarchy of physical components of a web page can be very useful in indexing and retrieving the information contained in this document [40]. For instance, if a Stack Overflow post, contains the word "bug" in its title, and one or more code snippets in its body, then it may be assigned to the *Implementation Category* of the *Video Game Bug Taxonomy*. Generally, such a discussion would include sentences like "How to **fix** my bug in … " or "How can I


**Table 3.**

*Patterns for document structure analysis.*

<sup>12</sup> https://wordnet.princeton.edu/

**solve** this issue... in my game" in its title or description body. Similarly, if a bug discussion includes terms like "requirement, design, or specification" in its title (e.g. I want to **fix** ... in my **specification**), with multiple bullet points in its description body, then it may indicate that the user is seeking help to solve an issue in a particular section of its design specification. In this case, the discussion post may be classified into the *Specification Bug* category of the *Video Game Bug Taxonomy*.
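The Table 3 structure patterns (LS, CS, IM) can be sketched in Java using plain substring matching on the raw HTML; the class and method names are our own, and a real implementation would use an HTML parser instead:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the document-structure patterns: LS (HTML list),
// CS (code snippet), IM (image), detected via substring matching.
public class StructurePatterns {
    public static List<String> match(String html) {
        List<String> patterns = new ArrayList<>();
        if (countOccurrences(html, "<li>") > 1) patterns.add("LS"); // multiple bullet points
        if (html.contains("<code>"))            patterns.add("CS"); // code snippet
        if (html.contains("<img"))              patterns.add("IM"); // image
        return patterns;
    }

    private static int countOccurrences(String s, String sub) {
        int count = 0, idx = 0;
        while ((idx = s.indexOf(sub, idx)) != -1) { count++; idx += sub.length(); }
        return count;
    }

    public static void main(String[] args) {
        String post = "<p>My bug:</p><ul><li>step one</li><li>step two</li></ul>"
                    + "<code>player.rotate(90);</code>";
        System.out.println(match(post)); // [LS, CS]
    }
}
```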

gazette lists, and regex. We use the Stanford RegexNER API to construct and store complex rules, as they can easily be combined with already trained models.
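As an illustration of such rules, a RegexNER mapping file is tab-separated: each line gives a token-level pattern, the entity class to assign, optionally the classes it may overwrite, and a priority. The entity class `GAME_ENGINE` and the entries below are our own examples, not taken from the running system:

```
Unity	GAME_ENGINE	MISC	1.0
Unreal Engine	GAME_ENGINE	MISC	1.0
GameMaker|Godot	GAME_ENGINE	MISC	1.0
```

The second entry matches the two-token sequence "Unreal" followed by "Engine"; the third uses a per-token regex alternation.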


**NER Model Trainer** is our second component; it is used to prepare a NER model. This includes the automatic annotation of the domain text corpus (or data dump) based on the previously defined NE categories, NE names, and synonyms. Our system is also able to split the annotated text corpus into testing and training data. The testing data, however, needs to be reviewed by a human expert and uploaded again to avoid overfitting and thus allow a realistic calculation of precision, recall, and F1 scores. When this is done, the NER Model Trainer component can execute the task of training a NER model using jobs and the Stanford CoreNLP. As the NER Model Trainer is written in Java and KM-EP is a PHP project, we designed it as a separate REST service component. This has further advantages. First, the service can be developed independently and does not affect KM-EP. Second, this service can be used separately from KM-EP as it is defined as a REST API. Other external systems just need to define the input data in a JSON format and send it via an HTTP REST call to this service. The NER Model Trainer has a class called *NER Model Definition*, which represents the corresponding GUI components in KM-EP. The Trainer class is used to control the training process.

**NER Model Manager**. This component is very straightforward since it only serves the storage of the trained NER models in the KM-EP filesystem so that they can be used by other systems like a linguistic analyzer or our document classification system. If a model is prepared with a NER Model Definition, users can update the created testing and training data within the NER Model Manager to get better precision, recall, and F1 scores. Also, the created Stanford RegexNER rules can be edited and updated. It is also possible to upload a StanfordNLP NER model that was trained with another system and use it in KM-EP. **Figure 3** shows an example of a recognized named entity with the NER Model Manager.

**Figure 3.** *Example of a recognized named entity.*

**Classification Parameter Definition Manager**. This component is used to manage and store business rules in KM-EP. To construct business rules that mention named entities and can be used to classify documents into existing taxonomy categories, the design of the "Classification Parameter Definition Manager" component needs to include links to the "NER Model Manager", "Content Manager", and "Taxonomy Manager" of KM-EP. We use the *Simple Knowledge Organization System (SKOS)* as the unique connection between our business rules and the taxonomy categories found in KM-EP, as each taxonomy category in KM-EP has a SKOS persistent identifier representing the category.

**NER Classify Server**. The NER Classify Server is our last component. It is developed as a standalone RESTful service to classify documents into taxonomies. To execute a document classification, the NER Classify Server needs information about the document (title, description, tags), the Drools rule, and references to the NER models, so that named entities can be used in the rule formulation. This information is sent to the server from KM-EP in a JSON format. With the provided document data and the references to the NER models, the server can then execute the NER, perform the synonym detection (with WordNet), and execute the linguistic analysis and syntactic pattern matching on the document structure and content. This analysis is done in the "classify()" method of a Java object called Document. The analysis result is then stored in the properties of this object and can be used
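For illustration, a request payload along these lines could be sent from KM-EP to the server; the field names and values below are our own assumptions, not the actual KM-EP wire format:

```json
{
  "title": "How can I fix this collision bug in Unity?",
  "description": "<p>My player falls through the floor ...</p>",
  "tags": ["unity3d", "collision"],
  "rule": "rule \"...\" when ... then ... end",
  "nerModels": ["game-engine.ser.gz", "programming-language.ser.gz"]
}
```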

Our feature extensions are very flexible and can easily be combined to construct even more complex rules in the Drools language. There are also no limitations on adding new extensions for document classification in our system (**Table 4**).
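As a hedged sketch of what such a combined rule could look like in DRL, the following condition joins a structure pattern, a NER result, and a term match; the fact fields, helper methods, and SKOS identifier are invented for illustration:

```
rule "Implementation bug in a game engine"
when
    // CS structure pattern (code snippet), a GAME_ENGINE entity, and the term "bug"
    $d : Document( matchesPattern("CS"), hasEntity("GAME_ENGINE"), titleContains("bug") )
then
    $d.assignCategory("skos:VideoGameBug/Implementation");
end
```

Because every feature is exposed as a boolean property of the Document fact, new extensions simply add further constraints that can be freely combined in the `when` part.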
