**3.4 System architecture of SNERC**

This section presents the system architecture of SNERC. Based on our use cases, we have defined five main components, which we describe here (**Figure 2**).

**Figure 2.** *Model of the conceptual architecture.*

#### *Supporting Named Entity Recognition and Document Classification for Effective Text Retrieval DOI: http://dx.doi.org/10.5772/intechopen.95076*

**NER Model Definition Manager** manages all the necessary definitions and parameters for model training using machine learning. It includes three main classes. The first two, Named Entity Category and Named Entity, hold information about the names and categories of the domain-specific named entities. The third class, NERModelDefinition, is used to store data such as the model name, text corpus, gazette lists, and regex. We use the Stanford RegexNER API to construct and store complex rules, as they can easily be combined with already trained models.
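As a minimal sketch of how these three classes might relate, the following Java records model a category, a named entity with its synonyms, and a model definition. The field names, the sample values, and the `categoriesInUse` helper are assumptions made for illustration; the chapter does not give the actual class layout.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class NerDefinitionSketch {

    // A domain-specific category such as LANG or TOOL.
    record NamedEntityCategory(String name) {}

    // A named entity, its synonyms, and the category it belongs to.
    record NamedEntity(String name, List<String> synonyms, NamedEntityCategory category) {}

    // Hypothetical layout of the data a model definition stores.
    record NERModelDefinition(String modelName,
                              String textCorpusPath,
                              List<NamedEntity> gazetteEntries,
                              List<String> regexRules) {

        // Collects the distinct category names referenced by the gazette entries.
        Set<String> categoriesInUse() {
            return gazetteEntries.stream()
                    .map(entry -> entry.category().name())
                    .collect(Collectors.toSet());
        }
    }

    // Builds an illustrative definition; all values are made up.
    static NERModelDefinition example() {
        NamedEntityCategory lang = new NamedEntityCategory("LANG");
        NamedEntityCategory tool = new NamedEntityCategory("TOOL");
        return new NERModelDefinition(
                "example-model",
                "corpus/example-dump.txt",
                List.of(new NamedEntity("C#", List.of("csharp"), lang),
                        new NamedEntity("cocoa-touch", List.of(), tool)),
                List.of());
    }

    public static void main(String[] args) {
        System.out.println(example().categoriesInUse());
    }
}
```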

**NER Model Trainer** is our second component and is used to prepare a NER model. This includes the automatic annotation of the domain text corpus (or data dump) based on the previously defined NE categories, NE names, and synonyms. Our system is also able to split the annotated text corpus into testing and training data. The testing data, however, needs to be reviewed by a human expert and uploaded again to avoid overfitting and thus to allow a realistic calculation of precision, recall, and F1 scores. When this is done, the NER Model Trainer component can execute the task for training a NER model using jobs and Stanford CoreNLP. As the NER Model Trainer is written in Java and KM-EP is a PHP project, we designed it as a separate REST service component. This has further advantages. First, the service can be developed independently and does not affect KM-EP. Second, the service can be used separately from KM-EP, as it is defined as a REST API. Other external systems just need to define the input data in JSON format and send it via an HTTP REST call to this service. The NER Model Trainer has a class called *NER Model Definition*, which represents the corresponding GUI components in KM-EP. The Trainer class is used to control the training process.
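The splitting and scoring steps mentioned above can be sketched as follows. This is only an illustration under our own assumptions: the class name, the seeded shuffle, and the counting of true positives, false positives, and false negatives are not taken from the SNERC implementation, which relies on Stanford CoreNLP for training and evaluation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainerEvaluationSketch {

    // Training and testing portions of an annotated corpus.
    record Split(List<String> training, List<String> testing) {}

    // Shuffles with a fixed seed so the split is reproducible, then cuts off the test share.
    static Split split(List<String> annotatedSentences, double testFraction, long seed) {
        List<String> shuffled = new ArrayList<>(annotatedSentences);
        Collections.shuffle(shuffled, new Random(seed));
        int testSize = (int) Math.round(shuffled.size() * testFraction);
        return new Split(
                List.copyOf(shuffled.subList(testSize, shuffled.size())),
                List.copyOf(shuffled.subList(0, testSize)));
    }

    // Share of predicted entities that are correct.
    static double precision(int truePositives, int falsePositives) {
        int predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    // Share of expected entities that were found.
    static double recall(int truePositives, int falseNegatives) {
        int expected = truePositives + falseNegatives;
        return expected == 0 ? 0.0 : (double) truePositives / expected;
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        Split split = split(List.of("s1", "s2", "s3", "s4", "s5"), 0.2, 42L);
        System.out.println(split.training().size() + " training, " + split.testing().size() + " testing");
        double p = precision(8, 2);
        double r = recall(8, 4);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", p, r, f1(p, r));
    }
}
```

Scores computed on testing data that was annotated automatically with the same dictionary would be overly optimistic, which is why the expert review of the testing portion matters.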

**NER Model Manager**. This component is straightforward, since it only stores the trained NER models in the KM-EP filesystem so that they can be used by other systems, such as a linguistic analyzer or our document classification system. If a model is prepared with a NER Model Definition, users can update the created testing and training data within the NER Model Manager to get better precision, recall, and F1 scores. The created Stanford RegexNER rules can also be edited and updated. It is also possible to upload a StanfordNLP NER model that was trained with another system and use it in KM-EP. **Figure 3** shows an example of a recognized named entity with the NER Model Manager.

**Classification Parameter Definition Manager**. This component is used to manage and store business rules in KM-EP. To construct business rules that mention named entities and can be used to classify documents into existing taxonomy categories, the design of the "Classification Parameter Definition Manager" component needs to include links to the "NER Model Manager", "Content Manager", and "Taxonomy Manager" of KM-EP. We use the *Simple Knowledge Organization System (SKOS)* as the unique connection between our business rules and the taxonomy categories found in KM-EP. Each taxonomy category in KM-EP has a SKOS persistent identifier that represents it.

**NER Classify Server**. The NER Classify Server is our last component. It is developed as a standalone RESTful service to classify documents into taxonomies. To execute a document classification, the NER Classify Server needs information about the document (title, description, tags), the Drools rule, and references to the NER models, so that named entities can be used in the rule formulation. This information is sent to the server from KM-EP in JSON format. With the provided document data and the references to the NER models, the server can execute the NER, perform the synonym detection (with WordNet), and execute Linguistic Analysis and Syntactic Pattern Matching on the document structure and content. This analysis is done in the "classify()" method of a Java object called Document. The analysis result is then stored in the properties of this object and can be used during the execution of Drools rules.

**Figure 3.** *Example of a recognized named entity.*

**solve** this issue... in my game" in its title or description body. Similarly, if a bug discussion includes terms like "requirement, design, or specification" in its title (e.g. I want to **fix** ... in my **specification**), with multiple bullet points in its description body, then it may indicate that the user is seeking help to solve an issue in a particular section of its design specification. In this case, the discussion post may be classified into the *Specification Bug* category of the *Video Game Bug* *Taxonomy*.

Our feature extensions are very flexible and can be easily combined to construct even more complex rules in the Drools language. There are also no limitations on adding new extensions to document classification in our system (**Table 4**).

| Pattern Matching | Taxonomy Categories | Examples |
|---|---|---|
| PA (SG \|\| OG) && SA | LANG, GENRE, ... | <**How to**> to do animation with <**unity3d5.2**>; An <**Educational Game**> for learning prog. Language. |
| (TT && SI) \|\| PA | SPB | It might be an issue in the <**game**> <**design**> spec. |
| PB && CS | IMB | I am using a nstimer and it has a <**bug**> with my game loop <**code**>...</**code**> |

**Table 4.** *Pattern matching rules for matching Stack Overflow discussion posts.*

*The Role of Gamification in Software Development Lifecycle*

The following code snippet shows the implementation of our Document.classify() method.

```
Server
Document
  title
  description
  tags
  ...
  classify()
    LinguisticAnalyzer.check(sentence)
      detectNamedEntities()
      detectSynonyms()
      appearsAfterPreposition()
      appearsBeforePreposition()
      isAffirmative()
      appearsInSubject()
      isSentencePostive()
    DocumentStructureAnalyzer(text)
      hasCodeSnippet()
      hasBulletPoint()
      hasImages()
```
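To make the outline above more concrete, the following Java sketch shows how the document structure checks could be computed and stored as properties of a document object. It is a simplified illustration under our own assumptions: the class name, the regular expressions, and the field layout are hypothetical, and the linguistic checks are left out because they depend on Stanford CoreNLP and WordNet.

```java
import java.util.regex.Pattern;

public class DocumentSketch {

    // Assumed markers: a <code> element, a line starting with a bullet or number, an <img> tag.
    private static final Pattern CODE_SNIPPET =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);
    private static final Pattern BULLET_POINT =
            Pattern.compile("^\\s*(?:[-*\\u2022]|\\d+[.)])\\s+\\S", Pattern.MULTILINE);
    private static final Pattern IMAGE =
            Pattern.compile("<img\\b", Pattern.CASE_INSENSITIVE);

    final String title;
    final String description;

    // Analysis results, stored on the object so that rules can read them later.
    boolean hasCodeSnippet;
    boolean hasBulletPoint;
    boolean hasImages;

    DocumentSketch(String title, String description) {
        this.title = title;
        this.description = description;
    }

    // Runs the structure analysis only; the linguistic analysis is omitted in this sketch.
    void classify() {
        hasCodeSnippet = CODE_SNIPPET.matcher(description).find();
        hasBulletPoint = BULLET_POINT.matcher(description).find();
        hasImages = IMAGE.matcher(description).find();
    }

    public static void main(String[] args) {
        DocumentSketch doc = new DocumentSketch(
                "bug in my game loop",
                "I am using a nstimer and there may be a bug. <code>...</code>");
        doc.classify();
        System.out.println("code=" + doc.hasCodeSnippet
                + " bullets=" + doc.hasBulletPoint
                + " images=" + doc.hasImages);
    }
}
```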
### *3.4.1 System service implementation*

To make the features of our implemented REST services available to the various KM-EP components, we created two new services in KM-EP. These services act as adaptors between KM-EP and its objects and our developed REST services. Each service bundles the features of the corresponding REST service and is connected with the KM-EP PHP API. The big advantage of relying on this service-based architecture is that, if we decide to change or update our REST APIs, we only need to change the KM-EP services and can leave their underlying implementations untouched.

**NER Model Trainer Service**. The NER Model Trainer Service of KM-EP is used to connect with the NER Model Trainer REST service. As already discussed in the previous sections, this component includes the creation of a NER Model preview, the preparation of a NER Model and model training. Because the NER Models are created using the NER Model Trainer component, they need to be downloaded from there into KM-EP and deleted afterwards.

**Classifier Service**. The Classifier Service of KM-EP is used for the communication between KM-EP and the NER Classify Server REST service. To handle the automatic document classification, we first need to manage the NER Models using the NER Classify Server. Then, the Classifier Service of KM-EP can trigger the execution of the operation for adding or deleting NER Models by calling the NER Classify Server. Furthermore, the Classifier Service will be able to trigger the automatic classification of documents to be suggested to the user.
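The following sketch illustrates what a client of such a REST service might look like: it assembles a JSON body from the document data, the rule, and the NER model references, and wraps it in an HTTP POST request. The endpoint path `/classify`, the JSON field names, and the hand-written JSON encoding are assumptions for illustration only; the actual KM-EP services are written in PHP, and a production client would normally use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

public class ClassifyRequestSketch {

    // Minimal JSON string escaping (backslash, quote, newline); enough for this sketch only.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\")
                           .replace("\"", "\\\"")
                           .replace("\n", "\\n") + "\"";
    }

    static String array(List<String> values) {
        return values.stream()
                .map(ClassifyRequestSketch::quote)
                .collect(Collectors.joining(",", "[", "]"));
    }

    // Field names are hypothetical; the chapter only lists which information is sent.
    static String toJson(String title, String description, List<String> tags,
                         String rule, List<String> nerModels) {
        return "{\"title\":" + quote(title)
                + ",\"description\":" + quote(description)
                + ",\"tags\":" + array(tags)
                + ",\"rule\":" + quote(rule)
                + ",\"nerModels\":" + array(nerModels) + "}";
    }

    // Builds the request without sending it.
    static HttpRequest build(String baseUrl, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/classify"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        String json = toJson("bug in my game loop", "I am using a nstimer ...",
                List.of("cocoa-touch", "nstimer"), "rule text", List.of("example-model"));
        HttpRequest request = build("http://localhost:8080", json);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(json);
    }
}
```

Keeping the request construction in one adaptor class mirrors the design choice described above: callers depend on the adaptor, not on the details of the REST API.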

#### **3.5 Proof-of-concept**

After presenting our major use cases and showing details about our implemented components, we can now present a common use case scenario where Stack Overflow discussions about SG topics can be classified in RAGE. With an existing NER model in the system, a classification parameter definition can be created with the Classification Parameter Definition Manager component to classify discussion texts into taxonomies of the system. For instance, there may be a Stack Overflow post like this in RAGE:

Title: "bug in my game loop"

Keywords: "cocoa-touch, nstimer"

Description: "I am making a game on xcode 5. I am using a nstimer in C\# and there may be a bug in my game loop. Can you help me please. All help is great. <code>...</code>"

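According to our previous definition, we can create Drools rules to automatically classify this document into the *Video Game Bug* and *Programming Language* taxonomies. First, we start with the creation of a "Classification Parameter Definition", where we select the desired taxonomy and NER models for named entity extraction. Then, we construct our classification rules using the WHEN … THEN syntax provided by Drools. Based on the selected taxonomy, the NER models, and our rich set of feature extensions, we can easily refer to specific named entities (like C# (LANG), cocoa-touch (TOOL)) in our rule definitions and perform *Linguistic Analysis*, *Syntactic Pattern Matching*, and *Document Structure Analysis* on the document. **Figure 4** shows an example of such classification rules in the Drools language.

• Lines 6–7 (of rule 1) refer to our *WordNet* integration to detect if the term "bug" (or one of its synonyms) is included in the discussion title. Line 9 analyzes the document structure to identify if the post description includes a code snippet. Because both conditions are true, the document is automatically assigned to the *Implementation Bug* category of the *Video Game Bug* taxonomy.

• Line 19 (of rule 2) checks the syntax of the post description to identify if a named entity of type LANG appears after a preposition. Since it is true, the post is assigned to the C# category of the *Programming Language* taxonomy.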
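To illustrate the kind of conditions such rules can express for the sample post, the sketch below re-creates two checks in plain Java: a title that mentions "bug" (or a synonym) combined with a code snippet in the description, and a LANG entity that directly follows a preposition. This is not the Drools rule set from Figure 4. The synonym set, the preposition list, the LANG dictionary, and the category labels are hand-written stand-ins for WordNet, Stanford CoreNLP, the trained NER model, and the KM-EP taxonomy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class RuleSketch {

    // Stand-in for WordNet synonym detection (assumed word list).
    static final Set<String> BUG_SYNONYMS = Set.of("bug", "defect", "glitch", "error");
    // Stand-in for part-of-speech tagging (assumed word list).
    static final Set<String> PREPOSITIONS = Set.of("in", "on", "with", "for", "at", "of");
    // Stand-in for NER output: lower-cased surface form mapped to the LANG category label.
    static final Map<String, String> LANG_ENTITIES = Map.of("c#", "C#", "java", "Java");

    // Lower-cases, splits on whitespace, and strips trailing sentence punctuation.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            String token = raw.replaceAll("[.,!?]+$", "");
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Returns the first LANG entity that directly follows a preposition, if any.
    static Optional<String> langAfterPreposition(String description) {
        List<String> t = tokens(description);
        for (int i = 1; i < t.size(); i++) {
            if (PREPOSITIONS.contains(t.get(i - 1)) && LANG_ENTITIES.containsKey(t.get(i))) {
                return Optional.of(LANG_ENTITIES.get(t.get(i)));
            }
        }
        return Optional.empty();
    }

    static List<String> classify(String title, String description) {
        List<String> categories = new ArrayList<>();
        boolean mentionsBug = tokens(title).stream().anyMatch(BUG_SYNONYMS::contains);
        if (mentionsBug && description.contains("<code>")) {
            categories.add("Video Game Bug / Implementation Bug");
        }
        langAfterPreposition(description)
                .ifPresent(lang -> categories.add("Programming Language / " + lang));
        return categories;
    }

    public static void main(String[] args) {
        String description = "I am making a game on xcode 5. I am using a nstimer in C# "
                + "and there may be a bug in my game loop. <code>...</code>";
        System.out.println(classify("bug in my game loop", description));
    }
}
```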


**Figure 4.** *Selected categories and their rules.*

To make it easier for the user to test the created rules, we implemented a test form. The user can input some text, execute the classification parameter definition, and see a classification report with the results of the annotation and classification process. There is also a visualization of the NLP features detected by Stanford CoreNLP, which is based on CoreNLP Brat13. The reports include the following information:

A list of persistent identifiers of the **detected categories**, an area for the detected **sentences** with the results of the Stanford CoreNLP features, representation of detected **Parts-of-Speech**, **detected NEs**, detected **basic dependencies** and the detected **sentiment**. For further analysis the original Stanford CoreNLP output is also available in JSON format in the GUI.

NER methods) can make them easily usable in Information Retrieval by search queries or indexing of documents.

*4.1.1 Data preparation and system setting*

Duttenhofer [42] used the Stanford CoreNLP for model training with the following data sets to train NER models in the medical context.

• CoNLL2003 ("english-training-data.txt"): a reference data set used to evaluate NER systems dealing with English documents.

• The NE dictionary Medical Subject Headings (MeSH) ("training-data.txt"). A dictionary (or thesaurus) of standard medical terms.

• User Relevance Feedback (URF) ("urf1.txt, urf2.txt, urf3.txt"). A set of known emerging Named Entities (eNEs) provided by experts in the medical field.

Data sets from CoNLL2003 and MeSH were selected and combined with three different variants of URF data sets. The following listing shows the parameters used for model training using Stanford CoreNLP.

map=word=0,answer=1

useClassFeature=true

useNeighborNGrams=true

maxLeft=1

useWord=true

useNGrams=true

noMidNGrams=true

maxNGramLeng=6

usePrev=true

useNext=true

useDisjunctive=true

useSequences=true

usePrevSequences=true

useTypeSeqs=true

useTypeSeqs2=true

useTypeySequences=true

wordShape=chris2useLC

These parameters describe the methods and features required for training NER models using the machine learning-based system available in Stanford CoreNLP (see chapter 2.2). These parameters include:

• map: describes the data format of the training data. The data must be separated using tabs. Column 0 must include the word (or NE), and column 1 the corresponding label used to annotate this NE.

• maxLeft: The number of words to be used as contextual feature for observing words on the left of the current word during the model training [6].

• useClassFeature: The "NE class" should be used as an additional feature during training.

• useWord: Each "word" of the text corpus should be used as an additional feature during training.