detected by Stanford CoreNLP, which is visualized based on the CoreNLP Brat demo<sup>13</sup>. The reports include the following information: a list of persistent identifiers of the detected **categories**, an area for the detected **sentences** with the results of the Stanford CoreNLP features, a representation of the detected **Parts-of-Speech**, the detected **NEs**, the detected **basic dependencies**, and the detected **sentiment**. For further analysis, the original Stanford CoreNLP output is also available in JSON format in the GUI.

<sup>13</sup> https://github.com/stanfordnlp/CoreNLP/tree/master/src/edu/stanford/nlp/pipeline/demo
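The JSON view corresponds to the standard Stanford CoreNLP output format. As a minimal sketch of how such a report can be produced (the Stanford classes and the `jsonPrint` call are standard CoreNLP API; the pipeline configuration and input text are illustrative, not our production setup):

```java
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.JSONOutputter;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NerReportJson {
    public static void main(String[] args) throws Exception {
        // Annotators matching the report contents: POS tags, NEs,
        // basic dependencies, and sentence-level sentiment.
        Properties props = new Properties();
        props.setProperty("annotators",
            "tokenize,ssplit,pos,lemma,ner,parse,sentiment");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Illustrative input text only.
        Annotation document =
            new Annotation("Stack Overflow was founded by Jeff Atwood.");
        pipeline.annotate(document);

        // Serialize the full analysis to JSON, as shown in the SNERC GUI.
        System.out.println(JSONOutputter.jsonPrint(document));
    }
}
```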

**4. Evaluation of SNERC**

In the last chapter, we described the implementation of our SNERC system and presented a proof-of-concept scenario in which a machine learning NER model is used to support a rule-based classification of Stack Overflow discussions into taxonomies used in the domain of serious games. The concepts, models, designs, specifications, architectures, and technologies used in chapter 3 have demonstrated the feasibility of this prototype.

Now, we need to evaluate our developed system and show that it is usable, useful, effective, and efficient. Therefore, this chapter presents the different evaluations we conducted to assess these aspects of SNERC. There are several evaluation methods that can be used to evaluate software systems.

Our first evaluation is introduced to test the functionality of our NER system, as it is the basic component used for NE recognition and classification, and also for supporting automatic document classification in RAGE. Thus, we use a standard text corpus to train a set of NER models and compare our evaluation values with another system that is also based on Stanford CoreNLP. We use a text corpus from the medical area to demonstrate the cross-domain portability of our approach. *Precision*, *recall*, and *F1* are applied in this evaluation, as they are the standard evaluation parameters for comparing machine learning-based NER models.
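For reference, with TP, FP, and FN denoting the true positives, false positives, and false negatives for an entity class, these measures are defined in the usual way:

$$
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
$$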

Our second evaluation relies on the "Cognitive Walkthrough" [41] approach, which is a usability inspection method for identifying potential usability problems in interactive systems. This approach focuses on how easy it is for a user to accomplish a task with little or no formal instruction or informal coaching. We have used this method to identify possible issues in the SNERC user interface while working through a series of tasks to perform NER and classify textual documents using business rules.

**4.1 Comparison with a standard corpus**

In this section, we describe the functional evaluation of our Stanford-based NER system and demonstrate the reproducibility of our approach in the medical research area. We refer to different text corpora previously used in the medical domain to train NER models with our system. Then, we compare our training results with another Stanford-based NER system applied to the same data set. Our system is compared with the work of [42], where various NER models for discovering emerging named entities (eNEs) were trained and applied in medical Virtual Research Environments (VREs). As stated in Section 2.2, eNEs in medical environments are new research terms that are already in use in the medical literature but are widely unknown to medical experts. The automatic recognition of eNEs (using machine learning-based NER) can therefore help medical experts to discover such new terms early.


#### *4.1.1 Data preparation and system setting*


Duttenhöfer [42] used Stanford CoreNLP to train NER models in the medical context. Data sets from CoNLL2003 and MeSH were selected and combined with three different variants of URF data sets. The following listing shows the parameters used for model training with Stanford CoreNLP.

```
map = word=0,answer=1
maxLeft = 1
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
useNeighborNGrams = true
usePrev = true
useNext = true
useDisjunctive = true
useSequences = true
usePrevSequences = true
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
```

These parameters describe the methods and features required for training NER models with the machine learning-based system available in Stanford CoreNLP (see Section 2.2). They include, for example:

• useNGrams: derive features from n-grams, i.e., substrings of the word.

• Other features control the word shape, like useTypeSeqs (for upper/lower case), useTypeSeqs2, and useTypeySequences; wordShape defines the word shape function to be used (here "chris2useLC").
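As a minimal sketch of how these flags drive model training, the snippet below loads such a properties file and trains a CRF model with Stanford CoreNLP's `CRFClassifier`. The Stanford classes and the `trainFile`/`serializeTo` properties are standard; the file names are hypothetical stand-ins for our setup.

```java
import java.util.Properties;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;

public class TrainNerModel {
    public static void main(String[] args) throws Exception {
        // Load the feature flags listed above from a properties file
        // ("urf1.prop" is a hypothetical file name for this sketch).
        Properties props = StringUtils.propFileToProperties("urf1.prop");

        // Per map=word=0,answer=1, the training file is tab-separated,
        // one token per line, e.g.:
        //   aspirin    O
        //   ibuprofen  NE
        // Here we assume the input files were merged beforehand.
        props.setProperty("trainFile", "training-data.txt");
        props.setProperty("serializeTo", "SNERC_classifier_urf1.ser.gz");

        CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
        crf.train();  // trains the CRF on trainFile with the flags above
        crf.serializeClassifier(props.getProperty("serializeTo"));
    }
}
```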


We use the same list of parameters for training the three models (classifierURF1, classifierURF2, and classifierURF3) initially developed in [42] (see **Table 5**). For testing these models, we use the data set "test-document-with-O-and-NE-and-eNE-replaced1.tok", which is an updated version of the MeSH data set used by Duttenhöfer.

| Generated model names | Renamed models | Text corpus |
| --- | --- | --- |
| d3dbc3839dx | *SNERC\_classifier\_urf1* | training-data.txt, english-training-data.txt, urf1.txt |
| x5dhgfb33gh | *SNERC\_classifier\_urf2* | training-data.txt, english-training-data.txt, urf2.txt |
| bc8ac12fgdb | *SNERC\_classifier\_urf3* | training-data.txt, english-training-data.txt, urf3.txt |

**Table 5.**
*Generated classifier names and text corpus for training.*

#### *4.1.2 Model training with SNERC*

To train the same models developed by Duttenhöfer [42], we first defined three "NER Model Definitions" in our SNERC system. The data sets used in [42] are already annotated; thus, there is no need to upload a new data dump or to use our automatic annotation tool to generate training and testing data. We also skipped the step for cleaning up the data dump (removal of HTML tags, code snippets, URLs, etc.). We continued by adding all the parameters for model training in the tab "Training Properties", where each of them can easily be changed if needed. Then, we clicked on "Prepare NER Model" in the tab "Train Model" to prepare our models. Our model preparation function generated three documents representing the prepared models, which we renamed to remain consistent with our input data (see **Table 5**). The input documents used for training in Duttenhöfer ("training-data.txt", "english-training-data.txt", "urf1.txt", "urf2.txt", "urf3.txt") were combined and uploaded to the respective prepared models. Then, we uploaded the annotated document "*test-document-with-O-and-NE-and-eNE-replaced1.tok*" for testing to the generated models. Finally, the training process was triggered using a job. **Figure 5** shows the final result of our trained models using SNERC, which also displays the evaluation values precision, recall, and F1 (**Table 6**).

**Figure 5.**
*SNERC evaluation of the Duttenhöfer trained models.*

| Classifier | Precision | Recall | F1 |
| --- | --- | --- | --- |
| *classifierURF1* | 93.52% | 55.96% | 70.02% |
| *classifierURF2* | 98.92% | 75.90% | 85.89% |
| *classifierURF3* | 97.18% | 95.29% | 96.22% |

**Table 6.**
*Evaluation results of Duttenhöfer [42].*
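For reference, the per-entity precision, recall, and F1 values that SNERC reads from the Stanford CoreNLP log could also be reproduced outside the GUI. The sketch below (the Stanford API calls are standard; the model file name follows Table 5) loads a serialized model and scores it against the annotated test document:

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class EvaluateNerModel {
    public static void main(String[] args) throws Exception {
        // Load the serialized model produced by the training job.
        CRFClassifier<CoreLabel> crf =
            CRFClassifier.getClassifier("SNERC_classifier_urf1.ser.gz");

        // Classify the annotated test document; with the final flag set,
        // Stanford NER also prints precision, recall, and F1 per entity
        // class (here: NE, eNE) to the console.
        crf.classifyAndWriteAnswers(
            "test-document-with-O-and-NE-and-eNE-replaced1.tok",
            crf.makeReaderAndWriter(), true);
    }
}
```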

#### *4.1.3 Result*

**Table 7** shows the evaluation values of our trained models and the comparison with the system of Duttenhöfer [42]. We have used a text corpus previously used in the medical area to train three different NER models and show the cross-domain portability of our approach. As can be seen, all the models trained with SNERC have the same evaluation values as in the reference work, since both systems rely on Stanford CoreNLP for machine learning-based NER. We also note that all the evaluation values shown in **Figure 5** are automatically computed by SNERC and can be read in the log output of the "NER Model Definition Manager" component (see Section 3.4). This feature is always available and can be used to check the performance of a model during the preparation or training process.

| Classifier | Entity | Duttenhöfer P | Duttenhöfer R | Duttenhöfer F1 | SNERC P | SNERC R | SNERC F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *classifierURF1* | NE | 93.52% | 99.51% | 96.42% | 93.52% | 99.51% | 96.42% |
| | eNE | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| | total | 93.52% | 55.96% | 70.02% | 93.52% | 55.96% | 70.02% |
| *classifierURF2* | NE | 98.50% | 97.04% | 97.77% | 98.50% | 97.04% | 97.77% |
| | eNE | 100.00% | 48.73% | 65.53% | 100.00% | 48.73% | 65.53% |
| | total | 98.92% | 75.90% | 85.89% | 98.92% | 75.90% | 85.89% |
| *classifierURF3* | NE | 98.47% | 95.07% | 96.74% | 98.47% | 95.07% | 96.74% |
| | eNE | 95.57% | 95.57% | 95.57% | 95.57% | 95.57% | 95.57% |
| | total | 97.18% | 95.29% | 96.22% | 97.18% | 95.29% | 96.22% |

**Table 7.**
*Comparison of evaluation values (precision, recall, F1) between the SNERC and Duttenhöfer systems.*


**4.2 Cognitive walkthrough**

After implementing SNERC, we need to evaluate the usability of the system. There are several evaluation methods available for this task. Automated and formal methods test a system with a computer program, based on a formal specification or on formal models. As it is difficult to create such a specification or model, we will not use one of these methods. Empirical methods involve a group of potential users of the system who perform common tasks in it; such an evaluation is very resource-intensive and therefore not appropriate for our purpose. Informal methods are based on the knowledge and experience of the evaluating persons. It is known that these methods produce good results and detect many problems in a given system. At the same time, they are not very difficult or expensive to carry out, so they are a good approach for our project. One of these informal inspection methods is the "Cognitive Walkthrough" [41], where a group of experts simulates a potential user of the system. The group navigates the system and tries to perform the typical steps a user would take to achieve their goals. Potential problems and defects are documented and solved.

Afterwards, the cognitive walkthrough may be repeated. We chose the cognitive walkthrough as an appropriate evaluation method for our system.

Our evaluation was performed in two steps. First, we performed a cognitive walkthrough in a collaborative meeting with three experienced experts: **Expert 1** is a very experienced professor and has for many years been Chair of Multimedia and Internet Applications in the Department of Mathematics and Computer Science at the FernUniversität in Hagen. **Expert 2** holds a PhD and was significantly responsible for the concept and design of KM-EP. **Expert 3** is a PhD student researching in the area of serious games and named entity recognition.

First, the menu structure of SNERC was navigated exploratively to simulate the navigation of a potential user in the system. Then, each SNERC component was tested. Finally, the creation of an automated classification was evaluated. Within these steps, a total of eight defects were detected that needed to be fixed. Then, a second evaluation was performed. We extended the expert group by two new evaluators: **Expert 4** is a PhD student researching in the medical area and emerging named entity recognition. **Expert 5** is a PhD student researching in the area of advanced visual interfaces and artificial intelligence.

Within the second cognitive walkthrough, all typical steps were performed as a potential user would perform them. No further defects were detected. Expert 4 pointed to the problem of unrealistic performance indicators due to overfitting. This concern could be addressed by the possibility to supervise and edit the automatically generated testing data within the NER Model Manager. A further note was that SNERC may not be suitable for dealing with huge data sets because of its web-based GUI architecture. As KM-EP does not deal with such huge data sets, this is not a real problem for our approach.

We observed that this informal evaluation method led to many results with a limited amount of time and resources. Nevertheless, an empirical evaluation with a bigger group of potential users should be conducted to further prove the usability and robustness of the system.

**5. Conclusion and final discussion**

In this research, we presented a system for named entity recognition and automatic document classification that was integrated into an innovative Knowledge Management System for Applied Gaming. After presenting various real-world use case scenarios, we demonstrated that it is possible to support users in the process of automatic document classification by combining techniques such as semantic analysis, natural language processing (like named entity recognition), and a rule-based expert system. Our NER system was validated using the standard metrics for machine learning models. We demonstrated the portability of this system by using standard text corpora for model training and testing in various domains. Our overall system, consisting of both the NER and the document classification system, has been successfully integrated into the target environment and was validated using the Cognitive Walkthrough method. A future evaluation with a bigger group of potential users may help to gather further insights about the usage, usability, and error handling of the entire system.

**Author details**

Philippe Tamla<sup>1</sup>\*†, Florian Freund<sup>1</sup>† and Matthias Hemmje<sup>2</sup>

1 Faculty of Multimedia and Computer Science, Hagen University, Germany

2 Research Institute for Telecommunication and Cooperation, Dortmund, Germany

\*Address all correspondence to: philippe.tamla@fernuni-hagen.de

† These authors contributed equally.

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

