**1. Introduction**

Banana (*Musa* spp.) is one of the most produced and consumed fruits globally and comprises an edible pulp and a peel [1, 2]. Between 2000 and 2016, world banana production was distributed across Asia (52.8%), America (26.6%), Africa (17.8%), Oceania (1.4%) and Europe (0.4%). Green banana flour (GBF) from the pulp is rich in vitamins A and C, glutathione, flavonoids and phenolic compounds with potent antioxidant activity [3]. Banana peels are rich in phenolics and are a good source of antioxidants [1]. Banana peel and unripe banana fruit are rich in dietary fibre and indigestible carbohydrates, proteins, essential amino acids, cellulose, hemicellulose, lignin, starch, resistant starch, polyunsaturated fatty acids and potassium [4]. The interest in GBF relates to its high content of resistant starch (40.9–58.5%) and dietary fibre (6.0–15.5%) as well as bioactive compounds [4]. The high resistant starch content might contribute to controlling glycemic index, cholesterol, gastric fullness, intestinal regularity and fermentation by intestinal bacteria, which produces short-chain fatty acids that can prevent cancer in intestinal cells [5]. In recent years, the health benefits of banana, in addition to its sensory properties, have encouraged the production of innovative food products. These scientific results have been communicated as scientific papers containing unstructured data, which combine free-flowing natural language with domain-specific terminology and numeric phrases [6]. Manually abstracting information from these papers for a literature review is labour-intensive and slow, and is a considerable source of error and data corruption. Hence, scientific papers are attractive targets for the development of machine processes for automatic information extraction [6] using text mining.

Text mining uses Natural Language Processing (NLP) tools for the automatic discovery of previously unknown information from unstructured data [6, 7]. It typically consists of four stages: (a) information retrieval, in which a set of textual materials on a given topic is gathered; (b) entity recognition, in which textual features are identified in the gathered texts; (c) information extraction, which aims to extract relationships among the recognised textual features, such as the occurrence and co-occurrence of specific terms (indexing); and (d) knowledge discovery, in which the extracted relationships are used to identify useful patterns in the data set [8]. Network analysis is a technique from sociology used to study relationships and community structures in social data. It has since been applied in other fields such as bioinformatics, where it is used to find key molecular markers and communities within an interaction network [8]. It can also be used to study the co-occurrence of specific terms.
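As an illustration of the information extraction stage, the following minimal Python sketch counts how many sentences contain each pair of terms. It is not the KNIME workflow used in this chapter: the function name `sentence_cooccurrence`, the naive sentence splitter and the sample text are all assumptions made for illustration only.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrence(text, vocabulary):
    """Count, for each unordered term pair, the sentences containing both terms."""
    vocab = {term.lower() for term in vocabulary}
    counts = Counter()
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        present = sorted(tokens & vocab)  # sorted so each pair has one canonical order
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical sample text, not taken from the analysed articles.
sample = (
    "Banana flour improved the texture of bread. "
    "The flavour of banana bread was accepted by consumers. "
    "Resistant starch was high in the flour."
)
pairs = sentence_cooccurrence(sample, ["banana", "bread", "flour", "flavour", "starch"])
```

Each key of `pairs` is an alphabetically ordered term pair, so the result can be read directly as a weighted edge list for a term network.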

The text processing feature of the Konstanz Information Miner (KNIME) can read and process textual data and transform it into numerical data (document and term vectors), such as a term co-occurrence adjacency matrix, so that regular KNIME data mining nodes can be applied [9].
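To make the idea of a document vector concrete, the sketch below turns tokenised documents into count vectors over a shared vocabulary, with an optional relative term frequency (term count divided by document length). This is a plain Python illustration, not the KNIME node implementation; the function name `document_vectors`, the definition of relative frequency used here and the toy documents are assumptions.

```python
from collections import Counter


def document_vectors(tokenised_docs, relative=False):
    """Return (vocabulary, vectors) with one vector of term weights per document."""
    vocabulary = sorted({term for doc in tokenised_docs for term in doc})
    vectors = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        # Assumption: relative frequency = term count / number of tokens in the document.
        scale = len(doc) if (relative and doc) else 1
        vectors.append(
            [counts[term] / scale if relative else counts[term] for term in vocabulary]
        )
    return vocabulary, vectors


# Hypothetical, already pre-processed documents.
docs = [
    ["banana", "flour", "bread", "banana"],
    ["banana", "peel", "fibre"],
]
vocab, vecs = document_vectors(docs)
rel_vocab, rel_vecs = document_vectors(docs, relative=True)
```

Once documents are in this numeric form, standard data mining steps such as filtering, clustering or topic modelling can operate on them.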

The objective of this work is to process unstructured textual information related to banana sensory properties using a text mining and network analysis approach, in order to extract knowledge such as the use of banana in innovative food products, and to visualise the associated relationships with banana sensory attributes and consumer acceptability.

#### **2. Topic detection and network analysis methodology**

Published articles (106) retrieved from the PubMed database for 'banana sensory' were loaded into the KNIME (Konstanz Information Miner) software using the PubMed document parser. Some of the articles were found to be unrelated to the topic of interest. Hence, the documents were indexed for 'banana' in the titles using the table indexer and index query nodes, resulting in 28 relevant documents. The texts were tagged with OSCAR (Open Source Chemistry Analysis Routines, an open-source extensible system for the automated annotation of chemistry in scientific articles) chemical named entities using the Oscar Tagger node and pre-processed by filtering and stemming. They were then transformed into a bag of words, which was filtered again so that only terms with a relative frequency from 0.02 to 1.0 were used as features (**Figure 1a**). The term co-occurrence counter node was used to count the number of co-occurrences in sentences, after which the documents were transformed into document vectors. Terms co-occurring in at least four sentences were converted to a node adjacency matrix (**Figure 1b**) and imported into Gephi Graph Visualisation and Manipulation software version 0.02. Network statistics such as modularity class, degree centrality, betweenness centrality and closeness centrality were estimated. Degree centrality counts the direct connections of each node in the network: the more direct connections a term has, the more power it has in the network and so the more important it is. Betweenness centrality reflects the ability of a node to control communication between other nodes and the flow of resources in the network. Closeness centrality measures how close a node is to the others in the network and reflects its ability to avoid being controlled by other nodes. The Latent Dirichlet Allocation (LDA) node, which uses the MALLET (MAchine Learning for LanguagE Toolkit) topic modelling library, was applied to extract relevant information from the unstructured text (**Figure 1c**).

**Figure 1.** *KNIME network for banana sensory data mining: (a) OSCAR tagging, (b) term co-occurrence and (c) knowledge detection.*

*Integrating Text Mining and Network Analysis for Topic Detection from Published Articles… DOI: http://dx.doi.org/10.5772/intechopen.84857*

*Banana Nutrition - Function and Processing Kinetics*
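The network statistics described in the methodology can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not Gephi's implementation: the helper names, the normalisation choices (degree divided by n − 1, unnormalised betweenness, closeness computed over reachable nodes) and the co-occurrence counts are hypothetical. Only the threshold of four sentences comes from the text.

```python
from collections import deque


def build_graph(cooccurrence, min_sentences=4):
    """Undirected adjacency sets for term pairs co-occurring in >= min_sentences sentences."""
    graph = {}
    for (a, b), n in cooccurrence.items():
        if n >= min_sentences and a != b:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def degree_centrality(graph):
    """Number of direct neighbours, divided by n - 1 (an assumed normalisation)."""
    n = len(graph)
    return {v: (len(nbrs) / (n - 1) if n > 1 else 0.0) for v, nbrs in graph.items()}


def _bfs(graph, source):
    """Breadth-first search returning distances, shortest-path counts,
    predecessors on shortest paths and the visit order."""
    dist = {source: 0}
    sigma = {v: 0 for v in graph}
    sigma[source] = 1
    preds = {v: [] for v in graph}
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of shortest-path distances to them."""
    result = {}
    for v in graph:
        dist, _, _, _ = _bfs(graph, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness_centrality(graph):
    """Unnormalised betweenness via Brandes' dependency accumulation."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        _, sigma, preds, order = _bfs(graph, s)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):  # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in score.items()}


# Hypothetical sentence-level co-occurrence counts.
cooccurrence = {
    ("banana", "flour"): 6,
    ("banana", "bread"): 5,
    ("banana", "flavour"): 4,
    ("bread", "flour"): 3,  # below the threshold, so no edge is created
    ("aroma", "flavour"): 4,
}
graph = build_graph(cooccurrence)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
betweenness = betweenness_centrality(graph)
```

In this toy network 'banana' has the highest value on all three measures, which is the kind of pattern used to identify key terms in a co-occurrence network.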
