**2. Structuralization of geological literature**

Text data usually consist of sentences written by authors with personal understandings and opinions. Compared with metadata, text data in natural language are characterized by ambiguity, polysemy, and irregular input, which makes them difficult for computers to read and understand. It is therefore necessary to segment a piece of text into a sequence of semantic words for further computer processing. English and other Latin-script languages have relatively simple morphology, especially inflectional morphology, and words are naturally delimited by spaces; for these languages, the word segmentation task can often be ignored entirely. In contrast, some other languages, such as Chinese, place no spaces between words, and it is difficult for a computer to identify the boundary of a meaningful word or phrase [33, 34]. Methods of Chinese word segmentation have been classified into dictionary-based, statistics-based, and hybrid approaches [33]. The statistics-based methods include machine learning and deep learning methods such as the hidden Markov model (HMM), the maximum entropy Markov model (MEMM), conditional random fields (CRF), and long short-term memory (LSTM).
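The statistical segmenters named above all recast segmentation as character-level sequence labeling. A minimal sketch (with a hypothetical pre-segmented toy sentence) of the B/M/E/S tagging scheme used for this purpose:

```python
# Cast word segmentation as character-level sequence labeling:
# B = begin of a multi-character word, M = middle, E = end, S = single-character word.
def words_to_bmes(words):
    """Convert a segmented word list to per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append((w, "S"))
        else:
            tags.append((w[0], "B"))
            tags.extend((c, "M") for c in w[1:-1])
            tags.append((w[-1], "E"))
    return tags

def bmes_to_words(chars, tags):
    """Recover words from characters and predicted BMES tags."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:                      # tolerate a dangling B/M tag at sequence end
        words.append(buf)
    return words

# Toy sentence, pre-segmented: 花岗岩 (granite) / 侵入 (intrude) / 地层 (strata)
pairs = words_to_bmes(["花岗岩", "侵入", "地层"])
chars = [c for c, _ in pairs]
tags = [t for _, t in pairs]
print(pairs)                     # B/M/E for the 3-character word, B/E for the 2-character words
print(bmes_to_words(chars, tags))
```

A tagger (HMM, CRF, or LSTM) predicts the tag sequence; converting tags back to words yields the segmentation.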

From another perspective, word segmentation methods can be divided into generic and domain-specific methods according to the usage scenario. In the generic domain, because of the limitations of the learned segmentation rules, new words, especially professional terms, are treated as out-of-vocabulary and cannot be identified correctly. Geology, as a knowledge-intensive discipline, has a systematic domain-specific terminology, and most geological terms are unfamiliar to the public. Geological literature containing such terms has its own characteristics. For instance, it is usually organized in a fixed format and contains many professional geological terms that only readers with background knowledge can understand. It is dominated by descriptive sentences and has little ambiguity in information expression. Geological literature written in Chinese is further characterized by mixed writing of Chinese and English terms as well as compound terms consisting of multiple geological terms [2, 7]. Text data in natural language are sequence data; word usage and combination are influenced by the context. Based on these characteristics, machine learning methods (e.g., CRF) and deep learning neural network methods (e.g., convolutional neural networks (CNN) and LSTM) have been introduced to segment geological literature in Chinese in the recent two years, with successful results [7, 34–36].
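The out-of-vocabulary effect described above can be demonstrated with a simple dictionary-based segmenter. This is a minimal sketch of greedy forward maximum matching with two hypothetical mini-lexicons, one generic and one extended with a geological term:

```python
def forward_max_match(text, lexicon, max_len=5):
    """Greedy forward maximum matching against a dictionary.
    Characters not covered by any lexicon entry fall out as singletons,
    which is the out-of-vocabulary failure described above."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in lexicon:
                words.append(cand)
                i += length
                break
    return words

generic = {"石英", "岩石"}            # generic lexicon, no domain terms
domain = generic | {"石英闪长岩"}     # plus the geological term "quartz diorite"

text = "石英闪长岩"
print(forward_max_match(text, generic))   # term broken into fragments
print(forward_max_match(text, domain))    # term recognized as a whole word
```

With only the generic lexicon, the compound term is split apart; adding the domain term to the dictionary is the simplest form of the geological-knowledge injection discussed below.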

#### **2.1 Conditional random fields**

For a random vector (e.g., in NLP), the joint probability is a high-dimensional distribution, which exceeds the processing power of an ordinary computer and is difficult to handle during data processing. To reduce the data size, the high-dimensional distribution is factorized into a product of conditional probabilities based on an independence hypothesis [37]. The probabilistic graphical model is a graph describing the independence relationships among the variables of a high-dimensional probabilistic model, thereby reducing the computational load. Probabilistic graphical models include directed and undirected models. A directed graphical model, such as a Bayesian network, indicates causal relationships between the variables. In an undirected graphical model, such as a Markov network or CRF, the variables are mutually dependent, which is different from a causal relationship.
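The saving from such a factorization can be made concrete by counting free parameters. A small illustration (my own worked example, not from the source) for a chain of *n* binary variables:

```python
# Free parameters needed to specify a distribution over n binary variables:
# the full joint table versus the chain factorization p(x1) * prod_i p(x_i | x_{i-1}).
def full_joint_params(n):
    return 2 ** n - 1            # one probability per outcome, minus normalization

def chain_params(n):
    return 1 + 2 * (n - 1)       # p(x1): 1 value; each p(x_i|x_{i-1}): 2 (one per parent value)

for n in (5, 10, 20):
    print(n, full_joint_params(n), chain_params(n))
```

The exponential table collapses to a linear number of parameters once the Markov independence assumption is imposed, which is exactly why graphical models reduce the computer load.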


The CRF model is a discriminative graphical model, while HMM is a generative graphical model. Like the support vector machine, the CRF model creates discriminant boundaries, and it is widely used in the fields of NLP and bioinformatics. Compared with HMM and the maximum entropy model (MEM), the CRF model improves accuracy and addresses the drawback of label bias [38, 39]. Text data are unstructured sequence data. The structuralization of geological text is a process of word segmentation or named entity recognition (NER), which divides the geological text into a series of semantic words. For natural language, the text is influenced only by its context, which is consistent with the assumption of the CRF model that the variables obey the Markov property: in other words, the part-of-speech label at position *n* in NLP is related only to the word or character at position *n-1*. From the point of view of the graphical model, let *Yv* denote the random variable at node *v* of the graph *G* = (*V*, *E*); then the following equation holds:

$$p(Y\_v|X, Y\_w, w \neq v) = p(Y\_v|X, Y\_w, w \sim v) \tag{1}$$

where **X** = {*X*1, *X*2, …, *Xn*} is the word or character sequence of the text in NLP, *w* ∼ *v* denotes the neighbor nodes of node *v* in the graph, and *Y* is the part-of-speech label set {*B*, *E*, *M*, *S*}. For NLP, the graphical structure is chain-structured (**Figure 1**) [14–16].

According to the factorization of the joint probability distribution of an undirected graph, the CRF model can be written as

$$p(\mathbf{Y}|\mathbf{X}) = \frac{1}{Z(\mathbf{X})} \prod\_{i} e^{\sum\_{k} \lambda\_{k} f\_{k}(\mathbf{X}, Y\_{i-1}, Y\_{i}, i)} = \frac{1}{Z(\mathbf{X})} e^{\sum\_{i} \sum\_{k} \lambda\_{k} f\_{k}(\mathbf{X}, Y\_{i-1}, Y\_{i}, i)} \tag{2}$$

in which *i* is the node position, *k* is the index of the feature function, *λk* is the weight parameter, and *Z*(**X**) is the normalization factor. The feature function in Eq. (2) contains information on transition and state features and can be expressed as Eq. (3).

$$f = \sum\_{i=1}^{T} \sum\_{k=1}^{M} \lambda\_k f\_k(\mathbf{X}, Y\_{i-1}, Y\_i, i) \tag{3}$$
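Eqs. (1)–(3) can be illustrated by evaluating a linear-chain CRF by brute force. This is a toy sketch with two hypothetical feature functions and made-up weights, feasible only for tiny label spaces; it checks that the probabilities of Eq. (2) normalize to 1:

```python
import itertools
import math

LABELS = ["B", "M", "E", "S"]    # the segmentation label set used above

def score(x, y, weights, feats):
    """sum_i sum_k lambda_k * f_k(X, y[i-1], y[i], i), the exponent of Eq. (2)."""
    s = 0.0
    for i in range(len(x)):
        prev = y[i - 1] if i > 0 else None
        for lam, f in zip(weights, feats):
            s += lam * f(x, prev, y[i], i)
    return s

def crf_prob(x, y, weights, feats):
    """p(y|x) with Z(x) computed by enumerating every label sequence."""
    num = math.exp(score(x, y, weights, feats))
    Z = sum(math.exp(score(x, yp, weights, feats))
            for yp in itertools.product(LABELS, repeat=len(x)))
    return num / Z

# Two hypothetical feature functions: one transition feature, one state feature.
feats = [
    lambda x, p, c, i: 1.0 if p == "B" and c == "E" else 0.0,       # B -> E transition
    lambda x, p, c, i: 1.0 if i == 0 and c in ("B", "S") else 0.0,  # words start with B or S
]
weights = [1.5, 2.0]             # made-up lambda_k values

x = ["石", "英"]                 # a toy two-character input
total = sum(crf_prob(x, y, weights, feats)
            for y in itertools.product(LABELS, repeat=len(x)))
print(round(total, 6))           # probabilities over all label sequences sum to 1
```

In practice, *Z*(**X**) is computed with the forward algorithm rather than enumeration, but the brute-force version makes the normalization in Eq. (2) explicit.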

**Figure 1.** *A chain-structured CRF graph [7, 40].*

*Text Mining to Facilitate Domain Knowledge Discovery DOI: http://dx.doi.org/10.5772/intechopen.85362*

In CRF-based word segmentation, Wang et al. [7] designed a two-step workflow to segment geological literature in Chinese. First, a hybrid corpus was created by dictionary matching and manual labeling on the basis of geological literature from CNKI, a geology dictionary, the TCCGMR (the terminologies and classification codes of geology and mineral resources), and a generic corpus of Peking University. Second, a geological word segmentation model was trained on the hybrid corpus, and the trained model containing the word segmentation rules was then used to segment geological literature in Chinese. The workflow is shown in **Figure 2**.

In that study, a geology dictionary of 11,000 geological terms, the TCCGMR of 80,000 geological terms, and the generic corpus of Peking University were used to build the hybrid corpus. In this way, geological knowledge was introduced into the corpus used to train the word segmentation rules for geological literature, which is the most notable feature compared with other Chinese word segmenters. Precision, recall, and F-score were used to evaluate the performance of CRF-based word segmentation in that work, and the results are shown in **Figure 3**. The hybrid corpus combining a generic corpus and a geological corpus performs better than either the generic corpus or the geological corpus alone. The precision of the hybrid training reaches 94.14%, which is 7.84% and 0.52% higher than that of CRF-PKU and CRF-GEO, respectively. The recall of the hybrid corpus reaches 91.40%, which is 9.30% and 0.41% higher than that of CRF-PKU and CRF-GEO, respectively. The F-score of the hybrid corpus reaches 92.75%, which is 8.60% and 0.46% higher than that of CRF-PKU and CRF-GEO, respectively.
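The evaluation metrics above can be sketched for word segmentation by comparing predicted and reference word spans (the gold and predicted segmentations below are hypothetical); the same F-score formula also confirms that the reported hybrid-corpus numbers are internally consistent:

```python
def f_score(precision, recall):
    """F-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def seg_spans(words):
    """Word boundaries of a segmentation as (start, end) character spans."""
    spans, i = set(), 0
    for w in words:
        spans.add((i, i + len(w)))
        i += len(w)
    return spans

gold = seg_spans(["花岗岩", "侵入", "地层"])      # reference segmentation (toy)
pred = seg_spans(["花岗", "岩", "侵入", "地层"])  # hypothetical system output
tp = len(gold & pred)                             # correctly recovered words
precision = 100.0 * tp / len(pred)
recall = 100.0 * tp / len(gold)
print(precision, recall, round(f_score(precision, recall), 2))

# Check against the hybrid-corpus figures quoted above:
print(round(f_score(94.14, 91.40), 2))   # 92.75
```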

**Figure 2.** *Workflow of CRF-based word segmentation for geological literature in Chinese.*


**Figure 3.** *Performances of the CRF model on different corpora. CRF-PKU, generic corpus of Peking University; CRF-GEO, geological corpus; CRF-GEO + PKU, the hybrid corpus combining generic and geological corpora.*

#### **2.2 Long short-term memory**

Text data consist of a series of sequential words or characters, which can be regarded as a special kind of time series and can be processed with methods used in time series analysis. Words or characters in text data are not completely independent but are connected to and influenced by the adjacent words or characters. A neural network model contains three basic components: an input layer, hidden layers, and an output layer. The layers of an ordinary neural network are linked to each other by weights, while the nodes in the same layer are independent and have no links with each other. If ordinary neural network methods are used to process text data, the semantic information of the context is lost. The recurrent neural network (RNN) has a short memory through connections between nodes in the hidden layer, so a cell can receive information from itself and from other cells. RNN has been used in the fields of NLP and automatic speech recognition [41, 42]. However, the RNN model suffers from the vanishing gradient problem, which means it only obtains information from adjacent node positions [43]. To address this challenge, the LSTM model introduces an input gate, an output gate, and a forget gate to obtain information from distant nodes and regulate the information flow between the cells [44] (**Figure 4**).
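The gating mechanism given formally in Eqs. (4)–(8) below can be sketched as one cell step. This is a scalar toy version with made-up weights, not a trained model; each line maps to one equation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step following Eqs. (4)-(8), scalar toy version.
    W[g] = (weight on h_{t-1}, weight on x_t) for gate g; b[g] is its bias."""
    z = {g: W[g][0] * h_prev + W[g][1] * x_t + b[g] for g in ("i", "f", "o", "c")}
    i_t = sigmoid(z["i"])                            # input gate,   Eq. (4)
    f_t = sigmoid(z["f"])                            # forget gate,  Eq. (5)
    c_t = f_t * c_prev + i_t * math.tanh(z["c"])     # cell state,   Eq. (6)
    o_t = sigmoid(z["o"])                            # output gate,  Eq. (7)
    h_t = o_t * math.tanh(c_t)                       # hidden state, Eq. (8)
    return h_t, c_t

# Hypothetical tiny weights, just to run a single step.
W = {g: (0.5, -0.3) for g in ("i", "f", "o", "c")}
b = {g: 0.0 for g in ("i", "f", "o", "c")}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W, b=b)
print(h, c)   # sigmoid and tanh keep the hidden state bounded, |h| < 1
```

The forget gate scaling of *c*<sub>t-1</sub> is what lets information survive over long distances instead of vanishing.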

$$i\_t = \sigma(W\_i \cdot [h\_{t-1}, x\_t] + b\_i) \tag{4}$$

$$f\_t = \sigma(W\_f \cdot [h\_{t-1}, x\_t] + b\_f) \tag{5}$$

$$c\_t = f\_t \ast c\_{t-1} + i\_t \ast \tanh(W\_c \cdot [h\_{t-1}, x\_t] + b\_c) \tag{6}$$

$$o\_t = \sigma(W\_o \cdot [h\_{t-1}, x\_t] + b\_o) \tag{7}$$

$$h\_t = o\_t \ast \tanh(c\_t) \tag{8}$$

in which *i*, *f*, *c*, and *o* denote the input gate, forget gate, cell vector, and output gate, respectively; σ denotes the sigmoid activation function; and *W* and *b* denote the weight matrices and bias vectors, which are learned during training.

**Figure 4.** *The cell of LSTM [43, 45].*

Qiu et al. [36] proposed a geological literature segmenter based on the Bi-LSTM model. The segmenter is carried out in the following stages (more details can be found in the reference article):

1. Corpus construction: The corpus is collected and constructed from domain-generic and domain-specific texts.

2. Word grouping: Each word is grouped based on frequency and a ranking algorithm.

3. Random extraction and combination: Each group of words from the previous step is extracted and joined together randomly.

4. Training: With the previous processing, sentences are formed via combination, and the segmentation model is trained based on deep learning.

5. Testing and output: The resulting segmentation is post-processed and output.

In this research work, the significant highlight is that the training corpus is random. The segmentation rules were learned from the words and their corresponding sequences in the training corpus, which did not contain any manual label information. The precision, recall, and F-score reach 86.1%, 87.1%, and 86.6%, respectively. Based on the performance reported in the respective papers, the CRF-based method of Wang et al. [7] performs better than the Bi-LSTM-based segmenter, but the Bi-LSTM-based method has a strong ability to identify new words: its rate of out-of-vocabulary word identification reached 71.1%.

**3. Text information visualization and knowledge discovery**

#### **3.1 Information visualization of a single geological literature**

The nodes of content words and their links are the carriers of literature information and knowledge. In a large open knowledge graph, the key information is stored in a triple format. Moreover, the bigram is also widely used in text information representation. Wang et al. [7] used the bigram graph to represent a single geological literature.

The visualization was built based on the "from," "to," and "weight" variables. The variables "from" and "to" indicate the sequence of content words in the content word corpus: in each content-word pair, the former content word is defined as the "from" variable and the latter as the "to" variable. Their weights are defined by the co-occurrence frequency of the content-word pairs. The bigram graph was used to visualize the nodes of content words and their links.

In geological exploration, anomaly information from geology, geochemical exploration, geophysical exploration, and remote sensing provides important clues for mineral prospecting [46]. Because they state different anomaly information, works of geological exploration literature show significant features in terms of word frequency. **Figure 5** shows the main information hidden in a single piece of geophysical exploration literature. In this visualization, geological terms (e.g., *aeromagnetic*, *gravity*, *magnetic*) and geophysical data processing terms (e.g., *inversion*, *horizontal gradient*, *information*) are all linked to the term *anomaly*. The visualization represents the hidden key knowledge in the geological literature.

#### **3.2 Geological text mining for discovering ore prospecting clues**

Geology research not only reveals the evolution of the Earth and promotes our understanding of it but also has a close relationship with human society. One of the important roles of applied geology is to discover mineral deposits and provide raw materials for economic construction and development. In the long geological
