#### **3.1 Information visualization of a single geological document**

Content-word nodes and the links between them carry the information and knowledge in a document. In large open knowledge graphs, key information is stored as triples. Bigrams are also widely used to represent text information. Wang et al. [7] used a bigram graph to represent a single geological document.

The visualization was built from three variables: "from," "to," and "weight." The "from" and "to" variables record the order of content words in the content-word corpus: in each content-word pair, the first word is the "from" variable and the second is the "to" variable. The weight of a pair is its co-occurrence frequency. The bigram graph then visualizes the content words as nodes and the pairs as links.
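The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by Wang et al.: the function name `build_bigram_edges`, the `min_weight` parameter, and the toy word list are assumptions made for the example. Only the `from`, `to`, and `weight` fields come from the text.

```python
from collections import Counter


def build_bigram_edges(content_words, min_weight=1):
    """Count adjacent content-word pairs and return them as edge records.

    Each edge has a "from" word, a "to" word, and a "weight" equal to the
    number of times the pair occurs in sequence. `min_weight` is a
    hypothetical filter for dropping rare pairs before plotting.
    """
    pair_counts = Counter(zip(content_words, content_words[1:]))
    edges = [
        {"from": first, "to": second, "weight": count}
        for (first, second), count in pair_counts.items()
        if count >= min_weight
    ]
    # Heaviest links first; ties are broken alphabetically for stable output.
    edges.sort(key=lambda e: (-e["weight"], e["from"], e["to"]))
    return edges


# Toy content-word sequence (illustrative only, not taken from a real report).
words = ["magnetic", "anomaly", "gravity", "anomaly",
         "magnetic", "anomaly", "inversion"]
edges = build_bigram_edges(words)
for edge in edges:
    print(edge["from"], "->", edge["to"], edge["weight"])
```

The resulting edge list can be handed to any graph-drawing library. Raising `min_weight` keeps only frequently co-occurring pairs, which helps keep a plotted graph readable.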

In geological exploration, anomaly information from geology, geochemical exploration, geophysical exploration, and remote sensing provides important clues for mineral prospecting [46]. Because they describe different kinds of anomalies, geological exploration documents differ markedly in word frequency. **Figure 5** shows the main information hidden in a single geophysical exploration document. In this visualization, geological terms (e.g., *aeromagnetic*, *gravity*, *magnetic*) and geophysical data processing terms (e.g., *inversion*, *horizontal gradient*, *information*) are all linked to the term *anomaly*. The visualization thus exposes the key knowledge hidden in the geological document.

**Figure 5.**

*Bigram graph of content words in the whole document, representing the key information in the geological report (n > 10).*

*Text Mining to Facilitate Domain Knowledge Discovery DOI: http://dx.doi.org/10.5772/intechopen.85362*

#### **3.2 Geological text mining for discovering ore prospecting clues**

Geological research not only reveals the evolution of the Earth and improves our understanding of it but is also closely tied to human society. One important role of applied geology is to discover mineral deposits and provide raw materials for economic construction and development. Over the long course of geological history, mineral deposits formed during large-scale geological events and were buried deep in the Earth's crust. If a deposit has not been destroyed by weathering and erosion after mineralization, there are ways to discover it. In earlier days, geologists discovered mineral deposits by identifying rock outcrops associated with mineralization. Later, as technology developed, geochemical exploration, geophysical exploration, and remote sensing were also used to improve the results of mineral prospecting and exploration. In recent years, GIS-based and three-dimensional mineral prospect mapping has been used in mineral exploration. Through these technologies, multisource anomalies, such as geochemical, geophysical, geological, and remote sensing anomalies, can be identified.

Anomaly information is usually derived from structured numeric data, but such data are only one part of geological big data. Most geological big data are unstructured, such as text and images. Mineral exploration has so far depended mainly on information derived from structured numeric data. Yet some important information for mineral prospecting and exploration is hidden in unstructured text, such as host rock, alteration types, geological setting, ore-controlling factors, geochemical and geophysical anomaly patterns, and location. Extracting and identifying this favorable information from geological literature is a major challenge for conventional research methods. NLP-based text mining offers a way to address this challenge.

Li et al. [35] used a CNN method to classify geological text data into four categories (geology, geophysics, geochemistry, and remote sensing) on three scales (word, sentence, and paragraph). Their work extended earlier work on Chinese word segmentation and text preprocessing to the domain of mineral exploration. The four categories represent four types of mineral exploration information. Of the three scales, the sentence scale performed best. In their work, the *precision*, *recall*, and *F-score* of text classification reached 93.68%, 93.50%, and 92.68%, respectively. A co-occurrence matrix was then used to extract content words and their relationships as nodes and links from the classification results and to visualize the information in a knowledge graph. In this way, four categories of favorable information for mineral prospecting and exploration were expressed in a bigram graph and a chord graph.
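The co-occurrence step can be illustrated with a short Python sketch. It is not the procedure of Li et al.: counting co-occurrence within a sentence is an assumption made here (the text does not state the window), and the function names and toy sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how often two distinct content words share a sentence.

    `sentences` is a list of content-word lists. Keys are alphabetically
    ordered word pairs, so the result is the upper triangle of a symmetric
    co-occurrence matrix.
    """
    counts = Counter()
    for sentence in sentences:
        unique_words = sorted(set(sentence))
        counts.update(combinations(unique_words, 2))
    return counts


def to_matrix(counts):
    """Expand pair counts into a symmetric matrix with a word index."""
    vocab = sorted({word for pair in counts for word in pair})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for (a, b), n in counts.items():
        matrix[index[a]][index[b]] = n
        matrix[index[b]][index[a]] = n
    return vocab, matrix


# Toy sentences, assumed to be already classified as "geophysics".
geophysics_sentences = [
    ["aeromagnetic", "anomaly", "inversion"],
    ["gravity", "anomaly"],
    ["aeromagnetic", "anomaly"],
]
counts = cooccurrence_counts(geophysics_sentences)
vocab, matrix = to_matrix(counts)
# Words become graph nodes; nonzero cells become weighted links.
links = [(a, b, n) for (a, b), n in sorted(counts.items())]
print(vocab)
print(links)
```

Running the same functions once per category would give one node-and-link set for each of the four classes, which is the kind of input a bigram graph or chord graph needs.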

#### **3.3 Geological text mining to assist database construction and knowledge discovery**

A microfossil about 4280 million years old, found in Quebec, Canada, may be the oldest fossil found so far [47]. Throughout the Earth's history, biological evolution has corresponded closely with geological evolution. Life depends on specific physical and chemical conditions, such as oxygen content and temperature. In other words, different biotypes and biocenoses indicate different environmental conditions on Earth. Fossils formed along with their sedimentary environment and are the footprints left by the biosphere. Each fossil records some biological information, such as morphology and living environment. Paleontologists study fossils to explore the evolution of the Earth's environment. A single fossil cannot reveal biological and geological evolution; conclusions about such evolution rest on comparative studies of fossils from different geological times and settings.

The Paleobiology Database (PBDB; http://paleobiodb.org) contains systematic and detailed fossil information, which makes it essential infrastructure for comparative fossil research. Founded nearly two decades ago, the PBDB is one of the most successful fossil databases and has become an open and active community serving different research agendas. Initially, fossil records in the PBDB came from original fieldwork or were extracted manually from published literature. With the rapid development of digital publication, manual entry of fossil information became tedious and inefficient and could not keep up with the massive amounts of new and legacy publications. To address this challenge, PaleoDeepDive [46], a machine reading and learning system, was developed to extract fossil information from literature. The system uses factor graphs and NLP technologies to identify fossil entities and their semantic relationships, and the extracted results are stored as triples in a knowledge base. Compared with manual fossil data entry, the output of PaleoDeepDive has an obvious advantage in quantity. Moreover, the trends derived from it (e.g., taxonomic diversity and genus-level turnover) correspond closely to those from the manually entered data [48]. The extracted fossil records have been used to update the PBDB. The PBDB is now more than a paleobiology database: it also provides a WebGIS-based interface for fossil information retrieval and query, as well as an R library, an API, and a mobile app for researchers and the general public. Based on the PBDB, a series of high-quality research papers have been published that improve our understanding of the Earth. For instance, Peters et al. [49] analyzed the rise and fall of stromatolites in North America and divided the marine environment into three phases based on changes in stromatolites.
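To make the triple representation concrete, the following Python sketch stores a few fossil facts as (subject, predicate, object) triples and derives a simple genus count per time interval. It does not reproduce the PaleoDeepDive or PBDB schema: the predicate names, the helper functions `match` and `genus_diversity`, and the handful of toy records are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Toy facts as (subject, predicate, object) triples.
# The predicate names are hypothetical, not an actual knowledge-base schema.
triples = [
    ("Tyrannosaurus rex", "belongs_to_genus", "Tyrannosaurus"),
    ("Tyrannosaurus", "occurs_in_interval", "Cretaceous"),
    ("Triceratops", "occurs_in_interval", "Cretaceous"),
    ("Allosaurus", "occurs_in_interval", "Jurassic"),
    ("Allosaurus", "found_in_formation", "Morrison Formation"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def genus_diversity(triples):
    """Count distinct genera recorded in each time interval."""
    genera = defaultdict(set)
    for genus, _, interval in match(triples, predicate="occurs_in_interval"):
        genera[interval].add(genus)
    return {interval: len(names) for interval, names in genera.items()}


diversity = genus_diversity(triples)
print(diversity)
print(match(triples, subject="Allosaurus"))
```

Because every fact has the same three-part shape, automatically extracted records and manually entered ones can be queried with the same pattern matching, which is what makes comparisons such as diversity curves straightforward.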

The application of GeoDeepDive is still ongoing. Macrostrat (https://macrostrat.org/), a collaborative platform for geological data exploration and integration, was built on the results that GeoDeepDive extracted from massive amounts of scientific literature. By April 2018, Macrostrat contained 33,903 properties of geological units distributed across 1474 regions in North and South America, the Caribbean, New Zealand, and the deep sea; more than 180,000 geochemical and outcrop-derived measurements; all the fossil records in the PBDB; and more than 2.3 million bedrock geologic map units from over 200 map sources [50].


*Cyberspace*

