**2. Machine learning**


Many volcanoes lie close to highly populated areas, and the economic impact of their eruptions can be very strong. Stochastic forecasts of volcanic eruptions are difficult [1, 2], but deterministic forecasts (i.e., specifying when, where and how an eruption will occur) are even harder. Many volcanoes are monitored by observatories that try to estimate at least the probability of the different hazardous volcanic events [3]. Different time series can be monitored and hopefully used for forecasting, including seismic data [4], geomagnetic and electromagnetic data [5], geochemical data [6], deformation data [7], infrasonic data [8], gas data [9], and thermal data from satellites [10] and from the ground [11]. Whenever possible, a multiparametric approach is always advisable. For instance, at Merapi volcano, seismic, satellite radar, ground geodetic and geochemical data were efficiently integrated to study the major 2010 eruption [12]; a multiparametric approach is also essential to understand shallow processes such as those observed at geothermal systems, e.g., Dallol in Ethiopia [13]. Although many time series may be available, seismic data always remain at the heart of any monitoring system, which should always include the analysis of continuous volcanic tremor [14]; tremor has in fact great forecasting potential [15] due to its persistence and memory [1, 2] and its sensitivity to external triggering such as regional tectonic events [16] or Earth tides [17]. Moreover, its time evolution can be indicative of variations in other parameters, such as gas flux [18]. Other information-rich time series can be built by looking at the time evolution of the number of the different discrete volcano-seismic events that can be recorded on a volcano. These include volcano-tectonic (VT) earthquakes, rockfall events, long-period (LP) and very-long-period (VLP) events, explosions, etc. Counting the overall number of events is not enough: one has to detect them and classify them, because they are linked to different processes, as detailed below. For this reason it is important to automatically generate a different time series for each type of volcano-seismic event.

VT events can be described as "normal" earthquakes that take place in a volcanic environment and can indicate magma movement [19, 20]. LP events have great forecasting potential [21]. Their debated interpretation involves the repeated expansion and compression of sub-horizontal cracks filled with steam or other ash-laden gas [22], stick–slip magma motion [23], fluid-driven flow [24], eddy shedding, turbulent slug flow, soda-bottle analogues [25], deformation acceleration of solidified domes [26] and slow ruptures [27]. Explosion quakes are generated by the sudden extrusion of magma, ash and gas in an explosive event, often associated with VLP events [28]. Many papers also describe and count "tremor episodes" (TRE events), usually associated with magma degassing [20]. However, any volcano showing activity produces a continuous "tremor", whose detectability depends only on the sensitivity of the seismic instrumentation [29, 30]. The class "TRE" would therefore be better defined as "tremor episodes that exceed the detection limits". Of course, at volcanoes we can also record natural but non-volcanic seismic signals, such as distant tectonic earthquakes and distant explosions, as well as anthropogenic signals, e.g., due to industries, ground vehicles, or the helicopters used for monitoring.

Most volcano observatories rely on manual classification and counting of such seismic events, which suffers from human subjectivity and can become unfeasible during an unrest or a seismic crisis [31, 32]. For this reason, manual classification should be replaced by automated processing, and this is where machine learning (ML) comes into play. The same reasoning of course also applies to the automated processing of other monitoring time series, such as deformation, gas and water geochemistry, etc. Moreover, ML in volcanology is not restricted to the monitoring of active volcanoes, but has proven useful also when dealing with other large datasets. Examples include correlating volcanic units in general e.g., [33], tephra e.g., [34, 35] and ignimbrites e.g., [36], a task which may become very difficult, especially when many deposits of similar ages and geochemical and petrological characteristics are present.


ML is a field of computer science dedicated to the development of algorithms that are based on a collection of examples of some phenomenon. These examples can be natural, human-generated or computer-generated. From another point of view, ML can be seen as the process of solving a problem by building and using a statistical model based on an existing dataset [39]. ML can also be defined as the study of algorithms that allow computer programs to automatically improve through experience [40]. ML is only one of the ways in which we expect to achieve Artificial Intelligence (AI). AI has in fact a wider, more dynamic and fuzzier definition; e.g., Andrew Moore, former Dean of the School of Computer Science at Carnegie Mellon University, defined it as "the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence". ML is usually characterized by a series of steps: data reduction, model training, model evaluation, and final model deployment for the classification of new, unknown data (see **Figure 1**). The training (which is the proper learning phase) can be supervised, semi-supervised, unsupervised or based on reinforcement.

**Figure 1.**
*ML can be divided into several steps, from top to bottom. Raw data first have to be reduced by extracting short, information-rich feature vectors. These can then be used to build models that are trained, analyzed and finally used for classification of new data. The labels are present only in a (semi-)supervised approach.*

More data does not necessarily imply better results. Low-quality and irrelevant data can instead lead to worse classification performance. If for each datum we have a very high number of columns, we may wonder how many of those are really informative. A number of techniques can help us with this process of **data reduction**. The simplest include estimating the variance of each column and evaluating the correlations between columns. Each component of the vector that "survives" this phase is called a feature and is supposed to describe the data item in some way, hopefully one that makes it easier to associate the item with a given class. There are dimensionality reduction algorithms [41] whose output is a simplified feature vector that is (almost) equally good at describing the data. There are many techniques to find a smaller number of independent features, such as Independent Component Analysis (ICA) [42], Non-negative Matrix Factorization (NMF) [43], Singular Value Decomposition (SVD) [44], Principal Component Analysis (PCA) [45] and Auto-encoders [46]. Linear Discriminant Analysis (LDA) [47] uses the training samples to estimate the between-class and within-class scatter matrices, and then employs the Fisher criterion to obtain the projection matrix for feature extraction (or feature reduction).
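For instance, a minimal Python sketch of this reduction step with PCA might look as follows; scikit-learn (cited below as [73]) is assumed, and the input matrix is a synthetic stand-in for any table of raw per-item measurements:

```python
# A minimal sketch of feature reduction with PCA, using scikit-learn.
# The input matrix X (one feature vector per row) is synthetic here;
# in practice each row could describe one volcano-seismic event.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))             # 500 items, 40 raw features (synthetic)

X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scales
pca = PCA(n_components=0.95)               # keep enough components for 95% of the variance
X_red = pca.fit_transform(X_std)

print(X_red.shape)                         # the reduced feature vectors
print(pca.explained_variance_ratio_.cumsum())
```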

In **supervised learning**, the dataset is a collection of example pairs of the type (data, label), $\{(x_i, y_i)\}_{i=1..N}$. Each element $x_i$ is called a feature vector and has a companion label $y_i$. In the supervised learning approach, the dataset is used to derive a model that takes a feature vector as input and outputs a label that should describe it. For example, the feature vector of volcano-seismic data could contain several amplitude-based, spectral-based, shape-based or dynamical parameters, and the label to be assigned could be one of those described above, i.e., VT, LP, VLP. In a volcanic geochemical example, feature vectors could contain major-element weight percentages, and the labels the corresponding rock types. The reliability of the labels is often the most critical issue in the setup of a supervised ML classification scheme. Labels should therefore be assigned carefully by experts. In general, it is much better to have relatively few training events with reliable labels than to have many more, but less reliable, labeled examples.
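As a hedged illustration, a supervised classifier of this kind could be set up as follows in Python with scikit-learn; the feature vectors and the expert labels are synthetic stand-ins:

```python
# A sketch of supervised classification with an SVM (scikit-learn).
# In a real application each x_i could hold amplitude- and spectrum-based
# parameters of one event, and y_i the expert-assigned class.
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))                 # 300 events, 12 features (synthetic)
y = rng.choice(["VT", "LP", "VLP"], size=300)  # expert labels (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_tr, y_tr)                          # training = the proper learning phase
print(classification_report(y_te, model.predict(X_te)))
```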

In **unsupervised learning**, the dataset is a collection of examples without any labeling, i.e., containing only the data $\{x_i\}_{i=1..N}$. As in the previous case, each $x_i$ is a feature vector, and the goal is to create a model that maps a feature vector $x$ into a value (or another vector) that can help solve a problem. Typical examples are all the clustering procedures, where the output is the number of the cluster to which each datum belongs. The choice of the best features to use is a difficult one, and several techniques of Unsupervised Feature Selection have been proposed, with the capability of identifying and selecting relevant features in unlabeled data [48]. Unsupervised outlier detection methods [49] can also be used, where the output indicates whether a given feature vector is likely to describe a "normal" or an "anomalous" member of the dataset.
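As a sketch of the outlier-detection case, the following Python example uses an Isolation Forest, one of many possible methods and chosen here purely for illustration; no labels are involved:

```python
# A minimal sketch of unsupervised outlier detection with Isolation Forest
# (scikit-learn). The data are synthetic stand-ins for feature vectors
# extracted from a monitoring stream.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
X_normal = rng.normal(0.0, 1.0, size=(500, 8))  # "background" behavior
X_anom = rng.normal(6.0, 1.0, size=(10, 8))     # a few anomalous vectors
X = np.vstack([X_normal, X_anom])

detector = IsolationForest(contamination=0.02, random_state=0).fit(X)
flags = detector.predict(X)                     # +1 = "normal", -1 = "anomalous"
print((flags == -1).sum(), "vectors flagged as anomalous")
```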

The **semi-supervised learning** approach stands somewhere in the middle: the dataset contains both labeled (usually few) and unlabeled (usually many more) feature vectors. The basic idea is similar to supervised learning, but with the possibility of also exploiting the presence of the (many more) unlabeled examples in the training phase.
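A minimal sketch of this idea, using the self-training wrapper available in scikit-learn (one of several possible semi-supervised strategies) on synthetic data where most labels have been hidden:

```python
# A sketch of semi-supervised learning with SelfTrainingClassifier:
# a few labeled feature vectors (y >= 0) plus many unlabeled ones (y = -1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(3)
y_partial[rng.random(400) > 0.1] = -1    # hide ~90% of the labels;
                                         # -1 marks "unlabeled" in scikit-learn

base = SVC(probability=True)             # base estimator must expose predict_proba
model = SelfTrainingClassifier(base).fit(X, y_partial)
print("accuracy on all data:", accuracy_score(y, model.predict(X)))
```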

In **reinforcement learning**, the machine is "embedded" in an environment, whose state is again described by a feature vector. In each state the machine can execute actions, which produce different rewards and can cause a transition of the environment to a new state. The goal in this case is to learn a policy, i.e., a function or model that takes the feature vector as input and outputs an optimal action to execute in that state. The action is optimal if it maximizes the expected average reward. We can also say that reinforcement learning is a behavioral learning model: the algorithm receives feedback from the data analysis, guiding the user to the best outcome. The main point here is that the system is not trained with a sample dataset, but learns through trial and error; a sequence of successful decisions is therefore reinforced, because it best solves the problem at hand. The problems that can be tackled with this approach are those where decision making is sequential and the goal is long-term, such as game playing, robotics, resource management or logistics. Time is therefore explicitly used here, contrary to the other approaches, in which data items are in most cases analyzed one by one, without taking into account the time order in which they arrive.
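A self-contained toy sketch of tabular Q-learning, one classic reinforcement learning algorithm, is given below; the one-dimensional "corridor" environment is purely illustrative:

```python
# Toy reinforcement learning: tabular Q-learning on a 1-D "corridor"
# where the agent must walk right to reach a reward. Everything here
# (environment, rewards) is illustrative, not volcanological.
import numpy as np

n_states, n_actions = 6, 2               # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))      # the policy is derived from Q
alpha, gamma, eps = 0.5, 0.9, 0.1        # learning rate, discount, exploration

rng = np.random.default_rng(4)
for episode in range(500):
    s = 0
    while s != n_states - 1:             # the last state is the (terminal) goal
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        # Q-learning update: move Q[s, a] toward the observed reward plus
        # the discounted value of the best action in the next state.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("greedy action per state:", Q.argmax(axis=1))  # learned policy: go right
```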

In some domains (and volcanology is a good example) training data are scarce. In this case we can profit from knowledge acquired in another domain, using techniques known as **Transfer Learning** (TL) [50]. The basic idea is to train a model in one domain with abundant data (the original domain) and then use it as a pretrained model in a different domain with less data, followed by a fine-tuning phase using the available domain-specific data (in the target domain). This approach was applied for instance at Volcán de Fuego de Colima (Mexico) [51], Mount St. Helens (USA) and Bezymianny (Russia) [52].
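The mechanics of the freeze-and-fine-tune idea can be sketched as follows; PyTorch is an assumption here (the text does not prescribe a framework), and the "pretrained" network is a random stand-in for a real original-domain model:

```python
# A minimal sketch of transfer learning in PyTorch: freeze the layers of a
# "pretrained" model and fine-tune only a new output head on the (smaller)
# target-domain dataset. The pretrained network is a placeholder here.
import torch
import torch.nn as nn

pretrained = nn.Sequential(              # stand-in for an original-domain model
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 5),                    # original-domain head (5 classes)
)

for p in pretrained.parameters():        # freeze the pretrained feature extractor
    p.requires_grad = False

pretrained[-1] = nn.Linear(64, 3)        # new head for 3 target-domain classes
optimizer = torch.optim.Adam(pretrained[-1].parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(32, 20)                  # a synthetic target-domain mini-batch
y = torch.randint(0, 3, (32,))
for _ in range(100):                     # fine-tuning loop on target data
    optimizer.zero_grad()
    loss = loss_fn(pretrained(X), y)
    loss.backward()
    optimizer.step()
print("final fine-tuning loss:", float(loss))
```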

Among the computer languages that are most used for implementing ML techniques we can cite Python [53], R [54], Java [55], JavaScript [56], Julia [57] and Scala [58]. Many dedicated, open-source libraries are available for each of them, and many computer codes, also specialized for volcanic and geophysical data, can be found in open-access repositories such as GitHub [59].

**3. Machine learning techniques**

Extracted feature vectors can become the inputs to several different machine learning techniques. Among others, we can cite Cluster Analysis (CA) [60], Self-Organizing Maps (SOM) [61–63], Artificial Neural Networks (ANN) and Multi-Layer Perceptrons (MLP) [64–66], Support Vector Machines (SVM) [67], Convolutional Neural Networks (CNN) [51], Recurrent Neural Networks (RNN) [68], and Hidden Markov Models (HMM) [3, 31, 69–71] and their Parallel System Architecture (PSA) based on Gaussian Mixture Models (GMM) [72].

CA (**Figure 2a**) is an unsupervised learning approach aimed at grouping similar data while separating different ones, where similarity is measured quantitatively using a distance function in the space of feature vectors. Clustering algorithms can be divided into hierarchical and non-hierarchical. In the former, a tree-like structure is built to represent the relations between clusters; in the latter, new clusters are formed by merging or splitting existing ones without following a tree-like structure, just grouping the data so as to maximize or minimize some evaluation criterion. CA includes a vast class of algorithms, including e.g., K-means, K-medians, Mean-shift, DBSCAN, Expectation–Maximization (EM), clustering using Gaussian Mixture Models (GMM), Agglomerative Hierarchical, Affinity Propagation, Spectral Clustering, Ward, Birch, etc. Most of these methods are described and implemented in the open-source Python package scikit-learn [73]. The use of six different unsupervised, clustering-based methods to classify volcano-seismic events was explored at Cotopaxi Volcano [32]. One of the most difficult issues is the choice of the number of clusters into which the data should be divided; in most cases this number has in fact to be fixed a priori, before running the code. Several techniques exist to help with this choice, such as the elbow method, silhouettes, gap statistics, heuristics, etc. Many of them are described and included in the R package NbClust [74]. Problems arise when the estimates that each of them provides are contradictory.
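As a sketch of both the clustering step and the choice of the number of clusters, the following Python example runs K-means for several values of k and picks the one with the best mean silhouette, using scikit-learn [73]; the blob data are synthetic:

```python
# A minimal sketch of clustering with K-means plus a silhouette-based
# choice of the number of clusters (scikit-learn).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, n_features=8, random_state=0)

scores = {}
for k in range(2, 9):                    # K-means needs k fixed a priori
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)     # highest mean silhouette wins
print("silhouette scores:", scores)
print("chosen number of clusters:", best_k)
```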

Another approach to unsupervised classification is the SOM (**Figure 2b**), or Kohonen map [75, 76], a type of ANN trained to produce a low-dimensional, usually 2D, discretized representation of the feature vector space. The training is based on competitive and collaborative learning, using a neighborhood function to preserve the topological properties of the input space.
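A minimal sketch of a SOM, assuming the third-party MiniSom package (pip install minisom), which the text does not name and which is used here only for illustration:

```python
# A sketch of a self-organizing map with the MiniSom package (an assumption:
# any SOM implementation would do). Each feature vector is mapped to its
# best-matching unit on a 10 x 10 grid.
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 8))            # synthetic feature vectors

som = MiniSom(10, 10, input_len=8, sigma=1.5, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, num_iteration=5000)  # competitive + neighborhood learning

bmu = som.winner(X[0])                   # best-matching unit of one vector
print("first vector maps to grid cell:", bmu)
```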

A very common type of ANN, often used for supervised classification, is the MLP, which consists of at least three layers of nodes (**Figure 2c**): an input layer, (at least) one hidden layer and an output layer. Nodes use nonlinear activation functions, and the network is trained through the backpropagation mechanism. If the number of hidden layers of an ANN becomes very high, we talk of Deep Neural Networks (DNN), which are also used mainly in a supervised fashion. Among DNN, CNN (**Figure 2d**) contain convolutional layers, in which nodes apply learned filters to local portions of the input.
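Returning to the basic MLP case, a minimal sketch with scikit-learn's MLPClassifier on synthetic labeled data:

```python
# A minimal sketch of a supervised MLP classifier (one hidden layer) with
# scikit-learn; the data are synthetic stand-ins for labeled feature vectors.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=12, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32,),  # input -> one hidden layer -> output
                    activation="relu",         # nonlinear activation at the nodes
                    max_iter=1000, random_state=0)
mlp.fit(X_tr, y_tr)                            # training uses backpropagation
print("test accuracy:", mlp.score(X_te, y_te))
```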


