**3. Machine learning techniques**

Extracted feature vectors can serve as inputs to several different machine learning techniques. Among others, we can cite Cluster Analysis (CA) [60], Self-Organizing Maps (SOM) [61–63], Artificial Neural Networks (ANN) and Multi-Layer Perceptrons (MLP) [64–66], Support Vector Machines (SVM) [67], Convolutional Neural Networks (CNN) [51], Recurrent Neural Networks (RNN) [68], Hidden Markov Models (HMM) [3, 31, 69–71] and their Parallel System Architecture (PSA) based on Gaussian Mixture Models (GMM) [72].

CA (**Figure 2a**) is an unsupervised learning approach that groups similar data while separating dissimilar ones, where similarity is measured quantitatively using a distance function in the space of the feature vectors. Clustering algorithms can be divided into hierarchical and non-hierarchical. The former build a tree-like structure representing the relations between clusters, forming new clusters by merging or splitting existing ones; the latter do not follow a tree-like structure but simply partition the data so as to maximize or minimize some evaluation criterion. CA includes a vast class of algorithms, e.g., K-means, K-medians, Mean-shift, DBSCAN, Expectation–Maximization (EM), clustering using Gaussian Mixture Models (GMM), Agglomerative Hierarchical, Affinity Propagation, Spectral Clustering, Ward, Birch, etc. Most of these methods are described and implemented in the open-source Python package scikit-learn [73]. The use of six different unsupervised, clustering-based methods to classify volcano-seismic events was explored at Cotopaxi Volcano [32]. One of the most difficult issues is the choice of the number of clusters into which the data should be divided; in most cases this number has to be fixed a priori, before running the code. Several techniques exist to help with this choice, such as the elbow method, silhouette analysis, gap statistics and other heuristics; many of them are described and implemented in the R package NbClust [74]. Problems arise when the estimates they provide are contradictory.
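As a minimal sketch of the non-hierarchical approach, the following uses scikit-learn's K-means together with the silhouette score to pick the number of clusters; the data here are synthetic stand-ins for real seismic feature vectors:

```python
# Illustrative sketch: K-means clustering with the silhouette criterion
# used to choose the number of clusters (fixed a priori for each run).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for extracted feature vectors.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher means better-separated clusters

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

The same loop could be repeated with the elbow method or gap statistics; when the criteria disagree, domain knowledge has to break the tie.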

Another approach to unsupervised classification is the SOM (**Figure 2b**), or Kohonen map [75, 76], a type of ANN trained to produce a low-dimensional (usually 2D) discretized representation of the feature vector space. The training is based on competitive and cooperative learning, using a neighborhood function to preserve the topological properties of the input.
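A toy NumPy implementation can make the mechanism concrete; grid size, learning rates and decay schedules below are arbitrary illustrative choices, not values from the literature:

```python
# Minimal SOM sketch: competitive learning (best matching unit) plus a
# Gaussian neighborhood function that preserves the input topology.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 3))            # stand-in feature vectors
grid = 5                            # 5x5 map -> 2D discretized representation
W = rng.random((grid, grid, 3))     # one weight vector per map node
coords = np.dstack(np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij"))

for t, x in enumerate(np.tile(X, (5, 1))):
    lr = 0.5 * np.exp(-t / 500)         # decaying learning rate
    sigma = 2.0 * np.exp(-t / 500)      # shrinking neighborhood radius
    # Competition: find the node whose weights are closest to the input.
    bmu = np.unravel_index(np.argmin(((W - x) ** 2).sum(axis=2)), (grid, grid))
    # Cooperation: nodes near the winner on the map are updated more strongly.
    d2 = ((coords - np.array(bmu)) ** 2).sum(axis=2)
    h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
    W += lr * h * (x - W)

print(W.shape)  # the trained 5x5 map of 3D weight vectors
```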

A very common type of ANN, often used for supervised classification, is the MLP, which consists of at least three layers of nodes (**Figure 2c**): an input layer, (at least) one hidden layer and an output layer. Nodes use nonlinear activation functions and are trained through the backpropagation mechanism. When the number of hidden layers of an ANN becomes very high, we talk of Deep Neural Networks (DNN), which are also used mainly in a supervised fashion. Among DNN, CNN (**Figure 2d**) contain at least some convolutional layers, which convolve their inputs by multiplication or another dot product. The activation function in the case of CNN is commonly the rectified linear unit (ReLU), and there are also pooling layers, fully connected layers and normalization layers.

### **Figure 2.**

*Schematic illustration of some of the ML techniques described in the text. (a) Cluster analysis in its hierarchical and non-hierarchical versions. (b) Self-organizing maps. (c) Multilayer perceptron. (d) Convolutional neural network.*
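A hedged sketch of such a supervised MLP with scikit-learn follows; the dataset is synthetic, standing in for labeled event feature vectors:

```python
# MLP with one hidden layer, ReLU activation and backpropagation training.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                    max_iter=1000, random_state=0)
mlp.fit(X_tr, y_tr)          # gradient-based backpropagation training
print(round(mlp.score(X_te, y_te), 2))
```

Stacking further hidden (or, in dedicated frameworks, convolutional and pooling) layers turns this architecture into the DNN/CNN described above.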

An RNN is a type of ANN with a feedback loop (**Figure 3a**), in which neuron outputs can also be used as inputs to neurons in the same layer, allowing the network to maintain some information during the training process. Long Short-Term Memory networks (LSTM) are a subset of RNN, capable of learning long-term dependencies [77] and of retaining information over long periods of time. RNN can be used for both supervised and unsupervised learning.
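The feedback loop can be illustrated with a toy forward pass in NumPy; the shapes and the tanh activation are conventional choices, not taken from the text:

```python
# Toy RNN forward pass: the hidden state h is fed back at each step,
# carrying information along the sequence.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8
W_x = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_h = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden (the loop)
b = np.zeros(n_hidden)

sequence = rng.normal(size=(20, n_in))  # 20 time steps of feature vectors
h = np.zeros(n_hidden)                  # initial hidden state
for x_t in sequence:
    h = np.tanh(W_x @ x_t + W_h @ h + b)  # output reused as input next step

print(h.shape)
```

LSTM cells replace the single tanh update with gated updates, which is what lets them retain information over much longer spans.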


### **Figure 3.**

*Schematic illustration of some of the ML techniques described in the text. (a) Recurrent neural network. (b) Logistic regression. (c) Support vector machine. (d) Random forest. (e) Hidden Markov model.*

*Machine Learning in Volcanology: A Review. DOI: http://dx.doi.org/10.5772/intechopen.94217*

*Updates in Volcanology – Transdisciplinary Nature of Volcano Science*

Logistic regression (LR) (**Figure 3b**) is a supervised generalized linear model, i.e., the dependence of the classification (probability) on the features is linear [78]. To avoid the problems linked to the high dimensionality of the data, techniques such as the Least Absolute Shrinkage and Selection Operator (LASSO) can be applied to reduce the number of dimensions of the feature vectors input to LR [79].

SVM (**Figure 3c**) constitute a supervised statistical learning framework [80], most commonly used as a non-probabilistic binary classifier. Examples are seen as points in space, and the aim is to separate the categories by a gap that is as wide as possible. Unknown samples are then assigned to a category based on the side of the gap on which they fall. To perform a non-linear classification, data are mapped into high-dimensional feature spaces using suitable kernel functions.
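A hedged scikit-learn sketch of both ideas: LASSO-based feature selection feeding a logistic regression, and an SVM with an RBF kernel for the non-linear case. The data are synthetic and the regularization strength is an arbitrary illustrative choice:

```python
# LASSO selects features with non-zero coefficients, then LR classifies;
# the kernel SVM implicitly maps data into a high-dimensional space.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=0)

# Dimensionality reduction via LASSO, then logistic regression.
lr_pipe = make_pipeline(SelectFromModel(Lasso(alpha=0.01)),
                        LogisticRegression(max_iter=1000))
lr_pipe.fit(X, y)

# Non-linear classification with an RBF kernel SVM.
svm = SVC(kernel="rbf").fit(X, y)
print(round(lr_pipe.score(X, y), 2), round(svm.score(X, y), 2))
```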


Sparse Multinomial Logistic Regression (SMLR) is a class of supervised methods for learning sparse classifiers that incorporate weighted sums of basis functions, with sparsity-promoting priors encouraging the weight estimates to be either significantly large or exactly zero [81]. The sparsity concept is similar to the one at the base of Non-negative Matrix Factorization (NMF) [82]. The sparsity-promoting priors result in automatic feature selection, helping to avoid the so-called "curse of dimensionality". Sparsity in the kernel basis functions and automatic feature selection can thus be achieved at the same time [83]. SMLR methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. Fast algorithms for SMLR exist that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in high-dimensional feature spaces.
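This is not the fast SMLR algorithm of [81], but the core idea, a multinomial logistic regression whose L1 (sparsity-promoting) penalty drives many weights exactly to zero, can be sketched with scikit-learn; all dataset and regularization parameters below are illustrative:

```python
# Multinomial logistic regression with an L1 penalty: many weights end up
# exactly zero, which performs automatic feature selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

smlr_like = LogisticRegression(penalty="l1", solver="saga", C=0.05,
                               max_iter=5000)
smlr_like.fit(X, y)
sparsity = float(np.mean(smlr_like.coef_ == 0))  # fraction of zero weights
print(round(sparsity, 2))
```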

A Decision Tree (DT) is an acyclic graph. At each branching node a specific feature *xi* is examined, and the left or right branch is followed depending on the value of *xi* relative to a given threshold. A class is assigned to each datum when a leaf node is reached. A DT can be learned from labeled data using different strategies. Within the DT class we can mention the Best First Decision Tree (BFT), Functional Tree (FT), J48 Decision Tree (J48DT), Naïve Bayes Tree (NBT) and Reduced Error Pruning Tree (REPT). Ensemble learning techniques such as Random SubSpace (RSS) can be used to combine the results of different trees [84].
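The BFT/J48/NBT variants listed above come from other toolkits (e.g. Weka); as a stand-in, scikit-learn's generic CART tree shows the feature-versus-threshold test at each branching node:

```python
# A shallow decision tree: each internal node compares one feature to a
# learned threshold; each leaf assigns a class.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # prints the feature/threshold rule at each node
```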

The Boosting concept, a kind of ensemble meta-algorithm mostly (but not only) associated with supervised learning, uses the original training data to iteratively create multiple models with a weak learner. Each model differs from the previous one, as the weak learners try to "fix" the errors made by the previous models; an ensemble model then combines the results of the different weak models. Bootstrap aggregating, also called by the contracted name bagging, instead consists of creating many "almost-copies" of the training data (each copy slightly different from the others), applying a weak learner to each copy and finally combining the results. A popular and effective algorithm based on bagging is Random Forest (RF) (**Figure 3d**), which differs from standard bagging in just one way: at each learning step a random subset of the features is chosen, which helps to decorrelate the trees, since correlated predictors are not efficient in improving classification accuracy. Particular care must be taken in choosing the number of trees and the size of the random feature subsets.
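In scikit-learn these two tuning knobs appear directly as `n_estimators` (number of trees) and `max_features` (size of the random feature subset tried at each split); the dataset below is synthetic:

```python
# Random Forest: bagging of decision trees plus random feature subsets
# at each split to decorrelate the individual trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=15, random_state=0)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0)
score = cross_val_score(rf, X, y, cv=5).mean()
print(round(score, 2))
```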

A Hidden Markov Model (HMM) (**Figure 3e**) is a statistical model in which the system being modeled is assumed to be a Markov process: it describes a sequence of possible events in which the probability of each event depends only on the state occupied in the previous event. The states are unobservable ("hidden"), but at each state the model emits a "message" that depends probabilistically on the current state. Applications are wide in scope, from reinforcement learning to temporal pattern recognition, and the approach works well when time is important; speech [85], handwriting and gesture recognition are typical fields of application, but so is volcano seismology [69, 86].
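A two-state toy example in NumPy makes the structure explicit: hidden states follow a Markov chain, each state emits observable symbols probabilistically, and the forward algorithm computes the likelihood of an observed sequence. All probabilities below are made up for illustration:

```python
# Forward algorithm for a toy 2-state HMM with 2 observable symbols.
import numpy as np

A = np.array([[0.7, 0.3],    # transition probabilities between hidden states
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # emission probabilities P(symbol | state)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # initial state distribution

obs = [0, 1, 1, 0]           # observed symbol sequence

alpha = pi * B[:, obs[0]]    # forward variables at t = 0
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

likelihood = alpha.sum()     # P(observed sequence | model)
print(round(likelihood, 4))  # → 0.0461
```

Training (e.g. Baum-Welch) and decoding (Viterbi) build on this same forward recursion.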
