Deep Learning: Exemplar Studies in Natural Language Processing and Computer Vision

*Selma Tekir and Yalin Bastanlar*

#### **Abstract**

Deep learning has become the most popular approach in machine learning in recent years. The reason lies in the considerably high accuracies obtained by deep learning methods in many tasks, especially with textual and visual data. In fact, natural language processing (NLP) and computer vision are the two research areas in which deep learning has demonstrated its impact most strongly. This chapter first summarizes the historical evolution of deep neural networks and their fundamental working principles. After briefly introducing the natural language processing and computer vision research areas, it explains how exactly deep learning is used to solve the problems in these two areas. Several examples regarding the common tasks of these research areas, together with some discussion, are also provided.

**Keywords:** deep learning, machine learning, natural language processing, computer vision, transfer learning

#### **1. Introduction**

Early approaches to artificial intelligence (AI) sought solutions through formal representation of knowledge and the application of logical inference rules. Later on, as more data became available, machine learning approaches, which have the capability of learning from data, prevailed. Many successful examples today, such as language translation, are results of this data-driven approach. Compared to other machine learning approaches, deep learning (deep artificial neural networks) has two advantages. It benefits well from vast amounts of data (more and more of what we do is recorded every day), and it does not require the features to be learned to be defined beforehand. As a consequence, in the last decade, we have seen numerous success stories achieved with deep learning approaches, especially with textual and visual data.

In this chapter, first a relatively short history of neural networks is provided, and their main principles are explained. Then, the chapter proceeds along two parallel paths. The first path treats text data and explains the use of deep learning in the area of natural language processing (NLP). Neural network methods first transformed the core task of language modeling: neural language models were introduced and superseded n-gram language models. Thus, the task of language modeling is covered initially. The primary focus of this part is representation learning, where the main impact of deep learning approaches has been observed. Good dense representations are learned for words, senses, sentences, paragraphs, and documents. These embeddings have proved useful in capturing both syntactic and semantic features. Recent works are able to compute contextual embeddings, which can provide different representations for the same word in different contextual units. Consequently, state-of-the-art embedding methods are presented along with their applications in different NLP tasks, as the use of these pre-trained embeddings in various downstream NLP tasks introduced a substantial performance improvement.


*DOI: http://dx.doi.org/10.5772/intechopen.91813*


The second path concentrates on visual data and introduces the use of deep learning in the computer vision research area. To this end, it first covers the principles of convolutional neural networks (CNNs), the fundamental structure for working on images and videos. On a typical CNN architecture, it explains the main components, such as convolutional, pooling, and classification layers. Then, it goes over one of the main tasks of computer vision, namely, image classification. Using several examples of image classification, it explains several concepts related to training CNNs (regularization, dropout, and data augmentation). Lastly, it provides a discussion on visualizing and understanding the features learned by a CNN. Based on this discussion, it goes through the principles of how and when transfer learning should be applied, with a concrete example of a real-world four-class classification problem.

#### **2. Historical evolution of neural networks and their fundamental working principles**

#### **2.1 Historical evolution of neural networks**

Deep neural networks currently provide the best solutions to many problems in computer vision and natural language processing. Although we have been hearing the success stories only in recent years, artificial neural networks are not a new research area. In 1943, McCulloch and Pitts [1] built a neuron model that sums binary inputs and outputs 1 if the sum exceeds a certain threshold value, and 0 otherwise. They demonstrated that such a neuron can model the basic OR/AND/NOT functions (**Figure 1**). Such structures are called neurons due to the biological inspiration: the inputs ($x_i$) represent activations from nearby neurons, the weights ($w_i$) represent the synapse strengths to nearby neurons, and the activation function ($f_w$) is the cell body; if the function output is strong enough, it will be sensed by the synapses of nearby neurons.

#### **Figure 1.**

*A neuron that mimics the behavior of the logical AND operator. It multiplies each input ($x_1$ and $x_2$) and the bias unit ($+1$) with a weight and thresholds the sum of these to output 1 if the sum is big enough (similar to our neurons that either fire or not).*
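As an illustration, the AND neuron of **Figure 1** can be written in a few lines of Python. The particular weights below are one illustrative choice, not the only one; any weights that make the sum positive only when both inputs are 1 would work:

```python
def and_neuron(x1, x2):
    """McCulloch-Pitts-style neuron for logical AND.
    Inputs and the +1 bias unit are combined with fixed weights;
    the neuron "fires" (outputs 1) only if the weighted sum
    crosses the threshold of 0."""
    bias = 1                            # the +1 bias unit from Figure 1
    w_bias, w1, w2 = -1.5, 1.0, 1.0     # illustrative weights
    s = w_bias * bias + w1 * x1 + w2 * x2
    return 1 if s > 0 else 0

# Truth table: only (1, 1) produces a weighted sum above the threshold
for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_neuron(a, b))
```

With these weights the sum is -1.5, -0.5, -0.5, and 0.5 for the four input combinations, so only (1, 1) fires.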

In 1957, Rosenblatt introduced perceptrons [2]. The idea was not different from the neuron of McCulloch and Pitts, but Rosenblatt came up with a way to make such artificial neurons learn: given a training set of input-output pairs, weights are increased or decreased depending on the comparison between the perceptron's output and the correct output. Rosenblatt also implemented the idea of the perceptron in custom hardware and showed it could learn to classify simple shapes correctly with 20×20 pixel-like inputs (**Figure 2**).
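Rosenblatt's update rule can be sketched as follows; the learning rate, epoch count, and toy dataset here are illustrative choices, not values from the original work:

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Rosenblatt's learning rule: for each example, nudge the
    weights toward the correct output whenever the prediction is wrong."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            # (target - pred) is 0 when correct, so no update then
            w += lr * (target - pred) * xi
            b += lr * (target - pred)
    return w, b

# Toy linearly separable data: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
preds = [1 if x @ w + b > 0 else 0 for x in X]
print(preds)  # [0, 0, 0, 1]
```

For linearly separable data like this, the perceptron convergence theorem guarantees the rule finds a separating set of weights in finitely many updates.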

Marvin Minsky, the founder of the MIT AI Lab, and Seymour Papert together wrote a book analyzing the limitations of perceptrons [4]. In this book, perceptrons were argued to be a dead end as an approach to AI: a single layer of neurons was not enough to solve complicated problems, and Rosenblatt's learning algorithm did not work for multiple layers. This conclusion caused a declining period in AI funding and publications, which is usually referred to as the "AI winter."

Paul Werbos proposed that backpropagation can be used in neural networks [5]. He showed how to train multilayer perceptrons in his PhD thesis (1974), but due to the AI winter, it took a decade for researchers to return to this area. In 1986, the approach became popular with "Learning representations by back-propagating errors" by Rumelhart et al. [6]. In 1989, it was applied for the first time to a computer vision task, handwritten digit classification [7], and demonstrated excellent performance. However, after a short while, researchers started to face problems with the backpropagation algorithm: deep (multilayer) neural networks trained with backpropagation did not work very well and, in particular, did not work as well as networks with fewer layers. It turned out that the magnitudes of backpropagated errors shrink very rapidly, which prevents earlier layers from learning; today this is called "the vanishing gradient problem." Again, it took more than a decade for computers to handle more complex tasks. Some people prefer to name this period the second AI winter.

**Figure 2.**

*Mark I Perceptron at the Cornell Aeronautical Laboratory, hardware implementation of the first perceptron (source: Cornell University Library [3]).*

*Data Mining - Methods, Applications and Systems*


Later, it was discovered that the initialization of weights is of critical importance for training, and that with a better choice of nonlinear activation function, the vanishing gradient problem can be avoided. In the meantime, computers got faster (especially thanks to GPUs), and huge amounts of data became available for many tasks. G. Hinton and two of his graduate students demonstrated the effectiveness of deep networks on a challenging AI task: speech recognition. They managed to improve on a decade-old performance record on a standard speech recognition dataset. In 2012, a CNN (again by G. Hinton and his students) won against other machine learning approaches at the Large Scale Visual Recognition Challenge (ILSVRC) image classification task for the first time.

#### **2.2 Working principles of a deep neural network**

Technically, any neural network with two or more hidden layers is "deep." However, in recent papers, deep networks usually refer to ones with many more layers. We show a simple network in **Figure 3**, where the first layer is the input layer, the last layer is the output layer, and the ones in between are the hidden layers.

In **Figure 3**, $a_j^{(i)}$ denotes the value after the activation function is applied to the inputs in the $j$th neuron of the $i$th layer. If the predicted output of the network, which is $a_1^{(4)}$ in this example, is close to the actual output, then the "loss" is low. The previously mentioned backpropagation algorithm uses derivatives to carry the loss to the previous layers. $\frac{\partial L}{\partial a_1^{(4)}}$ represents the derivative of the loss with respect to $a_1^{(4)}$, whereas $\frac{\partial L}{\partial a_1^{(2)}}$ represents the derivative of the loss with respect to the second-layer neuron $a_1^{(2)}$, i.e., how much of the final error (loss) neuron $a_1^{(2)}$ is responsible for.

The activation function is the element that gives a neural network its nonlinear representation capacity. Therefore, we always choose a nonlinear function. If the activation function were linear, each layer would perform a linear mapping of its input to its output. Thus, no matter how many layers there were, since linear functions are closed under composition, this would be equivalent to having a single (linear) layer.
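This collapse is easy to check numerically; here is a quick sketch with arbitrary random matrices standing in for the weights of two linear layers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # an arbitrary input vector

# Two "layers" with a linear (identity) activation
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

two_layer = W2 @ (W1 @ x)              # pass x through both layers
one_layer = (W2 @ W1) @ x              # one equivalent linear layer

# Composition of linear maps is a single linear map
print(np.allclose(two_layer, one_layer))  # True
```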

#### **Figure 3.**

*A simple neural network with two hidden layers. Entities plotted with thicker lines are the ones included in Eq. (1), which will be used to explain the vanishing gradient problem.*


The choice of activation function is critically important. In the early days of multilayer networks, people employed *sigmoid* or *tanh*, which cause the problem known as the vanishing gradient. Let's explain the vanishing gradient problem with the network shown in **Figure 3**.

$$\frac{\partial L}{\partial a\_1^{(2)}} = w^{(2)} \cdot \sigma' \Big( z^{(3)} \Big) \cdot w^{(3)} \cdot \sigma' \Big( z\_1^{(4)} \Big) \cdot \frac{\partial L}{\partial a\_1^{(4)}} \tag{1}$$

Eq. (1) shows how the error in the final layer is backpropagated to a neuron in the first hidden layer, where $w^{(i)}$ denotes the weights in layer $i$ and $z_j^{(i)}$ denotes the weighted input to the $j$th neuron in layer $i$. Here, let's assume *sigmoid* is used as the activation function. Then, $a_j^{(i)}$ denotes the value after the activation function is applied to $z_j^{(i)}$, i.e., $a_j^{(i)} = \sigma(z_j^{(i)})$. Finally, let $\sigma'$ denote the derivative of the *sigmoid* function. The entities in Eq. (1) are plotted with thicker lines in **Figure 3**.

**Figure 4** shows the derivative of the *sigmoid*; even at its highest point, the derivative is only $0.25$, and most of the time it is much less. Thus, at each layer, $w^{(j)} \cdot \sigma'(z^{(j+1)}) \le 0.25$ in Eq. (1), and as a result the products decrease exponentially: $\frac{\partial L}{\partial a_1^{(2)}}$ becomes 16 (or more) times smaller than $\frac{\partial L}{\partial a_1^{(4)}}$. Gradients thus become very small (vanish), updates on weights get smaller, and earlier layers begin to "learn" very slowly. A detailed explanation of the vanishing gradient problem can be found in [8].
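The shrinkage in Eq. (1) can be seen numerically. A minimal sketch, assuming weights of 1 and all pre-activations at 0 (the best possible case for sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative of sigmoid peaks at z = 0 with value 0.25
print(sigmoid_prime(0.0))  # 0.25

# Eq. (1) in the best case: assumed weights of 1, pre-activations of 0.
dL_da4 = 1.0                  # gradient arriving at the output neuron
w2 = w3 = 1.0                 # assumed layer weights
dL_da2 = w2 * sigmoid_prime(0.0) * w3 * sigmoid_prime(0.0) * dL_da4
print(dL_da2)  # 0.0625 -- already 16x smaller after only two layers
```

With more layers the factor of (at most) 0.25 multiplies in again and again, so the gradient scale decays exponentially with depth.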

**Figure 4.** *Derivative of the sigmoid function.*

**Figure 5.**

*Plots for some activation functions. Sigmoid is on the left, rectified linear unit is in the middle, and leaky rectified linear unit is on the right.*


Today, the choices of activation function are different. A rectified linear unit (ReLU), which outputs zero for negative inputs and the input value itself for positive inputs, is enough to eliminate the vanishing gradient problem. To gain some other advantages, leaky ReLU and parametric ReLU (where the negative side is multiplied by a coefficient) are among the popular choices (**Figure 5**).
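These activation functions are simple to state in code. A minimal sketch (the coefficient `alpha` below is an illustrative value):

```python
import numpy as np

def relu(z):
    """Rectified linear unit: zero for negative inputs, the input
    itself for positive inputs; its derivative is 1 on the positive
    side, so gradients are not shrunk there."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: the negative side is multiplied by a small
    coefficient instead of being zeroed (in parametric ReLU,
    alpha would be a learned parameter)."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # negatives clipped to 0, positives unchanged
print(leaky_relu(z))  # negatives scaled by alpha instead of clipped
```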

#### **3. Natural language processing**

Deep learning transformed the field of natural language processing (NLP). This transformation can be described by better representation learning through newly proposed neural language models and novel neural network architectures that are fine-tuned with respect to an NLP task.

Deep learning paved the way for neural language models, and these models introduced a substantial performance improvement over n-gram language models. More importantly, neural language models are able to learn good representations in their hidden layers. These representations are shown to capture both semantic and syntactic regularities that are useful for various downstream tasks.


Both word2vec variants produced word embeddings that can capture multiple degrees of similarity, including both syntactic and semantic regularities.

A regular extension to the word2vec model was doc2vec [14], where the main goal is to create a representation for different document levels, e.g., sentence and


#### **3.1 Representation learning**

Representation learning through neural networks is based on the distributional hypothesis: "words with similar distributions have similar meanings" [9] where distribution means the neighborhood of a word, which is specified as a fixed-size surrounding window. Thus, the neighborhoods of words are fed into the neural network to learn representations implicitly.

Learned representations in hidden layers are termed distributed representations [10]. Distributed representations are local in the sense that the set of activations representing a concept comes from a subset of dimensions. For instance, cat and dog are hairy and animate. The set of activations representing "being hairy" belongs to a specific subset of dimensions; in a similar way, a different subset of dimensions is responsible for the feature of "being animate." In the embeddings of both cat and dog, the local patterns of activations for "being hairy" and "being animate" are observed. In other words, the pattern of activations is local, while the conceptualization is global (e.g., cat and dog).

The idea of distributed representation was realized by [11] and other studies relied on it. Bengio et al. [11] proposed a neural language model that is based on a feed-forward neural network with a single hidden layer and optional direct connections between input and output layers.

The first breakthrough in representation learning was word2vec [12]. The authors removed the nonlinearity in the hidden layer of the model architecture proposed in [11]. This update brought about a substantial improvement in computational complexity, allowing training on billions of words. Word2vec has two variants: continuous bag-of-words (CBOW) and Skip-gram.

In CBOW, a middle word is predicted given its context, the set of neighboring left and right words. When the input sentence "creativity is intelligence having fun" is processed, the system predicts the middle word "intelligence" given the left and right contexts (**Figure 6**). Every input word is in one-hot encoding: a vocabulary-size ($V$) vector of all zeros except a one in that word's index. In the single hidden layer, instead of applying a nonlinear transformation, the average of the neighboring left and right vectors ($w_c$) is computed to represent the context. As the order of words is not taken into consideration by averaging, it is named a bag-of-words model. Then the middle word's ($w_t$) probability given the context, $p(w_t|w_c)$, is calculated through a softmax over context-middle word dot products (Eq. (2)). Finally, the output loss is calculated as the cross-entropy between the system's predicted output and the ground-truth middle word.

#### **Figure 6.**

*CBOW architecture.*

$$p(w\_t|w\_c) = \frac{\exp\left(w\_t \cdot w\_c\right)}{\sum\_{j \in V} \exp\left(w\_j \cdot w\_c\right)}\tag{2}$$
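Eq. (2) can be illustrated with a minimal plain-Python sketch. The five-word vocabulary and the 3-dimensional word vectors below are made-up values, not trained embeddings:

```python
import math

# Toy vocabulary and 3-dimensional word vectors (made-up values for illustration).
vocab = ["creativity", "is", "intelligence", "having", "fun"]
vectors = {
    "creativity":   [0.1, 0.3, -0.2],
    "is":           [0.0, 0.2, 0.1],
    "intelligence": [0.4, 0.1, 0.0],
    "having":       [-0.1, 0.0, 0.3],
    "fun":          [0.2, -0.2, 0.1],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cbow_probability(target, context_words):
    """p(w_t | w_c): softmax over dot products with the averaged context vector (Eq. (2))."""
    n = len(context_words)
    # w_c: element-wise average of the context word vectors.
    w_c = [sum(vectors[w][d] for w in context_words) / n for d in range(3)]
    scores = {w: math.exp(dot(vectors[w], w_c)) for w in vocab}
    z = sum(scores.values())
    return scores[target] / z

context = ["creativity", "is", "having", "fun"]
p = cbow_probability("intelligence", context)
total = sum(cbow_probability(w, context) for w in vocab)
```

Since the scores are softmax-normalized, the probabilities over the whole vocabulary sum to one.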

In Skip-gram, the system predicts the most probable context words for a given input word. In terms of a language model, while CBOW predicts an individual word's probability, Skip-gram outputs the probabilities of a set of words, defined by a given context size. Due to the high dimensionality of the output layer (all vocabulary words have to be considered), Skip-gram has higher computational complexity than CBOW (**Figure 7**). To deal with this issue, rather than traversing the whole vocabulary in the output layer, Skip-gram with negative sampling (SGNS) [13] formulates the problem as a binary classification, where one class represents the current context's occurrence probability, whereas the other represents all vocabulary terms' occurrence in the present context. In the latter probability calculation, a sampling approach is incorporated. As vocabulary terms are not distributed uniformly across contexts, sampling is performed from a distribution that takes the corpus frequencies of vocabulary words into consideration. SGNS incorporates this sampling idea by replacing Skip-gram's objective function. The new objective function (Eq. (3)) depends on maximizing *P*(*D* = 1|*w*, *c*), where (*w*, *c*) is the word-context pair. This probability denotes the probability of (*w*, *c*) coming from the corpus data. Additionally, *P*(*D* = 0|*ui*, *c*) should be maximized if the (*ui*, *c*) pair is not included in the corpus data. In this condition, the (*ui*, *c*) pair is, as the name suggests, negatively sampled *k* times.

$$\sum\_{(w,c)} \log \sigma\left( \vec{w} \cdot \vec{c} \right) + \sum\_{i=1}^{k} \log \sigma\left( -\vec{u}\_i \cdot \vec{c} \right) \tag{3}$$
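The SGNS objective in Eq. (3) can be sketched as follows; the vectors and the number of negative samples (*k* = 2) are made-up values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sgns_objective(w_vec, c_vec, negative_vecs):
    """Eq. (3): log sigma(w.c) plus, for each of the k negative samples, log sigma(-u_i.c)."""
    obj = math.log(sigmoid(dot(w_vec, c_vec)))
    for u in negative_vecs:
        obj += math.log(sigmoid(-dot(u, c_vec)))
    return obj

# Made-up 3-d vectors: one observed (word, context) pair and k = 2 negative samples.
w = [0.5, 0.1, -0.2]
c = [0.4, 0.0, 0.1]
negatives = [[-0.3, 0.2, 0.1], [0.0, -0.4, 0.2]]
objective = sgns_objective(w, c, negatives)
```

The objective is a log-likelihood (always negative); training maximizes it, pushing observed pairs together and sampled pairs apart.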

Both word2vec variants produced word embeddings that can capture multiple degrees of similarity including both syntactic and semantic regularities.

*Deep Learning: Exemplar Studies in Natural Language Processing and Computer Vision*
*DOI: http://dx.doi.org/10.5772/intechopen.91813*

A regular extension to the word2vec model was doc2vec [14], where the main goal is to create a representation for different document levels, e.g., sentence and paragraph. Its architecture is quite similar to word2vec except for the extension with a document vector. It generates a vector for each document and word. The system takes the document vector and its words' vectors as an input. Thus, the document vectors are adjusted with regard to all the words in the document. At the end, the system provides both document and word vectors. Two architectures are proposed, known as the distributed memory model of paragraph vectors (DM) and the distributed bag-of-words model of paragraph vectors (DBOW).

DM: In this architecture, the inputs are the document and the words of a context window except for the last one, and the output is the last word of the context. The word vectors and the document vector are concatenated while they are fed into the system.

DBOW: The input of the architecture is a document vector. The model predicts the words randomly sampled from the document.
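The two training regimes can be sketched by the (input, target) pairs they derive from a toy document; the document id, window size, and word list below are illustrative assumptions:

```python
import random

# Sketch of the training pairs the two doc2vec variants derive from one toy document.
document_id = "doc_1"
words = ["deep", "learning", "transformed", "natural", "language", "processing"]

# DM: input = the document vector plus the context words except the last one;
# target = the last word of the context window.
context_size = 4
dm_examples = []
for i in range(len(words) - context_size + 1):
    window = words[i:i + context_size]
    dm_examples.append(((document_id, window[:-1]), window[-1]))

# DBOW: input = the document vector alone; target = a word sampled from the document.
random.seed(0)
dbow_examples = [(document_id, random.choice(words)) for _ in range(3)]
```

In DM, the document vector acts as a memory shared by all windows of the document; in DBOW, it alone must predict the document's words.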

An important extension to word2vec and its variants is fastText [15], which uses characters together with words to learn better representations. In the fastText language model, the score between a context word and the middle word is computed based on all character n-grams of the word as well as the word itself. Here, n-grams are contiguous sequences of *n* letters: a unigram is a single letter, a bigram two consecutive letters, a trigram three letters in succession, etc. In Eq. (4), *vc* represents a context vector, *zg* is a vector associated with each n-gram, and *Gw* is the set of all character n-grams of the word *w* together with the word itself.

$$s(w, c) = \sum\_{\mathcal{g} \in G\_{\mathcal{w}}} \left( \mathbf{z}\_{\mathcal{g}}^T \boldsymbol{\nu}\_c \right) \tag{4}$$
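Eq. (4) can be sketched in plain Python. The boundary markers "<" and ">" follow the common fastText convention, and all vector values are made up:

```python
# Character n-grams (n = 3 here) of a word, with boundary markers, plus the word itself.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [word]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fasttext_score(word, z, v_c):
    """Eq. (4): sum of dot products between each n-gram vector z_g and the context vector v_c."""
    return sum(dot(z[g], v_c) for g in char_ngrams(word))

# Made-up 2-d n-gram vectors and context vector.
z = {"<wh": [0.1, 0.2], "whe": [0.0, 0.3], "her": [0.2, 0.1], "ere": [0.1, 0.0],
     "re>": [0.0, 0.1], "where": [0.3, 0.2]}
v_c = [1.0, 1.0]
score = fasttext_score("where", z, v_c)
```

An out-of-vocabulary word still gets a score as long as its character n-grams have vectors, which is the point of the subword extension.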

The idea of using the smallest syntactic units in the representation of words introduced an improvement for morphologically rich languages and makes it possible to compute representations for out-of-vocabulary words.

The recent development in representation learning is the introduction of contextual representations. Early word embeddings have some problems. Although they can learn syntactic and semantic regularities, they are not so good in capturing a mixture of them. For example, they can capture the syntactic pattern *look-looks-looked*.


In a similar way, the words *hard*, *difficult*, and *tough* are embedded into closer points in the space. To address both syntactic and semantic features, Kim et al. [16] used a mixture of character- and word-level features. In their model, at the lowest level of the hierarchy, character-level features are processed by a CNN; after transferring these features over a highway network, high-level features are learned by a long short-term memory (LSTM). Thus, the resulting embeddings showed good syntactic and semantic patterns. For instance, the closest words to the word *richard* are returned as *eduard*, *gerard*, *edward*, and *carl*, where all of them are person names with syntactic similarity to the query word. Due to character-aware processing, their models are able to produce good representations for out-of-vocabulary words.

The idea of capturing syntactic features at low levels of the hierarchy and semantic ones at higher levels was ultimately realized by Embeddings from Language Models (ELMo) [17]. ELMo proposes a deep bidirectional language model to learn complex features. Once these features are learned, the pre-trained model is used as an external knowledge source for a model that is fine-tuned on task-specific data. Thus, in addition to static embeddings from the pre-trained model, contextual embeddings can be taken from the fine-tuned one.

Another drawback of previous word embeddings is that they unite all the senses of a word into one representation; thus, different contextual meanings cannot be addressed. The recent ELMo and Bidirectional Encoder Representations from Transformers (BERT) [18] models resolve this issue by providing a different representation for every occurrence of a word. BERT integrates a bidirectional Transformer language model with a masked language modeling objective to provide a fine-tuned language model that yields different representations with respect to different contexts.

#### **3.2 NLP with neural network solutions**

In NLP, different neural network solutions have been used in various downstream tasks.

Language data are temporal in nature, so recurrent neural networks (RNNs) seem a good fit to such tasks in general. RNNs have been used to learn long-range dependencies. However, because each computation depends on the previous time steps, they have efficiency problems. Furthermore, when sequences get longer, information loss occurs due to the vanishing gradient problem.

Long short-term memory architectures were proposed to tackle the problem of information loss in the case of long sequences. Gated recurrent units (GRUs) are another alternative to LSTMs. They use a gate mechanism to learn how much of the past information to preserve at the next time step and how much to erase.
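The gating idea can be sketched with a one-dimensional GRU-style update; the reset gate is omitted for brevity, and all weights are made-up scalars:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One-dimensional GRU-style update: the update gate z decides how much of the
# previous hidden state to preserve and how much to overwrite with the candidate.
def gru_step(h_prev, x, w_z, u_z, w_h, u_h):
    z = sigmoid(w_z * x + u_z * h_prev)          # update gate, in (0, 1)
    h_tilde = math.tanh(w_h * x + u_h * h_prev)  # candidate state
    return (1 - z) * h_prev + z * h_tilde        # interpolate: preserve vs. erase

h = 0.5
h_next = gru_step(h, x=1.0, w_z=0.8, u_z=0.1, w_h=0.5, u_h=0.3)
```

Because the new state is a convex combination of the old state and the candidate, the gate directly controls how much past information survives each step.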

Convolutional neural networks have been used to capture short-range dependencies, such as learning word representations over characters and sentence representations over n-grams. Compared to RNNs, they are quite efficient due to independent processing of features. Moreover, through the use of different convolution filter sizes (overlapping localities) followed by concatenation, their learning regions can be extended.
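A sketch of this idea with two hypothetical filter widths over a toy embedded sentence; each filter's responses are max-pooled over time and the results concatenated (all values made up):

```python
# 1-D convolutions with two filter widths over a toy embedded sentence.
sentence = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.4], [0.2, 0.2]]  # 4 words, 2-d embeddings

def conv1d(seq, filt):
    width = len(filt)
    out = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        # Sum of element-wise products between the filter and the word window.
        out.append(sum(w * x for row_w, row_x in zip(filt, window)
                       for w, x in zip(row_w, row_x)))
    return out

bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
trigram_filter = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

# Each filter yields one feature after max pooling; concatenation widens the representation.
features = [max(conv1d(sentence, bigram_filter)), max(conv1d(sentence, trigram_filter))]
```

Every window position is processed independently of the others, which is why such models parallelize better than RNNs.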

**Figure 7.**
*Skip-gram architecture.*

Machine translation is a core NLP task that has witnessed innovative neural network solutions that later gained wide application. Neural machine translation aims to translate sequences from a source language into a target language using neural network architectures. Theoretically, it is a conditional language model, where the next word depends on the previous set of words in the target sequence and on the source sentence at the same time. In traditional language modeling, the next word's probability is computed based solely on the previous set of words; thus, in conditional language modeling, conditional means conditioned on the source sequence's representation. In machine translation, the source sequence's processing is termed the encoder part of the model, whereas the next word prediction task in the target language is called the decoder. In probabilistic terms, machine translation aims to maximize the probability of the target sequence *y* given the source sequence *x* as follows.

$$\underset{y}{\arg\max}\, P(y|x)\tag{5}$$

This conditional probability can be decomposed, via the chain rule, into a product of conditional probabilities, one per time step, each conditioned on the previously generated target words and the source sequence (Eq. (6)).

$$\begin{aligned} P(y|\mathbf{x}) &= P(y\_1|\mathbf{x})P(y\_2|y\_1, \mathbf{x})P(y\_3|y\_1, y\_2, \mathbf{x}), \dots, P(y\_t|y\_1, \dots, y\_{t-1}, \mathbf{x}) \\ &= \prod\_{i=1}^t P(y\_i|y\_1, \dots, y\_{i-1}, \mathbf{x}) \end{aligned} \tag{6}$$
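Eq. (6) in action on made-up per-step probabilities:

```python
# Toy conditional language model: the per-step probabilities p(y_i | y_<i, x) below
# are made up; the sequence probability is their product, as in Eq. (6).
step_probs = [0.6, 0.5, 0.9]   # p(y_1|x), p(y_2|y_1,x), p(y_3|y_1,y_2,x)

p_sequence = 1.0
for p in step_probs:
    p_sequence *= p
```

In practice the sum of log probabilities is used instead of the raw product, since products of many small probabilities underflow.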

The first breakthrough neural machine translation model was an LSTM-based encoder-decoder solution [19]. In this model, the source sentence is represented by the last hidden state of the encoder LSTM. In the decoder part, the next word prediction is based on both the encoder's source representation and the previous set of words in the target sequence. The model introduced a significant performance boost at the time of its release.

In neural machine translation, the problem of maximizing the probability of a target sequence given the source sequence can be broken down into two components by applying Bayes rule on Eq. (5): the probability of a source sequence given the target and the target sequence's probability (Eq. (7)).

$$\underset{y}{\arg\max}\, P(x|y)P(y)\tag{7}$$

**Figure 8.**
*Sequence-to-sequence attention.*

In this alternative formulation, *P*(*x*|*y*) is termed the translation model and *P*(*y*) is a language model. The translation model aims to learn correspondences between source and target pairs using a parallel training corpus. This learning objective is related to the task of learning word-level correspondences between sentence pairs. This alignment task is vital in that a correct translation requires generating the counterpart word(s) for each local set of words in the source sentence. For instance, the French word group "tremblement de terre" must be translated into English as the word "earthquake," and these correspondences must be learned in the process.
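The noisy-channel selection in Eq. (7) can be sketched with made-up probabilities for two hypothetical candidate translations of "tremblement de terre":

```python
# Noisy-channel scoring (Eq. (7)) with made-up probabilities: the translation model
# P(x|y) times the language model P(y); the argmax candidate is picked.
candidates = {
    "earthquake":     {"p_x_given_y": 0.7, "p_y": 0.010},
    "ground shaking": {"p_x_given_y": 0.6, "p_y": 0.002},
}

def score(c):
    return candidates[c]["p_x_given_y"] * candidates[c]["p_y"]

best = max(candidates, key=score)
```

Even if two candidates explain the source equally well, the language model term favors the more fluent target sequence.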

Bahdanau et al. [20] propose an attention mechanism that directly connects to each word in the encoder part when predicting the next word at each decoder step. This mechanism provides a solution to alignment in that every word in the translation is predicted by considering all words in the source sentence, and the predicted word's correspondences are learned by the weights in the attention layer (**Figure 8**).

Attention is a weighted sum of values with respect to a query. The learned weights serve as the degree of the query's interaction with the values at hand. In the case of translation, the values are the encoder hidden states, and the query is the decoder hidden state at the current time step. Thus, the weights are expected to show each translation step's grounding on the encoder hidden states.

Eq. (8) gives the formula for the attention mechanism. Here, *hi* represents each hidden state in the encoder (VALUES in **Figure 8**), *wi* is the query vector coming from the current hidden state of the decoder (each QUERY in **Figure 8**), *αi* (WEIGHTS in **Figure 8**) are the attention weights, and *K* (OUTPUT in **Figure 8**) is the attention output, which is combined with the last hidden state of the decoder to make the next word prediction in translation.

$$\begin{aligned} \alpha\_i &= \frac{\exp\left(h\_i \cdot w\_i\right)}{\sum\_j \exp\left(h\_j \cdot w\_j\right)}\\ o\_i &= \alpha\_i h\_i\\ K &= \sum\_i o\_i \end{aligned} \tag{8}$$
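Eq. (8) can be sketched in plain Python; the encoder states and the query are made-up 2-dimensional vectors, and a single query vector is used for the current decoder step:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(hidden_states, query):
    """Eq. (8): softmax over encoder-state/query dot products, then a weighted sum."""
    scores = [math.exp(dot(h, query)) for h in hidden_states]
    z = sum(scores)
    weights = [s / z for s in scores]                                       # alpha_i
    outputs = [[w * x for x in h] for w, h in zip(weights, hidden_states)]  # o_i
    k = [sum(col) for col in zip(*outputs)]                                 # K = sum_i o_i
    return weights, k

# Made-up 2-d encoder hidden states and a decoder query.
encoder_states = [[0.2, 0.1], [0.9, 0.3], [0.1, 0.8]]
query = [1.0, 0.5]
alphas, context = attention(encoder_states, query)
```

The weights sum to one, so the output is a convex combination of the encoder states; inspecting the weights is what makes attention visualizable.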

The success of attention in addressing alignment in machine translation gave rise to the idea of a solely attention-based architecture called the Transformer [21]. The Transformer architecture produced even better results in neural machine translation. More importantly, it has become the state-of-the-art solution in language modeling and started to be used as a pre-trained language model. Its use as a pre-trained language model, and the transfer of this model's knowledge to other models, introduced a performance boost in a wide variety of NLP tasks.

The contribution of attention is not limited to the performance boost introduced but is also related to supporting explainability in deep learning. The visualization of attention provides a clue to the implicit features learned for the task at hand.

#### **4. Computer vision and CNNs**

To observe the performance of the developed methods on computer vision problems, several competitions are arranged all around the world. One of them is the Large Scale Visual Recognition Challenge (ILSVRC) [22]. This event contains several tasks: image classification, object detection, and object localization. In the image classification task, the aim is to predict the class of images in the test set given a set of discrete labels, such as dog, cat, truck, plane, etc. This is not a trivial task, since different images of the same class contain quite different instances with varying viewpoints, illumination, deformation, occlusion, etc.

All competitors in ILSVRC train their models on the ImageNet [22] dataset. The ImageNet 2012 dataset contains 1.2 million images and 1000 classes. Classification performances of the proposed methods were compared according to two evaluation criteria, the top-1 and top-5 scores. In the top-5 criterion, the algorithm's top five guesses are considered for each image. If the actual image category is one of these five labels, then the image is counted as correctly classified. The total number of incorrect answers in this sense is called the top-5 error.
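The top-5 error computation can be sketched on a made-up three-image test set:

```python
# Top-5 error on a toy test set: an image counts as correct if the true label is
# among the model's five highest-ranked guesses (labels and guesses are made up).
predictions = [
    (["dog", "wolf", "cat", "fox", "coyote"], "dog"),
    (["truck", "car", "bus", "van", "train"], "plane"),
    (["cat", "dog", "lynx", "tiger", "puma"], "lynx"),
]

top5_errors = sum(1 for top5, truth in predictions if truth not in top5)
top5_error_rate = top5_errors / len(predictions)
```

Here only the second image is a top-5 error, since "plane" is not among its five guesses.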

An outstanding performance was demonstrated by a CNN (convolutional neural network) in 2012. AlexNet [23] took first place in the classification task, achieving a 16.4% error rate. There was a huge difference between the first (16.4%) and second place (26.1%). In ILSVRC 2014, GoogLeNet [24] took first place, achieving a 6.67% error rate; the positive effect of network depth was observed. One year later, ResNet took first place, achieving a 3.6% error rate [25] with a CNN of 152 layers. In the following years, even lower error rates were achieved with several modifications. Please note that human performance on this image classification task was reported to be a 5.1% error rate [22].

#### **4.1 Architecture of a typical CNN**

CNNs are the fundamental structures for working with images and videos. A typical CNN is composed of several layers interleaved with each other.

#### *4.1.1 Convolutional layer*

Convolutional layer is the core building block of a CNN. It contains plenty of learnable filters (or kernels). Each filter is convolved across the width and height of the input images. At the end of the training process, the filters of the network are able to identify specific types of appearances (or patterns). A mathematical example is given to illustrate how convolutional layers work (**Figure 9**). In this example, a 5 × 5 RGB image is given to the network. Since images are represented as 3D arrays of numbers, the input consists of three matrices. It is convolved with a filter of size 3 × 3 × 3 (height, width, and depth). In this example, convolution is applied by moving the filter one pixel at a time, i.e., stride size = 1. The first convolution operation can be seen in **Figure 9a**. After moving the kernel one pixel to the right, the second convolution operation can be seen in **Figure 9b**. Element-wise multiplication ⊙ is applied in each convolution step, the resulting products are summed, and a bias is added. Thus, the operation in **Figure 9a** is shown in Eq. (9).

$$
\begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 2 \\ 2 & 1 & 1 \end{bmatrix} \odot \begin{bmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & -1 & -1 \end{bmatrix} + \begin{bmatrix} 1 & 2 & 1 \\ 1 & 0 & 0 \\ 2 & 2 & 0 \end{bmatrix} \odot \begin{bmatrix} 0 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & -1 & -1 \end{bmatrix} + \begin{bmatrix} 2 & 2 & 0 \\ 0 & 1 & 2 \\ 0 & 1 & 1 \end{bmatrix} \odot \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & -1 & -1 \end{bmatrix} + 1 = 8 \tag{9}
$$
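A stride-1 valid convolution of this kind can be sketched in plain Python; the input volume, filter, and bias below are made-up values (not those of Figure 9):

```python
# Valid 3-D convolution with stride 1 on a toy 4x4x2 input and a 3x3x2 filter:
# at each position, the element-wise products are summed and a bias is added.
def convolve(volume, filt, bias):
    depth, h, w = len(volume), len(volume[0]), len(volume[0][0])
    fh, fw = len(filt[0]), len(filt[0][0])
    out = []
    for i in range(h - fh + 1):
        row = []
        for j in range(w - fw + 1):
            total = bias
            for d in range(depth):
                for a in range(fh):
                    for b in range(fw):
                        total += volume[d][i + a][j + b] * filt[d][a][b]
            row.append(total)
        out.append(row)
    return out

image = [[[1, 0, 2, 1], [0, 1, 1, 0], [2, 1, 0, 1], [1, 0, 1, 2]],
         [[0, 1, 0, 2], [1, 0, 1, 1], [0, 2, 1, 0], [2, 1, 0, 1]]]
kernel = [[[1, 0, -1], [0, 1, 0], [-1, 0, 1]],
          [[0, 1, 0], [1, -1, 1], [0, 1, 0]]]
activation_map = convolve(image, kernel, bias=1)
```

A 4 × 4 input and a 3 × 3 filter at stride 1 yield a 2 × 2 activation map; each entry is one dot product between the filter and an input window, plus the bias.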

*4.1.2 Pooling layer*

**Figure 10.**

**Figure 9.**

*with filter W1.*

previous layer.

**13**

average pooling can also be used.

Pooling layer is commonly used between convolutional layers to reduce the number of parameters in the upcoming layers. It makes the representations smaller and the algorithm much faster. With max pooling, filter takes the largest number in the region covered by the matrix on which it is applied. Example input, on which 2 � 2 max pooling is applied, is shown in **Figure 11**. If the input size is *w* � *h* � *n*, then the output size is ð Þ� *w=*2 ð Þ� *h=*2 *n*. Techniques such as min pooling and

*Formation of a convolution layer by applying* n *number of learnable filters on the previous layer. Each activation map is formed by convolving a different filter on the whole input. In this example input to the convolution is the RGB image itself (depth = 3). For every further layer, input is its previous layer. After*

*convolution, width and height of the next layer may or may not decrease.*

*Convolution process. (a) First convolution operation applied with filter W1. Computation gives us the top-left member of an activation map in the next layer. (b) Second convolution operation, again applied*

*Deep Learning: Exemplar Studies in Natural Language Processing and Computer Vision*

*DOI: http://dx.doi.org/10.5772/intechopen.91813*


Convolution depicted in **Figure 9** is performed with one filter which results in one matrix (called activation map) in the convolution layer. Using *n* filters for the input in **Figure 9** produces a convolution layer of depth *n* (**Figure 10**).


different images of the same class have quite different instances and varying viewpoints, illumination, deformation, occlusion, etc.

All competitors in ILSVRC train their models on the ImageNet [22] dataset. The ImageNet 2012 dataset contains 1.2 million images and 1000 classes. Classification performances of the proposed methods were compared according to two evaluation criteria, the top 1 and top 5 scores. In the top 5 criterion, the top 5 guesses of the algorithm are considered for each image. If the actual image category is one of these five labels, then the image is counted as correctly classified. The total number of incorrect answers in this sense is called the top 5 error.

An outstanding performance was observed by a CNN (convolutional neural network) in 2012: AlexNet [23] took first place in the classification task, achieving a 16.4% error rate, with a huge difference between the first (16.4%) and second place (26.1%). In ILSVRC 2014, GoogleNet [24] took first place, achieving a 6.67% error rate; the positive effect of network depth was observed. One year later, ResNet took first place, achieving a 3.6% error rate [25] with a CNN of 152 layers. In the following years, even lower error rates were achieved with several modifications. Please note that human performance on the image classification task was reported to be a 5.1% error rate [22].

#### **4.1 Architecture of a typical CNN**

CNNs are the fundamental structures while working on images and videos. A typical CNN is actually composed of several types of layers interleaved with each other.

#### *4.1.1 Convolutional layer*

The convolutional layer is the core building block of a CNN. It contains plenty of learnable filters (or kernels). Each filter is convolved across the width and height of the input. At the end of the training process, the filters of the network are able to identify specific types of appearances (or patterns). A mathematical example is given to illustrate how convolutional layers work (**Figure 9**). In this example, a 5 × 5 RGB image is given to the network. Since images are represented as 3D arrays of numbers, the input consists of three matrices. It is convolved with a filter of size 3 × 3 × 3 (height, width, and depth). In this example, convolution is applied by moving the filter one pixel at a time, i.e., with a stride size of 1. The first convolution operation can be seen in **Figure 9a**. After moving the kernel one pixel to the right, the second convolution operation can be seen in **Figure 9b**. Element-wise multiplication ⊙ is applied in each convolution phase; the operation in **Figure 9a** is written out in Eq. (9).

#### **Figure 9.**

*Convolution process. (a) First convolution operation applied with filter W1. Computation gives us the top-left member of an activation map in the next layer. (b) Second convolution operation, again applied with filter W1.*

#### **Figure 10.**

*Formation of a convolution layer by applying* n *number of learnable filters on the previous layer. Each activation map is formed by convolving a different filter on the whole input. In this example input to the convolution is the RGB image itself (depth = 3). For every further layer, input is its previous layer. After convolution, width and height of the next layer may or may not decrease.*
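As a sketch of the single-position convolution of Eq. (9), the snippet below multiplies each input channel element-wise with the corresponding filter channel, sums everything, and adds the bias. Only the second and third channels of the Eq. (9) example appear in this excerpt, so the first channel is omitted and the result is a partial contribution rather than the full value of 8.

```python
import numpy as np

# A 3x3 patch of the input and the matching filter weights, one pair
# per channel. Channel 1 of the Eq. (9) example is omitted here.
patch = np.array([
    [[1, 2, 1], [1, 0, 0], [2, 2, 0]],    # input channel 2
    [[2, 2, 0], [0, 1, 2], [0, 1, 1]],    # input channel 3
])
filt = np.array([
    [[0, 0, 1], [1, 1, 0], [0, -1, -1]],  # filter channel 2
    [[1, 1, 0], [1, 1, 0], [0, -1, -1]],  # filter channel 3
])
bias = 1

# Element-wise multiply, sum over all channels and positions, add bias.
activation = np.sum(patch * filt) + bias
print(activation)  # contribution of channels 2 and 3 plus bias: 4
```

Sliding this computation across every spatial position of the input produces one activation map.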

#### *4.1.2 Pooling layer*

The pooling layer is commonly used between convolutional layers to reduce the number of parameters in the upcoming layers. It makes the representations smaller and the algorithm much faster. With max pooling, the filter takes the largest number in the region covered by the matrix on which it is applied. An example input, on which 2 × 2 max pooling is applied, is shown in **Figure 11**. If the input size is *w* × *h* × *n*, then the output size is (*w*/2) × (*h*/2) × *n*. Techniques such as min pooling and average pooling can also be used.
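A minimal 2 × 2 max pooling over a single feature map can be sketched as follows (a plain numpy illustration, not tied to any particular framework):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the largest value per block."""
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]   # drop a trailing odd row/column
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 0],
                 [0, 2, 7, 8],
                 [1, 2, 3, 4]])
print(max_pool_2x2(fmap))  # [[6 4] [2 8]]: width and height are halved
```

Applying this to each of the *n* maps independently gives the (*w*/2) × (*h*/2) × *n* output described above.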

Standard CNNs generally have several convolution layers, followed by pooling layers, and a few fully connected layers at the end (**Figure 12**). CNNs are similar to standard neural networks, but instead of connecting weights to all units of the previous layer, a convolution operation is applied on the units (voxels) of the previous layer. This scales the number of weights efficiently, since a filter has a fixed number of weights, independent of the number of voxels in the previous layer.

#### *Data Mining - Methods, Applications and Systems*


**Figure 12.** *A typical CNN for image classification task.*


#### **Figure 13.**

*An example of softmax classification loss calculation. Computed loss, Li, is only for the* i*th sample in the dataset.*
#### *4.1.3 Classification layer*

What we have in the last fully connected layer of a classification network are the output scores for each class. It may seem trivial to select the class with the highest score to make a decision; however, we need to define a loss to be able to train the network. The loss is defined according to the scores obtained for the classes. A common practice is to use the softmax function, which first converts the class scores into normalized probabilities (Eq. (10)):

$$p\_j = \frac{e^{o\_j}}{\sum\_k e^{o\_k}} \tag{10}$$

where *k* is the number of classes, *oj* are the output neurons (scores), and *pj* are the normalized probabilities. The softmax loss is the negative log of the normalized probability of the correct class. An example calculation of the softmax loss with three classes is given in **Figure 13**.
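Eq. (10) and the loss can be sketched in a few lines. The scores below are hypothetical values for three classes, and subtracting the maximum score before exponentiating is a standard numerical-stability trick not spelled out in the text:

```python
import numpy as np

def softmax_loss(scores, correct_class):
    # Eq. (10): convert raw class scores o_j into probabilities p_j.
    # Shifting by the max score avoids overflow in exp().
    e = np.exp(scores - np.max(scores))
    p = e / e.sum()
    # The loss is the negative log probability of the correct class.
    return -np.log(p[correct_class])

scores = np.array([1.0, 2.0, 3.0])          # hypothetical class scores
print(softmax_loss(scores, correct_class=2))  # about 0.41
```

Note that the loss shrinks toward zero as the correct class's score dominates the others.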

#### **4.2 Generalization capability of CNNs**

The ability of a model to make correct predictions for new samples after being trained on the training set is defined as generalization. Thus, we would like to train a CNN with a high generalization capacity: its high accuracy should not hold only for training samples. In general, we should increase the size and variety of the training data, and we should avoid training an excessively complex model (which leads to what is simply called overfitting). Since it is not always easy to obtain more training data and to pick the best complexity for our model, let's discuss a few popular techniques to increase the generalization capacity.

#### *4.2.1 Regularization loss*

This is a term, *R(W)*, added to the data loss with a coefficient (*λ*) called the regularization strength (Eq. (11)). The regularization loss can be the sum of the L1 or L2 norms of the weights. The interpretation of *R(W)* is that we prefer smaller weights, which yield smoother models with better generalization: no input dimension can have a very large influence on the scores all by itself.

$$L = \frac{1}{N} \sum\_{i=1}^{N} L\_i + \lambda \cdot R(W) \tag{11}$$
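Eq. (11) can be sketched as follows, taking *R(W)* to be the L2 penalty (the sum of squared weights); the λ value and the toy numbers are arbitrary choices for illustration:

```python
import numpy as np

def total_loss(data_losses, W, lam):
    # Eq. (11): mean data loss over the N samples plus lambda * R(W),
    # with R(W) chosen here as the L2 penalty sum(W^2).
    return np.mean(data_losses) + lam * np.sum(W ** 2)

data_losses = np.array([1.0, 3.0])   # per-sample losses L_i
W = np.ones((2, 2))                  # toy weight matrix
print(total_loss(data_losses, W, lam=0.5))  # 2.0 + 0.5 * 4 = 4.0
```

Raising λ pushes the optimizer toward smaller weights at the expense of a higher data loss.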

#### *4.2.2 Dropout*

Another way to prevent overfitting is a technique called dropout, which corresponds to removing some units in the network [26]. The neurons which are "dropped out" in this way do not contribute to the forward pass (computation of loss for a given input) and do not participate in backpropagation (**Figure 14**). In each forward pass, a random set of neurons are dropped (with a hyperparameter of dropping probability, usually 0.5).
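Dropout can be sketched as a random binary mask over a layer's activations. The 1/(1 − *p*) rescaling ("inverted dropout") is a common implementation detail not stated in the text: it keeps the expected activation the same, so nothing needs to change at test time.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dropout(activations, p=0.5, train=True):
    # In each forward pass, drop every unit independently with
    # probability p; at test time the layer is left untouched.
    if not train:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

out = dropout(np.ones(8))
print(out)  # each entry is either 0.0 (dropped) or 2.0 (kept, rescaled)
```

A fresh mask is drawn on every forward pass, so a different random subnetwork is trained each time.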

#### *4.2.3 Data augmentation*

The more training samples a model has, the more successful it will be. However, it is rarely possible to obtain large datasets, either because it is hard to collect more samples or because it is expensive to annotate a large number of samples. Therefore, to increase the size of the existing raw data, producing synthetic data is sometimes preferred. For visual data, the dataset can be enlarged with random translations, rotations, crops, and flips, or by altering brightness and contrast [27].
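Two of the augmentations mentioned above, a horizontal flip and a random crop, can be sketched as follows; the 10% crop margin is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def augment(image):
    """Return a randomly flipped and cropped copy of an (h, w, c) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                 # horizontal flip
    h, w = image.shape[:2]
    top = rng.integers(0, max(1, h // 10))     # crop away up to 10%
    left = rng.integers(0, max(1, w // 10))
    return image[top:top + 9 * h // 10, left:left + 9 * w // 10]

img = np.zeros((100, 100, 3))
print(augment(img).shape)  # (90, 90, 3): same content, a new view of it
```

Each call yields a slightly different training sample from the same raw image, which is the point of augmentation.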

#### **4.3 Transfer learning**

**Figure 11.** *Max pooling.*

**Figure 14.** *Applying dropout in a neural net.*

Soon after people realized that CNNs are very powerful nonlinear models for computer vision problems, they started to seek insight into why these models perform so well. To this aim, researchers proposed visualization techniques that provide an understanding of what features are learned in the different layers of a CNN [28]. It turns out that the first convolutional layers are responsible for learning low-level features (edges, lines, etc.), whereas further convolutional layers learn specific shapes and even distinctive patterns (**Figure 15**).

In the early days of observing the great performance of CNNs, it was believed that one needs a very large dataset in order to use CNNs. Later, it was discovered that, since pre-trained models have already learned to distinguish some patterns, they provide great benefits for new problems and new datasets from varying domains. Transfer learning refers to training a new model by transferring weights from a related model that has already been trained.

If the dataset in our new task is small but similar to the one used in the pre-trained model, it suffices to replace the classification layer (according to our classes) and train only this last layer. However, if our dataset is big enough, we can include a few more layers (starting from the fully connected layers at the end) in our retraining scheme, which is also called fine-tuning. For instance, if a face recognition model trained on a large database is available and you would like to use that model with the faces in your company, that would constitute an ideal case for transferring the weights from the pre-trained model and fine-tuning one or two layers with your local database. On the other hand, if the dataset in our new task is not similar to the one used in the pre-trained model, then we need a larger dataset and need to retrain a larger number of layers. An example of this case is learning to classify CT (computed tomography) images using a CNN pre-trained on the ImageNet dataset. In this situation, the complex patterns (cf. **Figure 15c** and **d**) that were learned within the pre-trained model are not of much use for the new task. If the new dataset is both small and much different from that of the trained model, then users should not expect any benefit from transferring weights. In such cases users should find a way to enlarge the dataset and train a CNN from scratch using the newly collected training data. The cases that a practitioner may encounter from the transfer learning point of view are summarized in **Table 1**.


#### **Figure 15.**

*Image patches corresponding to the highest activations in a random subset of feature maps. First layer's high activations occur at patches of distinct low-level features such as edges (a) and lines (b); further layers' neurons learn to fire at more complex structures such as geometric shapes (c) or patterns on an animal (d). Since activations in the first layer correspond to small areas on images, resolution of patches in (a) and (b) is low.*


#### **Table 1.**

*Strategies of transfer learning according to the size of the new dataset and its similarity to the one used in pretrained model.*

| | **Very similar dataset** | **Very different dataset** |
|---|---|---|
| **Very little data** | Replace the classification layer | Not recommended |
| **A lot of data** | Fine-tune a few layers | Fine-tune a larger number of layers |

#### **Figure 16.**

*Example images for each class used in the experiment of transfer learning for animal classification.*

**Figure 17.** *Training and validation set accuracies obtained (a) with transfer learning and (b) without transfer learning.*

To emphasize the importance of transfer learning, let us present a small experiment where the same model is trained with and without transfer learning. Our task is the classification of animals (four classes) from their images. The classes are zebra, leopard, elephant, and bear, where each class has 350 images collected from the Internet (**Figure 16**). Transfer learning is performed using an AlexNet [23] pre-trained on the ImageNet dataset. We have replaced the classification layer, which originally had 1000 neurons (the number of classes in ImageNet), with a four-neuron layer (one for each class). In training conducted with transfer learning, we reached 98.81% accuracy on the validation set after five epochs (i.e., after seeing the dataset five times during training). Readers can observe that accuracy is quite satisfactory even after one epoch (**Figure 17a**). On the other hand, in training without transfer learning, we could reach only 76.90% accuracy even after 40 epochs (**Figure 17b**). Trying different hyperparameters (regularization strength, learning rate, etc.) might increase accuracy a little more, but this does not diminish the importance of applying transfer learning.
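The strategies of **Table 1** can be summarized as a simple decision rule. This is only an illustrative encoding of the four cases discussed above; the function name and string labels are our own, and deciding what counts as "a lot" of data or a "similar" dataset remains a judgment call for the practitioner:

```python
def transfer_strategy(has_lots_of_data, is_similar):
    """Pick a transfer learning strategy (cf. Table 1)."""
    if has_lots_of_data:
        return ("fine-tune a few layers" if is_similar
                else "fine-tune a larger number of layers")
    if is_similar:
        return "replace only the classification layer"
    # Little data and a very different domain: transfer is not
    # recommended; enlarge the dataset and train from scratch instead.
    return "not recommended"

print(transfer_strategy(has_lots_of_data=False, is_similar=True))
# replace only the classification layer
```

The animal-classification experiment above falls into the "little data, similar dataset" case, which is why replacing the classification layer alone already works so well.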

#### **5. Conclusions**

Deep learning has become the dominant machine learning approach due to the availability of vast amounts of data and improved computational resources. The main transformation was observed in text and image analysis.


In NLP, the change can be described along two major lines. The first line is learning better representations through ever-improving neural language models. Currently, the self-attention-based Transformer language model is the state of the art, and the learned representations are capable of capturing a mix of syntactic and semantic features and are context-dependent. The second line is related to neural network solutions in different NLP tasks. Although LSTMs proved useful in capturing the long-term dependencies in the nature of temporal data, the recent trend has been to transfer the pre-trained language models' knowledge into fine-tuned task-specific models. The self-attention neural network mechanism has become the dominant scheme in pre-trained language models. This transfer learning solution outperformed existing approaches in a significant way.

In the field of computer vision, CNNs are the best-performing solutions. There are very deep CNN architectures that can be fine-tuned, thanks to huge amounts of training data. The use of pre-trained models in different vision tasks is a common methodology as well.

One common disadvantage of deep learning solutions is the lack of insight due to implicit learning. Thus, the attention mechanism together with visualization seems promising in both NLP and vision tasks. Both fields are in the quest for more explainable solutions.

One final remark is on the rise of multimodal solutions. Until now, question answering has been an intersection point. Future work is expected to be devoted to multimodal solutions.

#### **Author details**

Selma Tekir\*† and Yalin Bastanlar† Computer Engineering Department, Izmir Institute of Technology, Izmir, Turkey

\*Address all correspondence to: selmatekir@iyte.edu.tr

† These authors contributed equally.

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/ by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**


[1] McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics. 1943;**5**: 115-133

[2] Rosenblatt F. The Perceptron, a Perceiving and Recognizing Automaton Project Para. Cornell Aeronautical Laboratory; 1957

[3] Images from the Rare Book and Manuscript Collections. Cornell University Library. Available from: https://digital.library.cornell.edu/catalog/ss:550351

[4] Minsky M, Papert S. Perceptrons. An Introduction to Computational Geometry. Cambridge, MA: MIT Press; 1969

[5] Werbos P. Beyond regression: New tools for prediction and analysis in the behavioral sciences [PhD thesis]. Cambridge, MA: Harvard University; 1974

[6] Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;**323**:533-536

[7] LeCun Y, Jackel LD, Boser B, Denker JS, Graf HP, Guyon I, et al. Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine. 1989; **27**(11):41-46

[8] Nielsen M. Neural Network and Deep Learning. Available from: http://neuralnetworksanddeeplearning.com/chap5.html [Accessed: 30 December 2019]

[9] Harris Z. Distributional structure. Word. 1954;**10**(2-3):146-162

[10] Hinton GE, McClelland JL, Rumelhart DE. Distributed representations. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. Cambridge, MA: MIT Press; 1986. pp. 77-109

[11] Bengio Y, Ducharme R, Vincent P, Janvin C. A neural probabilistic language model. Journal of Machine Learning Research. 2003;**3**:1137-1155

[12] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In: Workshop Proceedings of ICLR. 2013

[13] Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. CoRR. abs/1310.4546. Available from: http://arxiv.org/abs/1310.4546

[14] Le Q, Mikolov T. Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on International Conference on Machine Learning (ICML). 2014

[15] Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics (TACL). 2016;**5**:135-146

[16] Kim Y, Jernite Y, Sontag D, Rush AM. Character-aware neural language models. In: Proceedings of Thirtieth AAAI Conference on Artificial Intelligence (AAAI). 2016

[17] Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, et al. Deep contextualized word representations. In: Proceedings of NAACL. 2018

[18] Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for

language understanding. In: Proceedings of NAACL. 2019

[19] Sutskever I, Vinyals O, Le QV. Sequence to sequence learning with neural networks. In: Proceedings of Advances in Neural Information Processing Systems (NIPS). 2014

[20] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: Proceedings of 3rd International Conference on Learning Representations (ICLR). 2015

[21] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Proceedings of Advances in Neural Information Processing Systems (NIPS). 2017

[22] Russakovsky O et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV). 2015;**115**(3):211-252

[23] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Proceedings of NIPS. 2012

[24] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: Proceedings of CVPR. 2015

[25] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of CVPR. 2016

[26] Srivastava N et al. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine learning Research. 2014;**15**(1): 1929-1958

[27] Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge, MA: MIT Press; 2016

[28] Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: Proceedings of ECCV. 2014


## Contribution to Decision Tree Induction with Python: A Review

*Bouchra Lamrini*

#### **Abstract**

Among learning algorithms, one of the most popular and easiest to understand is decision tree induction. The popularity of this method rests on three appealing characteristics: interpretability, efficiency, and flexibility. Decision trees can be used for both classification and regression problems. Automatic learning of a decision tree is characterised by the fact that it uses logic and mathematics to generate rules instead of selecting them based on intuition and subjectivity. In this review, we present the essential steps for understanding the fundamental concepts and mathematics behind decision trees, from training to building. We study splitting criteria and pruning algorithms, which have been proposed to control complexity and optimize decision tree performance. A discussion around several works and tools is presented to analyze variance-reduction techniques, which do not improve or change the representation bias of a decision tree. We chose the *Pima Indians Diabetes* dataset to cover the essential questions for understanding the pruning process. The review's original contribution is to provide an up-to-date overview fully focused on implemented algorithms to build and optimize decision trees, which contributes to evolving future developments of decision tree induction.

**Keywords:** decision tree, induction learning, classification, pruning, bias-variance trade-off

#### **1. Introduction**

Decision tree induction is among the best-known and most developed machine learning methods, often used in data mining and business intelligence for prediction and diagnostic tasks [1–4]. It is used in classification problems, regression problems, or time-dependent prediction. The main strength of decision tree induction is its interpretability. It is a graphical method designed for problems involving a sequence of decisions and successive events. More precisely, its results formalise the reasoning an expert would follow to reproduce the sequence of decisions and find a characteristic of an object. The main advantage of this type of model is that a human being can easily understand and reproduce the decision sequence to predict the target category of a new instance. The results provide a graphic structure or a base of rules that facilitates understanding and corresponds to human reasoning.

Learning by decision tree is part of supervised learning, where the class of each object in the database is given. The goal is to build a model from a set of examples

associated with the classes, to find a description for each class from the properties common to the examples. Once this model has been built, we can extract a set of classification rules. The extracted rules are then used to classify new objects whose class is unknown. Classification is done by travelling a path from the root to a leaf. The class returned (the default class) is the one that is most frequent among the examples at that leaf. At each internal node (decision node) of the tree, there is a test (question) which corresponds to an attribute in the learning base, and a branch corresponding to each of the possible values of the attribute. At each leaf node, there is a class value. A path from the root to a node therefore corresponds to a series of attributes (questions) with their values (answers). This flowchart-like structure with recursive partitioning helps the user in decision-making. It is this visualisation that easily mimics human-level thinking. That is why decision trees are easy to understand and interpret.
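The root-to-leaf classification just described can be sketched in a few lines of Python. The node layout, the glucose/bmi attributes (echoing the *Pima Indians Diabetes* data used later in the review), and the toy tree are illustrative assumptions, not the chapter's implementation:

```python
# Minimal sketch of classifying an object by travelling a path from the
# root to a leaf. Attribute names and the example tree are hypothetical.

class Node:
    def __init__(self, attribute=None, threshold=None, left=None, right=None,
                 label=None):
        self.attribute = attribute    # test: is obj[attribute] <= threshold?
        self.threshold = threshold
        self.left = left              # branch taken when the test is true
        self.right = right            # branch taken when the test is false
        self.label = label            # class value stored at a leaf node

def predict(node, obj):
    """Propagate an object down the tree; the class returned is the one
    attached to the leaf the object reaches."""
    while node.label is None:                      # internal decision node
        if obj[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label

# Example: a tree with two tests over hypothetical attributes.
tree = Node("glucose", 120.0,
            left=Node(label="negative"),
            right=Node("bmi", 30.0,
                       left=Node(label="negative"),
                       right=Node(label="positive")))

print(predict(tree, {"glucose": 150.0, "bmi": 35.0}))  # -> positive
```

Each internal node plays the role of a question on one attribute, and each leaf carries the default class, exactly as in the flowchart description above.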


*DOI: http://dx.doi.org/10.5772/intechopen.92438*



Another advantage of decision tree induction is its ability to automatically identify the most discriminating features for a use case, i.e., the most representative data inputs for a given task. This is explained by its flexibility and autonomy as a model that makes few assumptions on the hypothesis space. It is an approach that remains particularly useful for input space problems and a powerful tool able to handle very large-scale problems, which makes it particularly useful in big data mining. However, it is generally less accurate than other machine learning models such as neural networks. In brief, this learning algorithm has the following three essential characteristics:

• Interpretability: Because of its flowchart-like structure, the way attributes interact to give a prediction is very readable.

• Efficiency: The induction process is done by a top-down algorithm which recursively splits terminal nodes of the current tree until they all contain elements of only one class. Practically, the algorithm is very fast in terms of running time and can be used on very large datasets (e.g. of millions of objects and thousands of features).

• Flexibility: This method does not make any hypothesis about the problem under consideration. It can handle both continuous and discrete attributes. Predictions at leaf nodes may be symbolic or numerical (in which case, trees are called regression trees). In addition, the tree induction method can be easily extended by improving tests at tree nodes (e.g. introducing linear combinations of attributes) or providing a prediction at terminal nodes by means of another model.

The review is organised into three parts. The first aims at introducing a brief history of decision tree induction. We present the mathematical basics and the search strategy used to train and build a decision tree. We discuss the supervised learning problem and the trade-off between a model's ability to minimise bias and variance. In this regard, we extend our investigation to fundamental aspects, such as ensemble meta-algorithms and pruning methods, which must be put forward for building an optimal decision tree. In the second section, we introduce some results obtained by means of the *Scikit-Learn Python* modules and the *Pima Indians Diabetes* data in order to feed our discussions and our perspectives in terms of future developments and applications in the Python community. The third section is devoted to improvements of decision tree induction aimed at increasing its performance. We have collected some technical discussions that we raise given our experience in Research and Development (R&D). Finally, the conclusions give a general synthesis of the survey and discuss some ideas for future works.

#### **2. A brief history of decision tree induction**

There are many induction systems that build decision trees. Hunt et al. [5] were the first in this field to study machine learning using examples. Their concept learning system (CLS) framework builds a decision tree that tries to minimise the cost of classifying an object. There are two types of costs: (1) the cost of determining the value of a property of the object *OA* exhibited by the object and (2) the misclassification cost of deciding that the object belongs to class *C* when its real class is *K*. The CLS method uses a strategy called *Lookahead* which consists of exploring the space of all possible decision trees to a fixed depth and choosing an action to minimise the cost in this limited space and then moving one level down in the tree. Depending on the depth of the *Lookahead* chosen, CLS can require a substantial amount of computation but has been able to unearth subtle patterns in the objects shown to it.

Quinlan [6] proposed Iterative Dichotomiser 3 (ID3), which takes up certain concepts of CLS. ID3 was developed following a challenging induction task on chess endgames posed by Donald Michie. The analogue concept learning system (ACLS) [7] is a generalisation of ID3. CLS and ID3 require that each attribute used to describe an object takes its values in a finite set. In addition to this type of attribute, ACLS allows the use of attributes whose values can be integers. ASSISTANT [8], a descendant of ID3, allows the use of continuous attributes and builds a binary decision tree. ASSISTANT avoids overfitting by using a pruning technique, which resulted in ASSISTANT-86 [9]. Another descendant of ID3 is [10, 11], which will be explained later.
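The ID3 family selects the test at each node by information gain, i.e. the reduction in Shannon entropy achieved by a split (the entropy criterion is discussed further in Section 2.1). A minimal, self-contained sketch, with a toy label set chosen for illustration:

```python
# Shannon entropy and information gain, as used by ID3-style test selection.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partition):
    """Entropy reduction obtained by splitting `labels` into the disjoint
    sub-samples listed in `partition`."""
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in partition)
    return entropy(labels) - remainder

labels = ["+", "+", "+", "-", "-", "-"]
split = [["+", "+", "+"], ["-", "-", "-"]]   # a perfect separation
print(information_gain(labels, split))       # -> 1.0
```

A perfect two-way separation of a balanced binary sample recovers the full one bit of uncertainty, which is why the test selection procedure prefers it over any impure split.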

There is another family of induction systems, such as the AQ star algorithm [12], which induces a set of decision rules from a base of examples. AQ builds a function *R* that covers positive examples and rejects negative ones. CN2 [13] learns a set of unordered "IF-THEN" rules from a set of examples. To do so, CN2 performs a top-down search (from general to specific) in the rule space, looking for the best rule, then removes the examples covered by this rule and repeats this process until no good rule is found. CN2's strategy is similar to that of AQ in that it eliminates the examples covered by the discovered rule, but it differs in that it specialises a starting rule instead of generalising it.
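The covering loop shared by AQ and CN2 (find the best rule, drop the examples it covers, repeat) can be sketched as follows. Restricting rules to a single attribute-value test and the `min_precision` stopping threshold are simplifying assumptions for illustration; real CN2 beam-searches conjunctions of conditions:

```python
# Sequential covering in the spirit of AQ/CN2: learn one rule, remove the
# examples it covers, repeat until no sufficiently good rule remains.
# Assumes every example exposes the same attributes.

def learn_rules(examples, target="+", min_precision=0.8):
    rules = []
    remaining = list(examples)           # (attribute-dict, label) pairs
    while True:
        candidates = {(a, v) for obj, _ in remaining for a, v in obj.items()}
        best, best_precision = None, min_precision
        for attr, val in candidates:
            covered = [lab for obj, lab in remaining if obj[attr] == val]
            precision = covered.count(target) / len(covered)
            if precision > best_precision:
                best, best_precision = (attr, val), precision
        if best is None:                 # no good rule found: stop
            break
        rules.append(f"IF {best[0]} == {best[1]!r} THEN {target}")
        remaining = [(o, l) for o, l in remaining if o[best[0]] != best[1]]
    return rules

data = [({"outlook": "sunny"}, "+"), ({"outlook": "sunny"}, "+"),
        ({"outlook": "rain"}, "-")]
print(learn_rules(data))   # -> ["IF outlook == 'sunny' THEN +"]
```

The loop terminates either when every positive example is covered or when no remaining candidate rule is precise enough, mirroring CN2's "until no good rule is found" condition.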

Statisticians have attributed the authorship of decision tree building to Morgan and Sonquist [1], who were the first researchers to introduce the automatic interaction detector (AID) method. This method applies to learning problems whose attribute to predict (the class) is quantitative. It works sequentially and is independent of the extent of linearity in the classifications or the order in which the explanatory factors are introduced. Morgan and Sonquist were among the first to use decision trees and among the first to use regression trees.

Several extensions have been proposed: theta AID (THAID) [2] and chi-squared AID (CHAID) [3], which uses the chi-square independence measure to choose the best partitioning attribute. There is also the method proposed by [4], called classification and regression tree (CART), which builds a binary decision tree using the feature and threshold that yield the largest information gain at each node.
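CART's exhaustive search for the best feature-threshold pair can be sketched as follows. This toy version uses the Gini index (CART's usual impurity measure, as noted in Section 2.1) and plain Python lists; it is a sketch of the idea, not any real CART implementation:

```python
# CART-style split selection: try every (feature, threshold) pair and keep
# the one with the largest impurity decrease.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(X, y):
    """X is a list of numeric feature vectors, y the class labels.
    Returns (feature index, threshold, impurity decrease)."""
    parent, n = gini(y), len(y)
    best = (None, None, 0.0)
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= threshold]
            right = [y[i] for i, row in enumerate(X) if row[f] > threshold]
            if not left or not right:        # degenerate split: skip
                continue
            gain = parent - (len(left) / n * gini(left)
                             + len(right) / n * gini(right))
            if gain > best[2]:
                best = (f, threshold, gain)
    return best

X = [[1.0], [2.0], [3.0], [4.0]]
y = ["a", "a", "b", "b"]
print(best_binary_split(X, y))   # -> (0, 2.0, 0.5)
```

Applying the same search recursively to the two resulting sub-samples is what produces CART's binary tree.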

Quinlan [11] then proposed the *C4.5* algorithm for the IT community. C4.5 removed the restriction that attributes must be categorical by dynamically defining a discrete attribute based on numerical variables. This discretization process splits the continuous attribute values into a discrete set of intervals. C4.5 then converts the trees generated at the end of the learning step into sets of if-then rules. The accuracy of each rule is taken into account to determine the order in which the rules must be applied. Pruning is performed by removing a rule's precondition if the precision of the rule improves without it.

Many decision tree algorithms have been developed over the years, for example, SPRINT by Shafer et al. [14] and SLIQ by Mehta et al. [15]. One of the studies comparing decision trees and other learning algorithms was carried out by Tjen-Sien et al. [16]. The study shows that C4.5 has a very good combination of error rate and speed. C4.5 assumes that the training data fits in memory. Gehrke et al. [17] proposed Rainforest, an approach to developing a fast and scalable algorithm. In [18], Kotsiantis presents a synthesis of the main basic problems of decision trees and current research work. The references cited cover the main theoretical problems that can lead the researcher into interesting directions of research and suggest possible combinations of biases to explore.
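As noted below, the scikit-learn implementation of CART does not accept categorical variables, so class labels must first be mapped to integers between 0 and nbClasses-1. A stdlib sketch of that encoding (scikit-learn ships `LabelEncoder` for the same purpose); the helper name and toy labels are illustrative:

```python
# Map string class labels to integers 0..nbClasses-1, as required before
# fitting scikit-learn's tree estimators on label data.

def encode_labels(labels):
    classes = sorted(set(labels))                  # stable class ordering
    index = {c: i for i, c in enumerate(classes)}  # class -> 0..nbClasses-1
    return [index[lab] for lab in labels], classes

encoded, classes = encode_labels(["pos", "neg", "pos", "neg"])
print(encoded, classes)   # -> [1, 0, 1, 0] ['neg', 'pos']
```

Keeping the `classes` list around lets predictions be decoded back to the original label names.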

#### **2.1 Mathematical basics and search strategy**

The automatic learning of the rules in a decision tree consists in separating the learning objects into disjoint sub-samples of objects (which have no elements in common) where the majority of objects ideally have the same value for the output variable, i.e. the same class in the case of a classification problem. Each internal node performs a test on an input attribute. This test is determined automatically based on the initial training sample and according to test selection procedures that differ from one tree induction algorithm to another. For attributes with numerical values (or after encoding the data), the test consists in comparing the value of an attribute with a numerical value called the discretization threshold. Depending on the algorithm used, the terminal nodes of the tree are labelled either by the majority class of the objects in the training sample which reached this leaf following successive separations or by a distribution of class probabilities given by the frequency of these objects in each class.

As indicated above, the main learning algorithms using decision trees are C4.5 [11] and CART [4]. The CART algorithm is very similar to C4.5, except for a few properties [19, 20], but it differs in that it supports numerical target variables (regression) and does not compute rule sets. The CART algorithm can be used to construct classification and regression decision trees, depending on whether the dependent variable is categorical or numeric. It also handles missing attribute values. The decision tree built by the CART algorithm is always a binary decision tree (each node has only two child nodes), also called hierarchical optimal discriminant analysis (HODA). The impurity (or purity) measure used by CART is the Gini index (C4.5 uses the notion of entropy) for classification tasks. In regression tasks, the fit method takes input and target arguments as in the classification setting, except that the target is expected to have floating-point (continuous) values instead of integer values. For a leaf *Li*, common criteria to minimise when determining locations for future splits are the mean squared error (MSE), which minimises the *Li+1* error using mean values at terminal nodes, and the mean absolute error (MAE), which minimises the *Li* error using median values at terminal nodes.

Several software packages for building decision trees are available, most of them referenced in the literature. We cite the chi-squared automatic interaction detector (CHAID) method implemented in the SIPINA<sup>1</sup> tool, which seeks to produce a tree of limited size, allowing to initiate a data exploration. WEKA<sup>2</sup> uses the C4.5 algorithm, and there is no need to discretize any of the attributes. The scikit-learn Python library uses an optimised version of the CART algorithm. The current (version 0.22.1) implementation of the scikit-learn library does not support categorical variables; a data encoding is mandatory at this stage (the labels are transformed into a value between 0 and nbClasses-1). The algorithm options are described in the Python documentation<sup>3</sup>.

<sup>1</sup> http://eric.univ-lyon2.fr/~ricco/sipina.html

<sup>2</sup> https://www.cs.waikato.ac.nz/ml/weka/

<sup>3</sup> https://scikit-learn.org/stable/modules/tree.html#

The algorithm below generally summarises the learning phase of a decision tree, which begins at the top of the tree with a root node containing all the objects of the learning set:

In order for the tree to be easily interpreted, its size must be minimal. Thus, the test selection procedure applied at each node aims to choose the test (the attribute-threshold pair) which separates the objects of the current sub-sample in an optimal way, i.e. which reduces the uncertainty linked to the output variable within the successor nodes. An entropy measurement (a score based on a normalisation of the Shannon information measure) allows evaluation of the gain of information provided by the test carried out. Once the model has been built, we can infer the class of a new object by propagating it in the tree from top to bottom according to the tests performed. The chosen test separates the current sample of objects into two sub-samples which are found in the successors of this node. Each test at a node makes it possible to direct any object to one of the two successors of this node according to the value of the attribute tested at this node. In other words, a decision tree is seen as a function which attributes to any object the class associated with the terminal node to which the object is directed following the tests at the internal nodes of the tree. **Figure 1** illustrates an example using two input attributes with the partitioning of the input space it implies.

The induction algorithm continues to develop the tree until the terminal nodes contain sub-samples of objects that have the same output value. The label associated with a leaf in the tree is determined from the objects in the learning set that have been directed to this leaf. The majority class among the classes of these objects can be used, or even the distribution of class probabilities if a stop criterion has interrupted development before reaching "pure" nodes.

The principal objective of an induction algorithm is to build on the learning data a simpler tree whose reliability is maximum, i.e. whose classification error rate is minimal. However, a successful and very precise model on the learning set is not necessarily generalizable to unknown examples (objects), especially in the presence of noisy data. In this case, two sources of error, expressed in the form of bias (difference between the real value and the estimated value) and variance, can generally influence the precision of a model. Several bibliographic analyses ([21]

possible combinations of biases to explore.

*Data Mining - Methods, Applications and Systems*

by frequency of these objects in each class.

using median values at terminal nodes.

<sup>1</sup> http://eric.univ-lyon2.fr/�ricco/sipina.html <sup>2</sup> https://www.cs.waikato.ac.nz/ml/weka/

**24**

**2.1 Mathematical basics and search strategy**

0.22.1) implementation of scikit-learn library does not support categorical variables. A data encoding is mandatory at this stage (the labels transform into a value between 0 and nbClasses-1). The algorithm options are described in the Python documentation<sup>3</sup> .
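As a small illustration of this encoding step (a plain-Python sketch; scikit-learn's `LabelEncoder` performs the same mapping, and the function name here is ours):

```python
def encode_labels(values):
    """Map each distinct label to an integer between 0 and nbClasses-1,
    as required by scikit-learn's CART implementation."""
    classes = sorted(set(values))  # deterministic label order
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Three classes are mapped to the integers 0, 1 and 2:
encoded, mapping = encode_labels(["low", "high", "medium", "low"])
# encoded == [1, 0, 2, 1], mapping == {"high": 0, "low": 1, "medium": 2}
```

After such an encoding, the resulting integer columns can be passed directly to the tree estimator's fit method.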

The learning phase of a decision tree can generally be summarised as follows: building begins at the top of the tree, with a root node containing all the objects of the learning set, which is then split recursively until a stopping criterion is met.
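This top-down scheme can be sketched in plain Python (a hedged illustration; the function names, the axis-aligned threshold tests and the use of the Gini index, discussed later in this chapter, are our assumptions):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Pick the (attribute, threshold) pair with the lowest weighted impurity."""
    best = None  # (score, attribute index, threshold)
    for a in range(len(rows[0])):
        for threshold in sorted({r[a] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[a] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[a] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, threshold)
    return best

def build_tree(rows, labels):
    """Recursive top-down induction: the root receives the whole learning set."""
    if len(set(labels)) == 1:  # pure node -> leaf
        return labels[0]
    split = best_split(rows, labels)
    if split is None:  # no useful test -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    _, a, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    return (a, t,
            build_tree([r for r, _ in left], [y for _, y in left]),
            build_tree([r for r, _ in right], [y for _, y in right]))

def predict(tree, row):
    """Propagate an object from the root downwards according to the tests."""
    while isinstance(tree, tuple):
        a, t, left_child, right_child = tree
        tree = left_child if row[a] <= t else right_child
    return tree
```

Each recursive call receives the sub-sample of objects directed to the node, and recursion stops when a node is pure or no test separates the objects.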

In order for the tree to be easily interpreted, its size must be minimal. Thus, the test selection procedure applied at each node aims to choose the test (the attribute-threshold pair) which separates the objects of the current sub-sample in an optimal way, i.e. which reduces the uncertainty linked to the output variable within the successor nodes. An entropy measurement (a score based on a normalisation of the Shannon information measure) makes it possible to evaluate the information gain provided by the test carried out. Once the model has been built, we can infer the class of a new object by propagating it in the tree from top to bottom according to the tests performed. The chosen test separates the current sample of objects into two sub-samples which are found in the successors of this node: each test at a node directs any object to one of the two successors of that node according to the value of the attribute tested there. In other words, a decision tree can be seen as a function which attributes to any object the class associated with the terminal node to which the object is directed by the tests at the internal nodes of the tree. **Figure 1** illustrates an example using two input attributes, with the partitioning of the input space it implies.
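The entropy-based evaluation of a candidate test can be illustrated as follows (a plain-Python sketch with illustrative names; the normalisation step mentioned above is omitted for brevity):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, left_labels, right_labels):
    """Uncertainty reduction obtained by separating the parent sub-sample
    into the two successor nodes of a candidate test."""
    n = len(parent_labels)
    weighted = (len(left_labels) / n) * entropy(left_labels) \
             + (len(right_labels) / n) * entropy(right_labels)
    return entropy(parent_labels) - weighted

# A perfectly separating test removes all uncertainty:
# the gain equals the parent entropy (1 bit for two balanced classes).
gain = information_gain(["a", "a", "b", "b"], ["a", "a"], ["b", "b"])
```

The test selection procedure would evaluate this gain for every candidate attribute-threshold pair and keep the maximiser.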

The induction algorithm continues to develop the tree until the terminal nodes contain sub-samples of objects that share the same output value. The label associated with a leaf of the tree is determined from the objects of the learning set that have been directed to this leaf: either the majority class among the classes of these objects, or the distribution of class probabilities if a stopping criterion has interrupted development before reaching "pure" nodes.

The principal objective of an induction algorithm is to build, from the learning data, a simple tree whose reliability is maximal, i.e. whose classification error rate is minimal. However, a successful and very precise model on the learning set is not necessarily generalizable to unknown examples (objects), especially in the presence of noisy data. In this case, two sources of error, expressed in the form of bias (the difference between the real value and the estimated value) and variance, can influence the precision of a model. Several bibliographic analyses ([21] and the references cited in this work, [22]) have shown that decision trees suffer from a significant variance which penalises the precision of this technique. A tree may be too large because too many test nodes are determined at the bottom of the tree on sub-samples of objects of statistically unreliable size. The choice of tests (attributes and thresholds) at the internal nodes of a decision tree can also vary from one sample to another, which contributes to the variance of the models built. For these reasons, the aim of the criteria for stopping the development of a tree, or of simplification techniques such as pruning procedures, is to find a good compromise between the complexity of the model and its reliability on an independent sample. These techniques can only improve the first source of error (bias) mentioned above. Different variance reduction techniques are proposed in the literature, notably ensemble meta-algorithms such as bagging, random forests, extra-trees and boosting.

<sup>3</sup> https://scikit-learn.org/stable/modules/tree.html#

*Contribution to Decision Tree Induction with Python: A Review*
*DOI: http://dx.doi.org/10.5772/intechopen.92438*

**Figure 1.**
*An example of a decision tree and the partition it implies (figure taken from the https://www.kdnuggets.com/ website).*

The ensemble meta-algorithms are effective in combination with decision trees. These methods differ in the way they adapt the original tree induction algorithm and/or aggregate the results. Bagging, random forests and extra-trees have several similarities: they independently build *T* constitutive trees, and the predictions of the different trees are aggregated as follows. Each tree produces a vector of class probabilities; the *T* probability vectors are added into a weight vector, and the class that receives the most weight according to this vector is assigned to the object. Note that these three methods use a random component, so their precision can vary slightly from one execution to another. Unlike these three methods, boosting produces the set of trees sequentially (and deterministically) and uses a different aggregation procedure. These methods have been successfully applied to numerous applications, notably in bioinformatics [23] and in networks [24]. Maree [22] presents a bibliographical analysis of these methods; his work covers the problem of automatic image classification using sets of random trees combined with a random extraction of sub-windows of pixel values.

#### **2.2 Pruning**

Pruning is a model selection procedure, where the models are the pruned sub-trees of the maximal tree *T*0. Let T be the set of all binary sub-trees of *T*0 having the same root as *T*0. This procedure minimises a penalised criterion where the penalty is proportional to the number of leaves in the tree [25]. Defining the optimal size of a decision tree consists either in stopping its growth (*pre-pruning*) or in reducing the maximal tree to its best pruned sub-tree in the sense of the generalisation error (*post-pruning*), i.e. improving the predictive aspect of the tree, on the one hand, and reducing its complexity, on the other. To this end, several pruning methods have been developed, such as:

• *Minimal cost complexity pruning* (*MCCP*), also called *post-pruning*, for the CART algorithm [4]. This method consists in constructing a nested sequence of sub-trees using a formulation called minimum cost complexity. In Section 2.2.2, we detail the general concept of this method, which the Scikit-Learn library adopted in its implementation.

• *Reduced error pruning* (*REP*) consists in estimating the real error of a given subtree on a pruning or test set. The pruning algorithm proceeds as follows: "As long as there is a subtree that can be replaced by a leaf without increasing the estimate of the real error, prune this subtree". This technique gives a slightly congruent tree in the sense that some examples may be misclassified. The study of Elomaa and Kääriäinen [26] presents a detailed analysis of the REP method. In this analysis, the two authors note that the REP method was introduced by Quinlan [27] but that the latter never presented it in an algorithmic way, which is a source of confusion. Even though REP is considered a very simple, almost trivial pruning algorithm, many different algorithms bear the same name. There is no consensus on whether *REP* is a bottom-up algorithm or an iterative method. Moreover, it is not apparent whether the training or the pruning set is used to determine the labels of the leaves that result from pruning.

• *Pessimistic error pruning* (*PEP*). In order to overcome the disadvantages of the previous method, Quinlan [27] proposed a pruning strategy which uses a single set for both construction and pruning of the tree. The tree is pruned by examining the error rate at each node and assuming that the true error rate is considerably worse. If a given node contains *N* records of which *E* are misclassified, then the error rate is estimated at *E/N*. The central concern of the algorithm is to minimise this estimate, by considering this error rate as a very optimistic version of the real error rate [28, 29].

• Minimum error pruning (MEP) was proposed by Niblett and Bratko [30], critical value pruning (CVP) by Mingers [31] and error-based pruning (EBP) by Quinlan, as an improvement of the PEP method, for the algorithm C4.5.

#### *2.2.1 Pre-pruning*

Pre-pruning consists in fixing a stopping rule which makes it possible to stop the growth of a tree during the learning phase, by fixing a local stopping criterion which evaluates the informational contribution of the segmentation relating to the node being processed. The CHAID algorithm [32] is based on this principle, accepting a segmentation if the measure of information gain (the *χ*<sup>2</sup> difference in independence or the *t* of Tschuprow [3]) calculated on a node is significantly higher than a chosen threshold. According to Rakotomalala et al. [32, 33], this formalisation involves a statistical hypothesis test: the null hypothesis is the independence of the segmentation variable from the class attribute. If the calculated *χ*<sup>2</sup> is higher than the theoretical threshold corresponding to the risk of the first kind that we have set (respectively, if the calculated p-value is lower than the risk of the first kind), we accept the segmentation.

One of the cons of this algorithm is that it prematurely stops the building process of the tree. Furthermore, the use of the statistical test is considered critical. This is a classic independence test whose tested variable is produced at the end of several optimisation stages: the search for the optimal discretization point for continuous variables, and then the search for the segmentation variable which maximises the measure used. The statistical law is no longer the same from one stage to another. The correction of the test by the introduction of certain procedures known as the *Bonferroni* correction [33] is recommended, but in practice, this type of correction does not lead to an improvement in classification performance. We also cite the work of [34], which proposes two pruning approaches: the first is a method of simplifying rules by a test of statistical independence, modifying the pruning mechanism of the CHAID algorithm, and the second uses validation criteria inspired by the discovery technique of association rules.
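A hedged sketch of this CHAID-style acceptance rule in plain Python (the 2×2 contingency table, the hand-computed *χ*<sup>2</sup> statistic and the 5% critical value 3.841 for one degree of freedom are our illustrative assumptions):

```python
def chi2_statistic(table):
    """Pearson chi-squared statistic of independence for a 2x2 contingency
    table (rows: the two branches of the candidate split, columns: classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def accept_segmentation(table, critical_value=3.841):
    """Accept the split if chi2 exceeds the theoretical threshold for the
    chosen risk of the first kind (3.841 is the 5% value for 1 degree of
    freedom)."""
    return chi2_statistic(table) > critical_value

# Branch membership strongly associated with the class -> split accepted:
accepted = accept_segmentation([[18, 2], [3, 17]])
```

With a p-value formulation, one would instead compare the p-value of the statistic against the risk of the first kind, as described above.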


The depth (maximum number of levels) of the tree and the minimum number of observations below which no further segmentation is attempted also remain two practical options that can be fixed at the start of learning to manage the complexity of the model. However, the choice of these parameters remains a critical step in the tree building process, because the final result depends on the parameters chosen. To this is added the fact that the evaluation is local (limited to a node) and takes little account of the global evaluation of the tree. It is therefore necessary to propose a rule which is neither too restrictive nor too permissive, so as to obtain a suitable tree that is neither undersized nor oversized.
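These two options amount to a local stopping predicate evaluated at each node during growth (a hedged plain-Python sketch; the parameter names echo scikit-learn's, but the function itself is our illustration):

```python
def should_stop(depth, labels, max_depth=5, min_samples_split=20):
    """Local pre-pruning rule: declare the node terminal when the branch is
    already pure, has reached the maximum depth, or is supported by too few
    observations to attempt another segmentation."""
    return (len(set(labels)) == 1           # pure node
            or depth >= max_depth           # maximum number of levels reached
            or len(labels) < min_samples_split)  # too few objects to split

# Evaluated locally at every node; a small pure branch is closed immediately:
stop = should_stop(depth=3, labels=["a"] * 8)
```

Because the rule is local, a permissive setting can oversize the tree and a restrictive one can undersize it, exactly the trade-off discussed above.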

The algorithm for building a binary decision tree using CART browses, at each node, the *m* attributes *x*1, *x*2, … , *xm* one by one, starting with *x*1 and continuing up to *xm*. For each attribute, it explores all the possible tests (splits) and chooses the best split (dichotomy), i.e. the one which maximises the reduction in impurity; it then compares the *m* best splits to choose the best of them. The function that measures impurity should reach its maximum when the instances are evenly distributed between the different classes, and its minimum when a single class contains all the examples (the node is pure). There are different functions which satisfy these properties. The function used by the CART algorithm is the Gini function (Gini impurity index). The Gini function on a node *t* with a distribution of class probabilities *P*(*c*|*t*), *c* = 1, …, *k*, on this node is:

$$G(t) = \phi(P(1|t), P(2|t), \dots, P(k|t)) = \sum_{c} P(c|t)\,(1 - P(c|t)) \tag{1}$$

If a split *s* on a node *t* splits the subset associated with this node into two subsets, left *tG* with a proportion *pG* and right *tD* with a proportion *pD*, we can define the impurity reduction measure as follows:

$$\Delta G(s, t) = G(t) - p_G * G(t_G) - p_D * G(t_D) \tag{2}$$

On each node, if the set of candidate splits is *S*, the algorithm searches for the best split *s*<sup>∗</sup> such that:

$$\Delta G(s^*, t) = \max_{s \in S} \Delta G(s, t) \tag{3}$$

Suppose we got some *splits* and came up with a set of terminal nodes *T̃*. The set of *splits*, used in the same order, determines the binary tree *T*. We have *I*(*t*) = *G*(*t*)*p*(*t*). So the impurity function on the tree is:

$$I(T) = \sum_{t \in \tilde{T}} I(t) = \sum_{t \in \tilde{T}} G(t) * p(t) \tag{4}$$

*G*(*t*) is the impurity measure on the node *t*, and *p*(*t*) is the probability that an instance belongs to the node *t*.

It is easy to see that the selection of the splits which maximise *ΔI*(*s*, *t*) is equivalent to the selection of the splits which minimise the impurity *I*(*T*) over all the trees. If we take any node *t* ∈ *T̃* and we use a split *s* which partitions the node into two parts *tD* and *tG*, the new tree *T*′ has the following impurity:

$$I(T') = \sum_{t' \in \tilde{T} \setminus \{t\}} I(t') + I(t_D) + I(t_G) \tag{5}$$

Because we have partitioned the subset arrived at *t* into *tD* and *tG*, the reduction of the impurity of the tree is therefore:

$$I(T) - I(T') = I(t) - I(t_D) - I(t_G) \tag{6}$$

It only depends on the node *t* and the split *s*. So, to maximise the reduction of impurity in the tree at a node *t*, we maximise:

$$\Delta I(s, t) = I(t) - I(t_G) - I(t_D) \tag{7}$$

The proportions are defined as follows: *pD* = *p*(*tD*)/*p*(*t*), *pG* = *p*(*tG*)/*p*(*t*) and *pG* + *pD* = 1. So, Eq. (7) can be written as follows:

$$\Delta I(s, t) = [G(t) - p_G * G(t_G) - p_D * G(t_D)] * p(t) = \Delta G(s, t) * p(t) \tag{8}$$

Since *p*(*t*) is the only difference between *ΔI*(*s*, *t*) and *ΔG*(*s*, *t*), the same split *s*<sup>∗</sup> maximises both expressions.

The stop splitting criterion used by CART was very simple: for a threshold *β* > 0, a node is declared terminal (a leaf) if *max ΔI*(*s*, *t*) ≤ *β*. The algorithm assigns to each terminal node the most probable class.

#### *2.2.2 Post-pruning*

Post-pruning is a procedure that appeared with the CART method [4] and was very widely taken up in different forms thereafter. The principle is to build the tree in two phases. (1) The first expansion phase consists in producing the purest possible trees, in which all segmentations are accepted even if they are not relevant. (2) In the second phase, we try to reduce the tree by using another criterion to compare trees of different sizes. The building time of the tree is of course longer, which can be penalising when the database is very large, but the objective is to obtain a tree that performs better in the classification phase.

The idea introduced by Breiman et al. [4] is to construct a sequence of trees *T*0, *..*, *Ti*, *..*, *Tt* which minimise a function called the *cost complexity metric* (previously mentioned). This function combines two factors, the classification error rate and the number of leaves in the tree, using an *α* parameter.
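This weakest-link selection can be sketched in plain Python (a hedged illustration; the node bookkeeping and names are our assumptions): for each candidate pruning position we form the ratio of additional empirical errors to the number of leaves removed, and prune the position with the smallest ratio first.

```python
def alpha(extra_errors, n_leaves):
    """Cost-complexity ratio: additional empirical errors incurred by replacing
    the subtree rooted at a position with a leaf, per leaf removed."""
    return extra_errors / (n_leaves - 1)

def weakest_link(candidates):
    """candidates: {position: (extra_errors, n_leaves_of_subtree)}.
    Return the position whose alpha is smallest, i.e. the cheapest
    simplification of the current tree."""
    return min(candidates, key=lambda p: alpha(*candidates[p]))

# Pruning "p2" costs only 1 extra error for 3 leaves removed, so it goes first:
node_to_prune = weakest_link({"p1": (4, 3), "p2": (1, 4), "p3": (6, 2)})
# node_to_prune == "p2"
```

Repeating this selection yields the nested sequence of trees whose error rates are then estimated by cross-validation or on a separate test set.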

For each internal node *p* of the tree *T*0, the relationship is defined as:

$$\alpha(p) = \frac{\Delta R_{emp}^{S}}{|T_p| - 1} \tag{9}$$

7.Diabetes pedigree function (*DPF*)

*DOI: http://dx.doi.org/10.5772/intechopen.92438*

*Contribution to Decision Tree Induction with Python: A Review*

the profile class function (**Figure 2**).

information gain.

regression.

sample\_weight is passed.

**Figure 2.**

**31**

The last column of the dataset indicates if the person has been diagnosed with

Without any data preparation step (cleaning, missing values processing, etc.), we partitioned the dataset into a training data (75%) to build the tree and test data (25%) for prediction. Then we kept the default settings which we can see through

The Scikit-Learn documentation<sup>8</sup> explains in detail how to use each parameter and offers other modules and functions to search information and internal structures of classifier from training to building step. Among these parameters, we

• *criterion*: Optional (default = "gini"). This parameter allows to measure the quality of a split, use the different-different attribute selection measure and supports two criteria, "gini" for the Gini index and "entropy" for the

• max\_depth: Optional (default = None), the maximum depth of a tree. If None,

min\_samples\_split samples. A higher value of maximum depth causes

• min\_samples\_leaf: Optional (default = 1), the minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min\_samples\_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in

• min\_impurity\_decrease: Optional (default = 0.0). A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

where *N* is the total number of samples, *Nt* is the number of samples at the current node, *NtL* is the number of samples in the left child and *NtR* is the number of samples in the right child. *N*, *Nt*, *NtR* and *NtL* all refer to the weighted sum, if

<sup>8</sup> https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

*mpurity NtL=Nt* ∗ *lefti mpurity* (10)

highlight in this review the following four we use to optimise the tree:

then nodes are expanded until all the leaves contain less than

The weighted impurity decrease equation is the following:

For each internal node *p* of the tree, the relationship is defined as:

$$\alpha(p) = \frac{\Delta R^{S}_{emp}}{|T_p| - 1} \tag{9}$$

where Δ*R<sup>S</sup><sub>emp</sub>* is the number of additional errors that the decision tree makes on the set of samples *S* when we prune it at position *p*, and ∣*T<sub>p</sub>*∣ − 1 measures the number of leaves removed. The tree *T<sub>i+1</sub>* is obtained by pruning *T<sub>i</sub>* at the node which has the smallest value of the *α*(*p*) parameter. We thus obtain a sequence *T*<sub>0</sub>, …, *T<sub>i</sub>*, …, *T<sub>t</sub>* of elements of *T*, the last of which is reduced to a leaf. To estimate the error rate for each tree, the authors suggest two different methods, one based on cross-validation and the other on a separate test set.

#### **3. Decision tree classifier building in Scikit-Learn**

Today there are several popular machine learning websites that offer tutorials showing how decision trees work using the different modules of Python. We quote for example three of them: Towards Data Science,<sup>4</sup> KDnuggets,<sup>5</sup> and Kaggle.<sup>6</sup> Developers show, in a few lines of optimised code, how to use the decision tree method, covering various topics such as attribute selection measures, information gain, how to optimise decision tree performance, etc.

From our side, we chose the *Pima Indians Diabetes* dataset (often used in classification problems) to examine the various tuning parameters proposed as arguments by the Scikit-Learn package. The Pima are a group of Native Americans living in Arizona. A genetic predisposition allowed this group to survive normally for years on a diet poor in carbohydrates. In recent years, a sudden shift from traditional agricultural crops to processed foods, together with a decline in physical activity, made them develop the highest prevalence of type 2 diabetes, and for this reason they have been the subject of many studies. The original dataset is available at the UCI Machine Learning Repository and can be downloaded from this address,<sup>7</sup> "diabetesdata.tar.Z", containing the distribution for 70 sets of data recorded on diabetes patients, with several weeks' to months' worth of glucose, insulin and lifestyle data per patient. The dataset includes data from 768 women with 8 characteristics, in particular:


<sup>4</sup> https://towardsdatascience.com/decision-tree-algorithm-explained-83beb6e78ef4

<sup>5</sup> https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html

<sup>6</sup> https://www.kaggle.com/dmilla/introduction-to-decision-trees-titanic-dataset

<sup>7</sup> http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

1.Number of times pregnant (*NTP*)

2.Plasma glucose concentration at 2 h in an oral glucose tolerance test (*PGC*)

3.Diastolic blood pressure (mm Hg) (*DBP*)

4.Triceps skinfold thickness (mm) (*TSFT*)

5.Two-hour serum insulin (mu U/ml) (*HSI*)

6.Body mass index (weight in kg/(height in m)<sup>2</sup>) (*BMI*)

7.Diabetes pedigree function (*DPF*)

8.Age (years)

*Data Mining - Methods, Applications and Systems*

The last column of the dataset indicates if the person has been diagnosed with diabetes or not.

Without any data preparation step (cleaning, missing-value processing, etc.), we partitioned the dataset into training data (75%) to build the tree and test data (25%) for prediction. Then we kept the default settings, which we can see through the profile of the classifier (**Figure 2**).
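This split-and-fit step can be sketched with scikit-learn as follows. Since the download location varies, a synthetic 768 × 8 array stands in for the Pima data here so the snippet is self-contained; the file name in the comment is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in for the Pima data: in practice, load the 768 x 8 feature matrix
# (NTP, PGC, DBP, TSFT, HSI, BMI, DPF, Age) and the diagnosis column, e.g.
#   df = pd.read_csv("pima-indians-diabetes.csv")  # hypothetical file name
rng = np.random.default_rng(0)
X = rng.random((768, 8))
y = (X[:, 1] + 0.3 * rng.random(768) > 0.8).astype(int)  # synthetic labels

# 75% / 25% split, then a tree with all default settings (no pruning)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(round(acc, 2))
```

On the real data, the accuracies reported in the figures below are obtained with exactly this pattern of `fit` on the training part and `predict` on the held-out part.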

The Scikit-Learn documentation<sup>8</sup> explains in detail how to use each parameter and offers other modules and functions to inspect the information and internal structures of the classifier from training to building step. Among these parameters, we highlight in this review the following four we use to optimise the tree:

• *criterion*: Optional (default = "gini"). This parameter measures the quality of a split and supports two attribute selection measures: "gini" for the Gini index and "entropy" for the information gain.

• *max\_depth*: Optional (default = None), the maximum depth of the tree. If None, then nodes are expanded until all the leaves contain less than min\_samples\_split samples. A higher value of maximum depth causes overfitting, and a lower value causes underfitting.

• *min\_samples\_leaf*: Optional (default = 1), the minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min\_samples\_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

• *min\_impurity\_decrease*: Optional (default = 0.0). A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:


$$\frac{N_t}{N} * \left(\text{impurity} - \frac{N_{tR}}{N_t} * \text{right\_impurity} - \frac{N_{tL}}{N_t} * \text{left\_impurity}\right) \tag{10}$$

where *N* is the total number of samples, *Nt* is the number of samples at the current node, *NtL* is the number of samples in the left child and *NtR* is the number of samples in the right child. *N*, *Nt*, *NtR* and *NtL* all refer to the weighted sum, if sample\_weight is passed.
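To make Eq. (10) concrete, here is a small hand computation with unit sample weights; the numbers are invented for illustration.

```python
def weighted_impurity_decrease(N, Nt, NtL, NtR,
                               impurity, left_impurity, right_impurity):
    """Eq. (10): weighted impurity decrease of a candidate split,
    assuming unit sample weights."""
    return Nt / N * (impurity
                     - NtR / Nt * right_impurity
                     - NtL / Nt * left_impurity)

# A node holding 100 of 768 samples with Gini 0.5, split into children of
# 60 and 40 samples with Gini 0.2 and 0.3. The split is accepted when this
# decrease is >= min_impurity_decrease.
dec = weighted_impurity_decrease(N=768, Nt=100, NtL=60, NtR=40,
                                 impurity=0.5, left_impurity=0.2,
                                 right_impurity=0.3)
print(round(dec, 4))  # → 0.0339
```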

**Figure 2.** *Default setting to create decision tree classifier without pruning.*

<sup>8</sup> https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In this example, each internal node has a decision rule that divides the data. The node impurity is measured by default with the Gini index. A node is pure when all its objects belong to the same class, i.e. impurity = 0. The unpruned tree resulting from this setting (**Figures 3** and **4**) is inexplicable and difficult to understand, so we will show how to adjust some tuning parameters to obtain an optimal tree by pruning.

*Contribution to Decision Tree Induction with Python: A Review*

*DOI: http://dx.doi.org/10.5772/intechopen.92438*

The "export\_graphviz" and "pydotplus" modules convert the decision tree classifier to a "dot" file and render it in "png/pdf/.." formats. Using the various options of these modules, you can adjust leaf colours and edit leaf content, important descriptors, etc. Personally, I really enjoyed doing this during my R&D work.
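A minimal sketch of this export step: with `out_file=None`, scikit-learn's `export_graphviz` returns the "dot" source directly as a string; the iris data is used only to keep the snippet self-contained. The pydotplus rendering line is left as a comment since it needs an extra install.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Any fitted tree works here; iris just keeps the sketch self-contained.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the "dot" source as a string.
dot = export_graphviz(clf, out_file=None,
                      feature_names=iris.feature_names,
                      class_names=iris.target_names,
                      filled=True, rounded=True)
print(dot[:40])

# Rendering to png/pdf needs the optional pydotplus + Graphviz install:
#   import pydotplus
#   pydotplus.graph_from_dot_data(dot).write_png("tree.png")
```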

**Figure 3.** *Decision tree without pruning. Accuracy = 0.72.*

**Figure 4.** *Feature importance. Decision tree without pruning.*

We will now adjust only one parameter, the maximum depth of the tree. This will control the tree size (number of levels). On the same data, we set max\_depth at 4. Next, we set "min\_impurity\_decrease" at 0.01 and min\_samples\_leaf at 5. We will see that this pruned tree is less complex and easier for a field expert to understand than the previous flowchart. We will also see that we obtain good accuracy with this setting. Accuracy can be computed by comparing actual test set values and predicted values (**Figures 5–8**).
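These pruning settings can be sketched as follows; synthetic stand-in data is used here, since the accuracies on the real Pima split are the ones shown in the figures.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the Pima shape (768 rows, 8 features)
rng = np.random.default_rng(0)
X = rng.random((768, 8))
y = (X[:, 1] + 0.25 * rng.standard_normal(768) > 0.6).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Unpruned tree versus the pre-pruned settings used above
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=4,
                                min_impurity_decrease=0.01,
                                min_samples_leaf=5,
                                random_state=0).fit(X_train, y_train)
print(full.get_depth(), pruned.get_depth())
print(round(pruned.score(X_test, y_test), 2))
```

The pruned tree is bounded at four levels and has far fewer nodes, which is what makes the flowchart in **Figure 5** readable.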

The most important features of the Pima Indians Diabetes dataset are shown in **Figures 4**, **6** and **8**. We can see that the root node is glucose, which shows that glucose has the maximum information gain; this confirms common sense and the clinical

**Figure 5.** *Decision tree pruned by means of the maximum depth parameter. Accuracy = 0.79.*

**Figure 6.** *Feature importance. Decision tree after pruning (corresponding to Figure 5 results).*

**Figure 7.** *Decision tree pruned by means of the maximum depth and impurity parameters. Accuracy = 0.80.*

diagnosis basis. Body mass index (BMI) and age are also found among the most important variables. From the relevant literature, we know there are three indicators used to determine diabetes mellitus: fasting blood glucose, random blood glucose and blood glucose tolerance. The Pima Indians Diabetes dataset only has blood glucose tolerance. The prevalence of diabetes mellitus, hypertension and dyslipidaemia increases with higher BMI (BMI ≥ 25 kg/m<sup>2</sup>). On the other hand, type 2 diabetes usually begins after age 40 and is diagnosed at an average age of 65. This is why the French National Authority for Health recommends renewing the screening test every 3 years in people over 45 years, and every year if there is more than one risk factor.
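Reading off the feature importances can be sketched as follows. The data here is synthetic, constructed so that the PGC column drives the label, mimicking the role glucose plays on the real data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

feature_names = ["NTP", "PGC", "DBP", "TSFT", "HSI", "BMI", "DPF", "Age"]

# Synthetic stand-in: the second column (PGC) drives the label, so it
# should dominate the importances, as glucose does on the real data.
rng = np.random.default_rng(0)
X = rng.random((768, 8))
y = (X[:, 1] + 0.2 * rng.standard_normal(768) > 0.6).astype(int)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# feature_importances_ sums to 1; sort it to read off the ranking
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, imp in ranking[:3]:
    print(f"{name}: {imp:.3f}")
```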

**Figure 8.** *Feature importance. Decision tree after pruning (corresponding to Figure 7 results).*

| Predicted/actual | Class A | Class B |
|---|---|---|
| Class A | VA (True A) | FA (False A) |
| Class B | FB (False B) | VB (True B) |

**Table 1.** *Confusion matrix.*

Despite the stop criterion of tree depth, the trees generated may be too deep for a good practical interpretation. The notion of "accuracy" associated with each level of the tree makes it possible to present a sufficiently precise partial tree. This is based on the confusion matrix (**Table 1**): the accuracy P is the ratio of well-classified elements to the sum of all elements and is defined by the following expression:

$$\mathcal{P} = \frac{\text{VA} + \text{VB}}{\text{VA} + \text{VB} + \text{FA} + \text{FB}} \tag{11}$$

The accuracy associated with a level of the tree is calculated by summing the *VA*, *VB*, *FA* and *FB*, taking into account the labels *A* or *B* of each node, and adding to *VA* or *VB* the elements corresponding to pure nodes *A* or *B* in the previous levels. We can thus choose the partial tree according to the desired accuracy.
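Equation (11) is easy to check by hand; here is a toy computation in the notation of Table 1, with invented counts.

```python
def accuracy(VA, VB, FA, FB):
    """Eq. (11): well-classified elements over all elements."""
    return (VA + VB) / (VA + VB + FA + FB)

# Example confusion matrix in the notation of Table 1:
# 50 true A, 40 true B, 6 false A, 4 false B
P = accuracy(VA=50, VB=40, FA=6, FB=4)
print(P)  # → 0.9
```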

#### **4. Discussions**

Decision trees accept, like most learning methods, several hyper-parameters that control their behaviour. In our use case, we used the Gini index as the information criterion to

split the learning data. This criterion directed the method to build a tree with a maximum of 15 levels and to accept a node as a leaf if it includes at least five learning instances. Impurity (entropy) is a measure of disorder in a dataset; zero entropy means that all the instances belong to the same target class, while entropy reaches its maximum when there is an equal number of instances of each class. At each node, we have a number of instances (from the dataset), and we measure its entropy. Setting the impurity to a given value (chosen according to expertise and tests) allows us to select the questions which give the most homogeneous partitions (with the lowest entropy), that is to say, those for which the entropy decreases once the yes/no answer to the question is known.
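The entropy measure described above can be sketched in a few lines of Python; this is a toy illustration of the formula, not the scikit-learn internals.

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a node's class distribution, in bits."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * log2(p) for p in probs)

# Pure node: zero entropy; balanced node: the maximum (1 bit for 2 classes)
print(entropy([10, 0]))  # → 0.0
print(entropy([5, 5]))   # → 1.0
```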

During my previous R&D work, we used the CART algorithm implemented in the scikit-learn library. This implementation is close to the original one proposed by [4]; however, there is no parameter for penalising the deviance of the model by its complexity (number of leaves) in order to build a sequence of nested trees for optimal pruning by cross-validation. The generic k-fold cross-validation function "GridSearchCV" can be used to optimise the depth parameter, but without great precision in pruning: the depth parameter eliminates a whole level, and not only the leaves unnecessary to the quality of the prediction. On the other hand, the implementation anticipates model aggregation methods by integrating the parameters (number of variables drawn, importance, etc.) which are specific to them. However, the graphical representation of the tree is not included and requires other free software such as the "Graphviz" and "Pydotplus" modules.
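This GridSearchCV usage can be sketched on stand-in data as follows; the grid values are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.random((300, 8))
y = (X[:, 1] + 0.2 * rng.standard_normal(300) > 0.6).astype(int)

# k-fold search over the depth parameter: each candidate removes a whole
# level of the tree, which is why this form of pruning is rather coarse.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 3, 4, 5, None]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```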

The pros and cons of decision trees are known and described in almost all the articles and works developed in this field. We highlight some that we consider important for industrial applications. Selecting features is an extremely important step when creating a machine learning solution. If the algorithm does not have good input features, it will not have enough material to learn from, and the results will not be good, even with the best machine learning algorithm ever designed. The selection of features can be done manually, depending on knowledge of the field and of the machine learning method that we plan to use, or by using automatic tools to evaluate and select the most promising ones. Another common problem with datasets is missing values. In most cases, we take a classic imputation approach using the most common value in the training data, or the median value. When we replace missing values, we should understand that we are modifying the original problem and be careful when using this data for other analytical purposes. This is a general rule in machine learning: when we change the data, we should have a clear idea of what we are changing, to avoid distorting the final results. Fortunately, the decision tree requires less data preprocessing from users: it can be used with missing data, and there is no need to normalise features. However, we must be careful in the way we describe categorical data. Having a priori knowledge of the data field, we can favour one or more modalities of a descriptor to force the discretization process to choose a threshold, which highlights the importance of the variables. Moreover, Geurts [21] has shown that the choice of tests (attributes and thresholds) at the internal nodes of the tree can strongly depend on the samples, which also contributes to the variance of the models built with this method.
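The imputation strategies mentioned above can be sketched with scikit-learn's `SimpleImputer`; the column and its values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with missing values marked as np.nan
col = np.array([[7.0], [np.nan], [3.0], [5.0], [np.nan]])

# Median imputation, as discussed above; strategy="most_frequent" would
# use the most common value of the training data instead.
imp = SimpleImputer(strategy="median")
filled = imp.fit_transform(col)
print(filled.ravel())  # → [7. 5. 3. 5. 5.]
```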

Decision trees can easily capture nonlinear patterns, which is important in big data processing. Nevertheless, they are sensitive to noisy data and can overfit it. In big data mining, online data processing is subject to continuous development (upgrades, environment changes, catching up, bugs, etc.) impacting the results expected by customers and users. To this is added the problem of variance, which can be reduced by the bagging and boosting algorithms mentioned in Section 2.1.

Decision trees are biased with imbalanced datasets. It is recommended to balance the dataset before training to prevent the tree from being biased towards the dominant classes. According to the scikit-learn documentation, "class balancing can be done by sampling an equal number of samples from each class, or preferably by normalising the sum of the sample weights (sample\_weight) for each class to the same value. Also note that weight-based pre-pruning criteria, such as min\_weight\_fraction\_leaf, will then be less biased towards dominant classes than criteria that are not aware of the sample weights, like min\_samples\_leaf".
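A sketch of the weighting scheme quoted above, on an artificially imbalanced sample: `compute_class_weight` shows the per-class weights that `class_weight="balanced"` applies, giving the minority class the larger weight.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced stand-in data: roughly 90% class 0, 10% class 1
rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = (rng.random(500) < 0.1).astype(int)

# class_weight="balanced" normalises the sum of sample weights per class,
# the scheme the scikit-learn documentation recommends above.
clf = DecisionTreeClassifier(class_weight="balanced",
                             max_depth=3, random_state=0).fit(X, y)

# The weights actually applied: n_samples / (n_classes * class_count)
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), np.round(weights, 2))))
```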

#### **5. Conclusions**

Decision trees simply respond to a classification problem. The decision tree is one of the few methods that can be presented quickly, without getting lost in mathematical formulations that are difficult to grasp, to an audience not specialised in data processing or machine learning. In this chapter, we have described the key elements necessary to build a decision tree from a dataset, as well as the pruning methods, pre-pruning and post-pruning. We have also pointed to ensemble meta-algorithms as an alternative for solving the variance problem. We have seen that letting the decision tree grow to the end causes several problems, such as overfitting. In addition, the deeper the tree is, the more the number of instances (samples) per leaf decreases. On the other hand, several studies have shown that pruning decreases the performance of the decision tree in estimating probability.

Decision tree properties are now well known. It is mainly positioned as a reference method despite the fact that efforts to develop it are less numerous today. The references cited in this chapter are quite interesting and significant. They provide a broad overview of statistical and machine learning methods, producing a more technical description pointing to the essential key points of tree building. In spite of the fact that the CART algorithm has been around for a long time, it remains an essential reference through its precision, its exhaustiveness and the hindsight which the authors, developers and researchers demonstrate in the solutions they recommend. Academic articles also suggest new learning techniques and often use it in their comparisons to position their work, but the preferred method in machine learning also remains the C4.5 method. The availability of source code on the web justifies this success. C4.5 is now used for Coronavirus Disease 2019 (COVID-19) diagnosis [35, 36].

Finally, we would like to emphasise that the interpretability of a decision tree is a factor which can be subjective and whose importance also depends on the problem. A tree that does not have many leaves can be considered easily interpretable by a human. Some applications require good interpretability, which is not the case for all prediction applications. For industrial problems, an interpretable model with great precision is often necessary to increase knowledge of the field studied and to identify new patterns that can provide solutions to needs and to several expert questions. A lot of effort continues to be invested (by scientific researchers, developers, experts, manufacturers, etc.) in making further improvements to this approach: decision tree induction. This chapter opens several opportunities in terms of algorithms and in terms of applications. For our use case, we would like to have more data to predict the type of diabetes and determine the proportion of each indicator, which could improve the accuracy of predicting diabetes.



### **Author details**

Bouchra Lamrini
Ex-researcher (Senior R&D Engineer) within Livingobjects Company, Toulouse, France

\*Address all correspondence to: lamrini.bouchra@gmail.com

© 2020 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

*Contribution to Decision Tree Induction with Python: A Review DOI: http://dx.doi.org/10.5772/intechopen.92438*

#### **References**

[1] Morgan J, Sonquist J. Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association. 1963;**58**(2):415-435

[2] Morgan J, Messenger R. THAID-A Sequential Analysis Program for the Analysis of Nominal Scale Dependent Variables. Ann Arbor: Survey Research Center, Institute for Social Research, University of Michigan; 1973

[3] Kass G. An exploratory technique for investigating large quantities of categorical data. Applied Statistics. 1980;**29**(2):119-127

[4] Breiman L, Friedman J, Stone C, Olshen R. Classification and Regression Trees. Taylor & Francis; 1984. Available from: https://books.google.fr/books?id=JwQx-WOmSyQC

[5] Hunt E, Marin J, Stone P. Experiments in Induction. New York, NY, USA: Academic Press; 1966. Available from: http://www.univ-tebessa.dz/fichiers/mosta/544f77fe0cf29473161c8f87.pdf

[6] Quinlan JR. Discovering rules by induction from large collections of examples. In: Michie D, editor. Expert Systems in the Micro Electronic Age. Vol. 1. Edinburgh University Press; 1979. pp. 168-201

[7] Paterson A, Niblett T. ACLS Manual. Rapport Technique. Edinburgh: Intelligent Terminals, Ltd; 1982

[8] Kononenko I, Bratko I, Roskar E. Experiments in Automatic Learning of Medical Diagnostic Rules. Technical Report. Ljubljana, Yugoslavia: Jozef Stefan Institute; 1984

[9] Cestnik B, Kononenko I, Bratko I. Assistant86-A knowledge elicitation tool for sophisticated users. In: Bratko I, Lavrac N, editors. Progress in Machine Learning. Wilmslow, UK: Sigma Press; 1987. pp. 31-45

[10] Quinlan JR, Compton PJ, Horn KA, Lazarus L. Inductive knowledge acquisition: A case study. In: Proceedings of the Second Australian Conference on Applications of Expert Systems. Boston, MA, USA: Addison Wesley Longman Publishing Co., Inc.; 1987. pp. 137-156

[11] Quinlan JR. C4.5-Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann; 1993. p. 368

[12] Michalski RS. On the quasi minimal solution of the general covering problem. In: Proceedings of the 5th International Symposium on Information Processing; 1969. pp. 125-128

[13] Clark P, Niblett T. The CN2 induction algorithm. Machine Learning. 1989;**3**(4):261-283

[14] Shafer J, Agrawal R, Mehta M. SPRINT-A scalable parallel classifier for data mining. In: Proceedings of the VLDB Conference; 1996. pp. 544-555

[15] Mehta M, Agrawal R, Rissanen J. SLIQ-A fast scalable classifier for data mining. In: Proceedings of the International Conference on Extending Database Technology (EDBT); 1996. pp. 18-32

[16] Lim TS, Loh WY, Shih YS. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning. 2000;**40**:203-228

[17] Gehrke J, Ramakrishnan R, Ganti V. RainForest—A framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery. 2000;**4**(2–3):127-162. DOI: 10.1023/A:1009839829793


[18] Kotsiantis SB. Decision trees—A recent overview. Artificial Intelligence Review. 2013;**39**:261-283. DOI: 10.1007/s10462-011-9272-4

[19] Rokach L, Maimon O. Top-down induction of decision trees classifiers—A survey. IEEE Transactions on Systems, Man, and Cybernetics: Part C. 2005;**35**(4):476-487. DOI: 10.1109/TSMCC.2004.843247

[20] Brijain MR, Patel R, Kushik MR, Rana K. International Journal of Engineering Development and Research. 2014;**2**(1):1-5

[21] Geurts P. Contributions to Decision Tree Induction: Bias/Variance Tradeoff and Time Series Classification. Belgium: University of Liège; 2002. p. 260. Available from: http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2002/Geu02

[22] Marée R, Geurts P, Visimberga G, Piater J, Wehenkel L. A comparison of generic machine learning algorithms for image classification. In: Proceedings of Research and Development in Intelligent Systems XX; 2004. pp. 169-182

[23] Geurts P, Fillet M, de Seny D, Meuwis MA, Merville MP, Wehenkel L. Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics. 2004;**21**(14):3138-3145. DOI: 10.1093/bioinformatics/bti494

[24] Geurts P, Khayat E, Leduc G. A machine learning approach to improve congestion control over wireless computer networks. In: Proceedings of the IEEE International Conference on Data Mining (ICDM'04); 2004. Available from: https://ieeexplore.ieee.org/document/1410316

[25] Genuer R, Poggi JM. Arbres CART et Forêts aléatoires, Importance et sélection de variables. hal-01387654v2; 2017. pp. 1-5. Available from: https://hal.archives-ouvertes.fr/hal-013876


[26] Elomaa T, Kääriäinen M. An analysis of reduced error pruning. Journal of Artificial Intelligence Research. 2001;**15**:163-187

[27] Quinlan JR. Simplifying decision trees. International Journal of Human Computer Studies. 1999;**51**(2):497-510

[28] Berry MJ, Linoff G. Data Mining Techniques for Marketing, Sales, and Customer Relationship Management. John Wiley & Sons; 1997. Available from: http://hdl.handle.net/2027/mdp.39015071883859

[29] Brostaux Y. Étude du Classement par Forêts Aléatoires D'échantillons Perturbés à Forte Structure D'interaction. Belgium: Faculté Universitaire des Sciences Agronomiques de Gembloux; 2005. p. 168

[30] Niblett T, Bratko I. Learning decision rules in noisy domains. In: Proceedings of Expert Systems '86, The 6th Annual Technical Conference on Research and Development in Expert Systems III. Cambridge University Press; 1987. pp. 25-34. Available from: http://dl.acm.org/citation.cfm?id=26079.26082

[31] Mingers J. Expert systems—rule induction with statistical data. Journal of the Operational Research Society. 1987;**38**(1):39-47

[32] Rakotomalala R, Lallich S. Construction d'arbres de décision par optimisation. Revue des Sciences et Technologies de l'Information - Série RIA: Revue d'Intelligence Artificielle. 2002;**16**(6):685-703. Available from: https://hal.archives-ouvertes.fr/hal-00624091


[33] Rakotomalala R. Arbres de décision. Revue Modulad. 2005;**33**:163-187. Available from: https://www.rocq.inria.fr/axis/modulad/archives/numero-33/tutorial-rakotomalala-33/rakotomalala-33-tutorial.pdf


[34] Mededjel M, Belbachir H. Post-élagage indirect des arbres de décision dans le data mining. In: 4th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications; 2007. pp. 1-7. Available from: http://www.univ-tebessa.dz/fichiers/mosta/544f77fe0cf29473161c8f87.pdf

[35] Wiguna W, Riana D. Diagnosis of Coronavirus Disease 2019 (COVID-19) surveillance using C4.5 Algorithm. Jurnal Pilar Nusa Mandiri. 2020;**16**(1): 71-80. DOI: 10.33480/pilar.v16i1.1293

[36] Wiguna W. Decision tree of Coronavirus Disease (COVID-19) surveillance. IEEE Dataport. 2020. DOI: 10.21227/remc-6d63


#### **Chapter 3**

Association Rule Mining on Big Data Sets

*Oguz Celik, Muruvvet Hasanbasoglu, Mehmet S. Aktas and Oya Kalipsiz*

#### **Abstract**

An accurate, complete, and rapid identification of customer needs, together with appropriate product recommendations, is crucial for increasing customer satisfaction in various sectors such as banking. Due to the significant increase in the number of transactions and customers, the cost of analysis in terms of time and memory consumption has grown. To improve the performance of product recommendation, we discuss a sample data creation approach to association rule mining: instead of processing the whole population, a sample that represents the population is processed, which decreases analysis time and memory consumption. In this regard, sample composition methods, sample size determination techniques, tests that measure the similarity between sample and population, and the association rules (ARs) derived from the sample were examined. The mutual buying behavior of the customers was found using a well-known association rule mining algorithm. The techniques were compared according to the criteria of complete rule derivation and time consumption.

**Keywords:** big data, sampling, association rule mining, data mining, data preprocessing techniques

#### **1. Introduction**

Thanks to improved storage capacities, databases in various fields such as banking have grown to a rich level. Most strategic sales and marketing decisions are taken by processing these data. For example, strategies such as cross-sell, up-sell, or risk management are created as a result of processing the customer data. The increasing number of customers and the need for higher processing capacity have made it more difficult to identify customer requirements rapidly and accurately and to present solution recommendations. Innovative data mining applications and techniques are required to solve this issue [1].

Market basket analysis is one of the data mining methods applied to identify the patterns found in customers' product ownership data. Thanks to this analysis, a pattern among the products frequently bought together by the customers can be established. The obtained pattern plays an active role in developing cross-sell and up-sell strategies.

Market basket analysis consists of two main processes, applied in order: clustering and association. The clustering process involves grouping similar customers into clusters. Thus, those customers which should be
