**2. Artificial neural networks and multilayer neural network**

Artificial Neural Networks (ANNs) are powerful computing techniques that mimic functions of the human brain to solve complex problems arising from big and messy data. As a machine learning method, ANNs act as universal approximators of complex functions, capable of capturing nonlinear relationships between inputs and outcomes. They adaptively learn the functional structure of the data by combining series of nonlinear and linear activation functions. ANNs offer several advantages: they require less formal statistical training, they can detect all possible interactions between input variables, and they include training algorithms based on backpropagation that improve the predictive ability of the model [2].

The feedforward multilayer perceptron (MLP) (**Figure 1**) is the most widely used ANN architecture. Its uses include function approximation, classification, and pattern recognition. Similar to information processing within the human brain, the network is built from successive layers. These layers are fully connected because the neurons in each layer are connected to the neurons of the previous and the subsequent layer through adaptable synaptic parameters. The first layer of a multilayer ANN, called the input layer (or left-most layer), accepts the training data from sources external to the network. The last layer (or right-most layer), called the output layer, contains the output units of the network; depending on whether the task is prediction or classification, it may consist of one or more neurons. The layer(s) between the input and output layers are called hidden layer(s), and depending on the architecture, multiple hidden layers can be placed between the input and output layers.

Training occurs at the neuronal level in the hidden and output layers by updating synaptic strengths, eliminating some synapses, and building new ones. The central idea is to distribute the error function across the hidden layers, in proportion to their effect on the output. **Figure 1** demonstrates the architecture of a simple feedforward MLP, where *P* identifies the input layer, followed by one or more hidden layers and then the output layer containing the fitted values. The feedforward MLP network is evaluated in two stages. First, in the feedforward stage, information comes from the left and each unit evaluates its activation function *f*; the results (outputs) are transmitted to the units connected to the right. The second stage is the backpropagation (BP) step, which trains the network with a gradient descent algorithm in which the network parameters are moved along the negative of the gradient of the performance function. The process consists of running the whole network backward and adjusting the weights of the hidden layer(s). The feedforward and backward steps are repeated several times, called epochs. The *algorithm stops when the value of the loss (error) function has become sufficiently small*.

#### **Figure 1.**

*(Adapted from Okut, 2016.) Artificial neural network design with 4 inputs ($p_i$). Each input is connected to up to 3 neurons via coefficients $w_{kj}^{(l)}$ ($l$ denotes the layer, $j$ the neuron, and $k$ the input variable). Each hidden and output neuron has a bias parameter $b_j^{(l)}$. Here $P$ = inputs, $IW$ = weights from the input to the hidden layer (12 weights), $LW$ = weights from the hidden to the output layer (3 weights), $b^{1}$ = hidden layer biases (3 biases), $b^{2}$ = output layer biases (1 bias), $n^{1} = IW\,P + b^{1}$ is the weighted summation of the first layer, $a^{1} = f(n^{1})$ is the output of the hidden layer, $n^{2} = LW\,a^{1} + b^{2}$ is the weighted summation of the second layer, and $\hat{t} = a^{2} = f(n^{2})$ is the predicted value of the network. The total number of parameters for this ANN is 12 + 3 + 3 + 1 = 19.*
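As a concrete illustration of the caption's notation, the following base-MATLAB sketch evaluates such a 4-3-1 network for a single input vector. The specific weight values are random placeholders, and tanh and identity activations are assumed for the hidden and output layers, respectively (the figure does not fix a particular *f*).

```matlab
% Forward pass of the 4-3-1 MLP in Figure 1 (illustrative values only).
P  = [0.5; -1.2; 0.3; 2.0];      % 4 x 1 input vector (p1..p4)
IW = randn(3, 4);                % 12 input-to-hidden weights
b1 = randn(3, 1);                % 3 hidden-layer biases
LW = randn(1, 3);                % 3 hidden-to-output weights
b2 = randn(1, 1);                % 1 output bias -> 12 + 3 + 3 + 1 = 19 parameters

n1 = IW * P + b1;                % weighted summation of the first layer
a1 = tanh(n1);                   % hidden-layer output, a1 = f(n1)
n2 = LW * a1 + b2;               % weighted summation of the second layer
t_hat = n2;                      % a2 = f(n2); identity f assumed for regression

disp(t_hat)                      % predicted value of the network
```

The matrix dimensions reproduce the parameter count given in the caption: `IW` contributes 12 weights and `LW` contributes 3, plus 3 + 1 biases, for 19 parameters in total.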

As the name suggests, information is fed in a forward direction from the input to the output layer through one or more hidden layers. A feedforward MLP with one hidden layer can approximate virtually any linear or nonlinear function to any degree of accuracy, provided the hidden layer contains an appropriate number of neurons and an appropriate amount of data is available. Adding more neurons to the hidden layers gives the model the flexibility to fit extremely complex nonlinear functions. The same holds true in classification modeling, where such networks can approximate any nonlinear decision boundary with great accuracy [2].

Recurrent neural networks (RNNs) emerged as an effective and scalable ANN model for several learning problems involving sequential data. These networks incorporate loops into the hidden layer. The loops allow information to persist from one time step to the next, so that the hidden state summarizes the past information held at a given time step. Thus, these network types have an infinite dynamic response to sequential data. Many applications, such as Apple's Siri and Google's voice search, use RNNs.
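The hidden-state recurrence can be written out directly. The base-MATLAB sketch below updates a hidden state with a tanh activation at each time step and reads out a prediction through a logistic function, mirroring the formulation in Eq. (1) and Figure 2 later in this section; all sizes and weight values are arbitrary placeholders.

```matlab
% Simple RNN forward pass over a sequence of T input vectors.
T  = 5;  np = 3;  nh = 4;            % sequence length, input size, hidden size
P  = randn(np, T);                   % input sequence p_1 ... p_T (placeholder data)
Wp = randn(nh, np);  Wh = randn(nh, nh);  bh = zeros(nh, 1);
Wy = randn(1, nh);   by = 0;

h = zeros(nh, 1);                    % initial hidden state
yhat = zeros(1, T);
for t = 1:T
    h = tanh(Wp * P(:, t) + Wh * h + bh);       % hidden state carries past information
    yhat(t) = 1 ./ (1 + exp(-(Wy * h + by)));   % logistic read-out of the prediction
end
disp(yhat)
```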

The most popular way to train a neural network is backpropagation (BP). This method can be used with either feedforward or recurrent networks. BP involves working backward through each time step to calculate prediction errors and estimate a gradient, which in turn is used to update the weights of the network. For example, to handle the long sequences found in RNNs, the network is unrolled across multiple time steps; each step adds new layers and recalculates the prediction error, resulting in a very deep network.
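A minimal base-MATLAB sketch of backpropagation through time for the simple RNN above is given below. It assumes a linear read-out and a squared-error loss (choices made here for brevity, not specified in the text) and shows how each unrolled time step contributes to the weight gradients.

```matlab
% Backpropagation through time (BPTT) for the simple RNN sketched above.
T = 5;  np = 3;  nh = 4;
P = randn(np, T);  y = randn(1, T);               % placeholder inputs and targets
Wp = randn(nh, np); Wh = randn(nh, nh); bh = zeros(nh, 1);
Wy = randn(1, nh);  by = 0;

H = zeros(nh, T);  yhat = zeros(1, T);  h = zeros(nh, 1);
for t = 1:T                                       % forward pass, storing the states
    h = tanh(Wp * P(:, t) + Wh * h + bh);
    H(:, t) = h;
    yhat(t) = Wy * h + by;                        % linear read-out for regression
end

dWp = zeros(size(Wp)); dWh = zeros(size(Wh)); dbh = zeros(size(bh));
dWy = zeros(size(Wy)); dby = 0; dh_next = zeros(nh, 1);
for t = T:-1:1                                    % walk backward through the time steps
    e   = yhat(t) - y(t);                         % prediction error at step t
    dWy = dWy + e * H(:, t)';   dby = dby + e;
    dh  = Wy' * e + dh_next;                      % gradient flowing into h_t
    da  = dh .* (1 - H(:, t).^2);                 % through the tanh nonlinearity
    if t > 1, hprev = H(:, t-1); else, hprev = zeros(nh, 1); end
    dWp = dWp + da * P(:, t)';                    % each step adds its contribution
    dWh = dWh + da * hprev';
    dbh = dbh + da;
    dh_next = Wh' * da;                           % pass the gradient one step further back
end
eta = 0.01;                                       % one gradient-descent update
Wp = Wp - eta * dWp;  Wh = Wh - eta * dWh;  bh = bh - eta * dbh;
Wy = Wy - eta * dWy;  by = by - eta * dby;
```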

However, standard and deep RNNs may suffer from vanishing or exploding gradient problems. As more layers containing activation functions are added, the gradient of the loss function may approach zero (vanish), leaving the network parameters essentially unchanged. This stops further training and ends the procedure prematurely. As a result, the parameters capture only short-term dependencies, while information from earlier time steps may be forgotten, and the model converges on a poor solution. Because error gradients can be unstable, the reverse issue, exploding gradients, may also occur; in that case the errors grow drastically within each time step (MATLAB, 2020b). Therefore, backpropagation may be limited by the number of time steps.
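The following snippet is a rough numerical illustration (not taken from the chapter) of why this happens: the backward pass repeatedly multiplies the error signal by the recurrent Jacobian, so the signal either shrinks toward zero or blows up depending on the scale of the recurrent weights. The 0.5 and 4.0 scale factors and the clipping threshold are arbitrary choices; clipping the gradient norm is a common remedy for exploding gradients.

```matlab
% Repeatedly backpropagating through time steps: the gradient norm either
% vanishes or explodes depending on the scale of the recurrent weights.
nh = 8;  T = 50;
for scale = [0.5, 4.0]                        % "small" vs "large" recurrent weights
    Wh = scale * eye(nh);                     % stand-in recurrent weight matrix
    g  = ones(nh, 1);                         % error signal arriving at the last step
    for t = 1:T
        h = tanh(randn(nh, 1));               % stand-in hidden state at this step
        g = Wh' * (g .* (1 - h.^2));          % one backward step through tanh and Wh
    end
    fprintf('scale = %.1f -> gradient norm after %d steps: %g\n', scale, T, norm(g));
end

% A common remedy for exploding gradients: clip the gradient norm.
threshold = 1;
if norm(g) > threshold
    g = g * (threshold / norm(g));            % rescale so that ||g|| equals threshold
end
```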

Long Short-Term Memory (LSTM) networks address the issue of vanishing/exploding gradients and were first introduced by [3]. In addition to the hidden state of an RNN, an LSTM block includes a memory cell (which stores previous information) and introduces a series of gates, called the input, output, and forget gates. These gates allow for additional adjustments (to account for nonlinearity) and prevent errors from vanishing or exploding. The result is a more accurate predicted outcome: the solution does not stop prematurely, nor is previous information lost [4].
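The gate arithmetic of a single LSTM block can be sketched directly in base MATLAB. The formulation below is the generic textbook version (input, forget, and output gates plus a candidate cell update), not the chapter's own MATLAB example, and all weights are random placeholders.

```matlab
% One forward step of an LSTM block: gates decide what to write into the
% memory cell, what to erase from it, and what to expose as the hidden state.
nx = 3;  nh = 4;                          % input and hidden sizes (placeholders)
x = randn(nx, 1);                         % current input
h_prev = zeros(nh, 1);  c_prev = zeros(nh, 1);   % previous hidden state and cell

sigm = @(z) 1 ./ (1 + exp(-z));           % logistic gate activation

% Placeholder weights: one input block, one recurrent block, and a bias per gate.
Wi = randn(nh, nx); Ui = randn(nh, nh); bi = zeros(nh, 1);   % input gate
Wf = randn(nh, nx); Uf = randn(nh, nh); bf = zeros(nh, 1);   % forget gate
Wo = randn(nh, nx); Uo = randn(nh, nh); bo = zeros(nh, 1);   % output gate
Wc = randn(nh, nx); Uc = randn(nh, nh); bc = zeros(nh, 1);   % candidate cell

i_t = sigm(Wi * x + Ui * h_prev + bi);    % how much new information to let in
f_t = sigm(Wf * x + Uf * h_prev + bf);    % how much of the old cell to keep
o_t = sigm(Wo * x + Uo * h_prev + bo);    % how much of the cell to expose
g_t = tanh(Wc * x + Uc * h_prev + bc);    % candidate memory content

c_t = f_t .* c_prev + i_t .* g_t;         % additive update of the memory cell
h_t = o_t .* tanh(c_t);                   % new hidden state passed to the next step
```

Because the cell update is additive rather than repeatedly squashed through a nonlinearity, gradients can flow through many time steps, which is what mitigates the vanishing-gradient problem.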

Practical applications of LSTM have been published in medical journals. For example, studies [5–14] used different variants of RNNs for classification and prediction from medical records. Importantly, LSTM makes no assumptions about the elapsed time between measures, so it can be utilized to subtype patients or diseases. In one such study, LSTM was used to make risk predictions of disease progression for patients with Parkinson's disease by leveraging longitudinal medical records with irregular time intervals [15].

The purpose of this chapter is to introduce Long Short-Term Memory (LSTM) networks. The chapter begins with an introduction to multilayer feedforward architectures. The core characteristics of an RNN and the vanishing gradient problem are then explained briefly. Next, the LSTM neural network, the optimization of its network parameters, and the methodology for avoiding the vanishing gradient problem are covered. The chapter ends with a MATLAB example for LSTM.



#### **Figure 2.**

*A typical RNN that has a hyperbolic tangent activation function $\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$ to generate the hidden state. Because of the hidden state, RNNs have a "memory" in which the information calculated so far is captured. The information in the hidden state is passed further to a second activation function $\frac{1}{1+e^{-x}}$ to generate the predicted (output) values. In RNNs, the weight (**W**) calculation at each time point of the network model is related to the content of the previous time point. We can process a sequence of input vectors (**p**) by applying a recurrence formula at every time step:*

$$a_{h}^{(t)} = W^{(p)}p_{t} + h_{t-1}W^{(h)} + b^{(h)} \qquad (1)$$

#### **Figure 3.**

*An unrolled RNN in which the hidden state carries pertinent information from one input item in the series to the others. The blue and red arrows in the figure indicate the forward and the backward pass of the network, respectively. With the backward pass, we sum up the contributions of each time step to the gradient. In other words, because W is used in every step up to the output, we need to backpropagate gradients from* t = 4 *through the network all the way to* t = 0*.*
