As mentioned before, the output from an RNN depends on its previous state, that is, on the circumstances of the previous N time steps. Conventional RNNs face difficulty in learning and maintaining such long-range dependencies. Imagine the unfolded RNN given in **Figure 3**: each time step requires a new copy of the network, so for large RNNs thousands, even millions, of weights need to be updated. Recall that the prediction at the next time step is generated from the current hidden state,

$$y_{t+1} = f(a_t) = \tanh\left(W^{(y)} h_t + b^{(y)}\right).$$

In other words, because the network parameters are used in every step up to the output, we need to backpropagate gradients from the last time step through the network all the way to *t* = 0.

Note that, as given in (2), the current state $h_t = \tanh\left(W^{(p)} p_t + h_{t-1} W^{(h)} + b^{(h)}\right)$ depends on the quantity of the previous state ($h_{t-1}$) and the other parameters. Therefore, the differentiation of $h_t$ with respect to $h_j$ (here *j* = 0, 1, …, *t*-1) appearing in (7) is a chain rule itself. For **Figure 3**, for example, the derivative is $\frac{\partial h_4}{\partial h_0} = \frac{\partial h_4}{\partial h_3}\,\frac{\partial h_3}{\partial h_2}\,\frac{\partial h_2}{\partial h_1}\,\frac{\partial h_1}{\partial h_0}$, and for the entire time it will be:

$$\frac{\partial h_t}{\partial h_j} = \frac{\partial h_t}{\partial h_{t-1}}\,\frac{\partial h_{t-1}}{\partial h_{t-2}}\,\frac{\partial h_{t-2}}{\partial h_{t-3}}\cdots\frac{\partial h_{j+1}}{\partial h_j} = \prod_{k=j+1}^{t}\frac{\partial h_k}{\partial h_{k-1}}, \tag{8}$$

while the Jacobian matrix for the hidden state is given by:

$$\frac{\partial h_j}{\partial h_{j-1}} = \left[\frac{\partial h_j}{\partial h_{j-1,1}},\ \frac{\partial h_j}{\partial h_{j-1,2}},\ \ldots,\ \frac{\partial h_j}{\partial h_{j-1,s}}\right] = \begin{bmatrix}\dfrac{\partial h_{j,1}}{\partial h_{j-1,1}} & \cdots & \dfrac{\partial h_{j,1}}{\partial h_{j-1,s}}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial h_{j,s}}{\partial h_{j-1,1}} & \cdots & \dfrac{\partial h_{j,s}}{\partial h_{j-1,s}}\end{bmatrix}. \tag{9}$$

Putting Eqs. (7) and (8) together, we have the following relationship for the gradient over the entire sequence:

$$\frac{\partial E}{\partial W} = \sum_{t}\sum_{j}\frac{\partial E_t}{\partial y_t}\,\frac{\partial y_t}{\partial h_t}\left(\prod_{k=j+1}^{t}\frac{\partial h_k}{\partial h_{k-1}}\right)\frac{\partial h_j}{\partial W}, \tag{10}$$

where $\frac{\partial E_t}{\partial y_t}$ is the derivative of the error with respect to the output at time *t* and $\frac{\partial y_t}{\partial h_t}$ is the derivative of the output with respect to the hidden state that stores the memory at time *t*. The Jacobians in (10), $\frac{\partial h_j}{\partial h_{j-1}}$, can be written through the decomposition $\left(W^{(h)}\right)^{T}\mathrm{diag}\left(f'\left(h_{j-1}\right)\right)$, from which the eigenvalues and eigenvectors are generated; here $\left(W^{(h)}\right)^{T}$ is the transpose of the recurrent weight matrix of the network. Consequently, if the largest eigenvalue is smaller than 1 the gradients shrink, and if it is greater than 1 they grow, so the RNN suffers from vanishing or exploding gradient problems, respectively (see **Figure 3**).

Imagine unrolling the RNN a thousand times: every activation of every neuron inside the network is replicated a thousand times, which, especially for larger networks, means thousands or millions of weights. The Jacobian matrices play a role in updating those weights, and because the tanh activation applied to the hidden state in (2) bounds its values between -1 and 1, the entries of these Jacobians are small. It can easily be imagined that the derivatives of the tanh (or sigmoid) activation function approach 0 at the tails. Zero gradients drive other gradients in previous layers towards 0. Thus, with small values in the Jacobian matrix and multiple matrix multiplications (*t*-*j* of them, in particular), the gradient values shrink exponentially fast, eventually vanishing completely after a few time steps. As a result, the RNN ends up not learning long-range dependencies. As in RNNs, the vanishing gradient problem is also an important issue for a deep feedforward MLP when multiple hidden layers (with multiple neurons in each) are placed between the input and output layers.
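The repeated Jacobian products in (8) and (10) can be illustrated numerically. The following is a minimal NumPy sketch (illustrative only; the hidden size, number of time steps, and weight scales are arbitrary choices, not values from this chapter) that multiplies Jacobians of the form $(W^{(h)})^{T}\,\mathrm{diag}(\tanh'(h))$ and tracks the norm of the product: rescaling the recurrent weights so that their largest singular value is below 1 makes the product vanish, while a much larger scale typically makes it grow.

```python
import numpy as np

# Minimal sketch: multiply t - j Jacobians of the form W_h^T @ diag(tanh'(h))
# and track the norm of the product, as in Eqs. (8)-(10).
rng = np.random.default_rng(0)
s = 16        # size of the hidden state (arbitrary)
steps = 50    # how many time steps we backpropagate through (arbitrary)

def jacobian_product_norm(scale):
    """Spectral norm of the accumulated Jacobian product for recurrent
    weights rescaled so that their largest singular value equals `scale`."""
    W_h = rng.standard_normal((s, s))
    W_h *= scale / np.linalg.norm(W_h, 2)        # set the largest singular value
    product = np.eye(s)
    for _ in range(steps):
        h = np.tanh(rng.standard_normal(s))      # a stand-in hidden state
        # tanh'(x) = 1 - tanh(x)^2, so diag(1 - h**2) is the activation derivative
        product = product @ (W_h.T @ np.diag(1.0 - h**2))
    return np.linalg.norm(product, 2)

print("largest singular value 0.9 :", jacobian_product_norm(0.9))  # shrinks toward 0
print("largest singular value 3.0 :", jacobian_product_norm(3.0))  # typically grows very large
```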

**4. Long short-term memory**

Long short-term memory networks (LSTMs) are a special type of RNN that can overcome the vanishing gradient problem and learn long-term dependencies. The LSTM introduces a memory unit and a gate mechanism that enable it to capture long-range dependencies in a sequence. The term "long short-term memory" originates from the following intuition. Simple RNNs have long-term memory in the form of weights: the weights change gradually during training, encoding general knowledge about the training data. They also have short-term memory in the form of ephemeral activations, which flow from each node to successive nodes [17, 18].

### **4.1 The architecture of LSTM**

The neural network architecture for an LSTM block given in **Figure 4** shows that the LSTM extends the RNN's memory and can selectively remember or forget information through a structure called the cell state and three gates. Thus, in addition to the hidden state of an RNN, an LSTM block typically has four more layers: the cell state (*Ct*), an input gate (*it*), an output gate (*Ot*), and a forget gate (*ft*). These layers interact with one another in a very particular way to extract information from the training data.

#### **Figure 4.**

*Illustration of the long short-term memory block structure. The operator "⨀" denotes element-wise multiplication. The* Ct-1, Ct, ht-1 *and* ht *are the previous cell state, current cell state, previous hidden state and current hidden state, respectively. The* ft, it, ot *are the values of the forget, input and output gates, respectively. The* C̃t *is the candidate value for the cell state,* W(f), W(i), W(c), W(o) *are weight matrices containing the forget gate, input gate, cell state and output gate weights, and* b(f), b(i), b(c), *and* b(o) *are bias vectors associated with them.*

A block diagram of the LSTM at any timestamp is depicted in **Figure 4**. This block is a recurrently connected subnet that contains the cell state and the three gates described above. The *pt*, *ht*-1, and *Ct*-1 correspond to the input at the current time step, the hidden output from the previous LSTM unit, and the cell state (memory) of the previous unit, respectively. The information from the previous LSTM unit is combined with the current input to generate a newly predicted value. The LSTM block is divided mainly into three gates: forget (blue), input-update (green), and output (red). Each of these gates is connected to the cell state to provide the information that flows from the current time step to the next.
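To make the data flow in **Figure 4** concrete, here is a minimal NumPy sketch of a single LSTM forward step (illustrative only: the `lstm_step` helper, the dictionary-of-weights layout, and the toy dimensions are choices made for this sketch, not the chapter's code). The forget, input, and candidate computations mirror Eqs. (11)–(13) developed below; the cell-state update corresponds to the update described around Eq. (14), and the output gate follows the standard LSTM formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(p_t, h_prev, C_prev, W, b):
    """One LSTM forward step following the gate equations in this section.

    W and b are dicts holding the weight matrices W(f), W(i), W(c), W(o)
    (each of shape (hidden, input + hidden)) and the matching bias vectors.
    """
    z = np.concatenate([p_t, h_prev])          # [p_t, h_{t-1}]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate, Eq. (11)
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate,  Eq. (12)
    C_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate cell state, Eq. (13)
    C_t = f_t * C_prev + i_t * C_tilde         # cell state update (Eq. (14))
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate (standard form)
    h_t = o_t * np.tanh(C_t)                   # new hidden state
    return h_t, C_t

# Toy usage with arbitrary sizes (input dimension 3, hidden dimension 4)
rng = np.random.default_rng(1)
n_in, n_h = 3, 4
W = {k: rng.standard_normal((n_h, n_in + n_h)) * 0.1 for k in "fico"}
b = {k: np.zeros(n_h) for k in "fico"}
h, C = np.zeros(n_h), np.zeros(n_h)
for p_t in rng.standard_normal((5, n_in)):     # a 5-step toy sequence
    h, C = lstm_step(p_t, h, C, W, b)
print("hidden state after 5 steps:", h)
```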

A sigmoid activation function, $\sigma(x) = \frac{1}{1+e^{-x}}$, is implemented in the forget gate. For the input and output gates, however, a combination of the sigmoid and the hyperbolic tangent, $\tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, is used to provide the necessary information to the cell state. The information generated by the blocks flows through the cell state from one block to another, which is what holds together the chain of repeating components of the LSTM neural network. Details about the cell state and each gate are given under separate subheadings below.
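The two squashing functions behave quite differently, and a quick check (illustrative only, not part of the chapter's analyses) shows why each is used where it is: the sigmoid maps any input strictly into (0, 1), matching the "keep versus forget" fractions of the gates, while tanh maps into (-1, 1), matching the signed candidate values used later in Eq. (13).

```python
import numpy as np

# Illustrative check of the two activation ranges used inside the LSTM block:
# the sigmoid squashes any input into (0, 1), a "how much to keep" fraction,
# while tanh squashes into (-1, 1), suitable for signed candidate values.
x = np.linspace(-10.0, 10.0, 9)
sigmoid = 1.0 / (1.0 + np.exp(-x))
print(sigmoid.min(), sigmoid.max())        # both strictly inside (0, 1)
print(np.tanh(x).min(), np.tanh(x).max())  # both strictly inside (-1, 1)
```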

*4.1.2 Forget gate*

The *forget gate* (*ft*) controls what information to throw away from the memory. The function takes the old output (*ht*-1) at time *t*-1 and the current input (*pt*) at time *t* for calculating the components that control the cell state and hidden state of the layer. The results lie in [0, 1], where 1 represents "completely hold this" and 0 represents "completely throw this away" (**Figure 6**). Mathematically, the forget gate is:

$$f_t = \sigma\!\left(W^{(f)}\cdot\left[p_t, h_{t-1}\right] + b^{(f)}\right) = \frac{1}{1 + e^{-\left(W^{(f)}\cdot\left[p_t, h_{t-1}\right] + b^{(f)}\right)}}, \tag{11}$$

where σ is the sigmoid activation function, and *W*(*f*) and *b*(*f*) are the weight matrix and bias vector, which will be learned from the input training data.

#### **Figure 6.**

*The forget gate controls what information to throw away from the memory.*

*4.1.3 Input gate*

The *input gate* (*it*) controls what new information will be added to the cell state from the current input. This gate also plays the role of protecting the memory contents from perturbation by irrelevant inputs (**Figures 7** and **8**). A sigmoid activation function is used to generate the input gate values, squashing the information to between 0 and 1. So, mathematically, the input gate is:

$$i_t = \sigma\!\left(W^{(i)}\cdot\left[p_t, h_{t-1}\right] + b^{(i)}\right) = \frac{1}{1 + e^{-\left(W^{(i)}\cdot\left[p_t, h_{t-1}\right] + b^{(i)}\right)}}, \tag{12}$$

where *W*(*i*) and *b*(*i*) are the weight matrix and bias vector, *pt* is the input at the current time step, and *ht*-1 is the hidden output of the previous time step. Similar to the forget gate, the parameters of the input gate will be learned from the input training data. At each time step, with the new information *pt*, we can also compute a candidate cell state.

Next, a vector of new candidate values, *C̃t*, is created. The computation of the new candidate is similar to that in (11) and (12) but uses a hyperbolic tangent (tanh) activation function with a value range of (-1, 1). This leads to the following Eq. (13) at time *t*:

$$\tilde{C}_t = \tanh\!\left(W^{(c)}\cdot\left[p_t, h_{t-1}\right] + b^{(c)}\right) = \frac{e^{\left(W^{(c)}\cdot\left[p_t, h_{t-1}\right] + b^{(c)}\right)} - e^{-\left(W^{(c)}\cdot\left[p_t, h_{t-1}\right] + b^{(c)}\right)}}{e^{\left(W^{(c)}\cdot\left[p_t, h_{t-1}\right] + b^{(c)}\right)} + e^{-\left(W^{(c)}\cdot\left[p_t, h_{t-1}\right] + b^{(c)}\right)}}. \tag{13}$$

In the next step, the values of the input gate and the cell candidate are combined to create and update the cell state, as given in (14). A linear combination of the input gate and the forget gate is used for updating the previous cell state (*Ct*-1) into the current cell state (*Ct*).
