A block diagram of an LSTM at any timestamp is depicted in **Figure 4**. This block is a recurrently connected subnet that contains the cell state and a three-gate structure. Here, *p<sub>t</sub>*, *h<sub>t-1</sub>*, and *C<sub>t-1</sub>* correspond to the input of the current time step, the hidden output from the previous LSTM unit, and the cell state (memory) of the previous unit, respectively. The information from the previous LSTM unit is combined with the current input to generate a newly predicted value. The LSTM block is mainly divided into three gates: forget (blue), input-update (green), and output (red). Each of these gates is connected to the cell state to provide the necessary information that flows from the current time step to the next. A sigmoid activation function, $\sigma(x) = \frac{1}{1 + e^{-x}}$, is implemented in the forget gate. For the input and output gates, however, a combination of the sigmoid and the hyperbolic tangent, $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, is used to provide the necessary information to the cell state. The information generated by the blocks flows through the cell state from one block to the next, forming the chain of repeating components that an LSTM neural network comprises. Details about the cell state and each gate are given in the following subsections.

#### *4.1.1 Cell state*

As shown in the upper part of **Figures 4** and **5**, the cell state is the key to LSTMs and represents the memory of the LSTM network. The cell state works much like a conveyor belt or production chain: information runs straight along the entire chain, subject only to a few linear interactions, such as multiplication and addition. The state of the information depends on these interactions; if there are none, the information flows along unchanged. The LSTM block removes or adds information to the cell state through the gates, which optionally let information through [19].

**Figure 5.**
*The cell state: the horizontal line running through the top of the diagram of an LSTM.*
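
To make the conveyor-belt picture concrete, the following minimal NumPy sketch (illustrative only; the values are made up, not taken from the chapter) applies the two linear interactions named above, pointwise multiplication and addition, to a toy cell-state vector. When the multiplicative gate is fully open and nothing is added, the state passes through unchanged.

```python
import numpy as np

# Toy cell state carrying the "memory" along the chain.
c_prev = np.array([0.5, -1.2, 0.8])

forget = np.array([1.0, 1.0, 1.0])     # fully open: keep all old memory
write = np.array([0.0, 0.0, 0.0])      # fully closed: add nothing new
candidate = np.array([0.3, 0.9, -0.4])

# The only interactions on the "belt": multiply, then add.
c_next = forget * c_prev + write * candidate
print(c_next)  # equals c_prev: information flows through unchanged

# Partially closing the multiplicative gate alters the state in transit.
c_next = np.array([0.5, 0.0, 1.0]) * c_prev + write * candidate
print(c_next)
```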

#### *4.1.2 Forget gate*

The *Forget Gate* (*f<sub>t</sub>*) decides what information should be thrown away or kept from the cell state. This process is implemented by a sigmoid activation function, which outputs values between 0 and 1 computed from the weighted current input (*W*<sup>(f)</sup>*p<sub>t</sub>*), the previous hidden state (*h<sub>t-1</sub>*), and a bias (*b*<sup>(f)</sup>). The forget gate (**Figure 6**) can be described by the equation given in (11). Here, σ is the sigmoid activation function, and *W*<sup>(f)</sup> and *b*<sup>(f)</sup> are the weight matrix and bias vector, which are learned from the training data.

$$f_t = \sigma\left(W^{(f)}\left(p_t, h_{t-1}\right) + b^{(f)}\right) = \frac{1}{1 + e^{-\left(W^{(f)}\left(p_t, h_{t-1}\right) + b^{(f)}\right)}}\tag{11}$$

The function takes the old output (*h<sub>t-1</sub>*) at time *t*-1 and the current input (*p<sub>t</sub>*) at time *t* to calculate the components that control the cell state and the hidden state of the layer. The results lie in [0, 1], where 1 represents "completely keep this" and 0 represents "completely throw this away" (**Figure 6**).

**Figure 6.**
*The forget gate controls what information to throw away from the memory.*
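
As a minimal NumPy sketch of Eq. (11), assuming small made-up dimensions and randomly initialised stand-ins for the learned parameters *W*<sup>(f)</sup> and *b*<sup>(f)</sup>, the forget gate concatenates the current input with the previous hidden state, applies the affine map, and squashes the result with the sigmoid:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^{-x}), as in Eq. (11)
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

# Stand-ins for parameters that would be learned from training data.
W_f = rng.normal(size=(hidden_size, input_size + hidden_size))
b_f = np.zeros(hidden_size)

p_t = rng.normal(size=input_size)   # current input
h_prev = np.zeros(hidden_size)      # previous hidden state h_{t-1}

# Eq. (11): f_t = sigmoid(W^(f) [p_t, h_{t-1}] + b^(f))
f_t = sigmoid(W_f @ np.concatenate([p_t, h_prev]) + b_f)
print(f_t)  # each entry in (0, 1): 1 keeps, 0 forgets that component
```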

#### *4.1.3 Input gate*


The *Input Gate* (*i<sub>t</sub>*) controls what new information will be added to the cell state from the current input. This gate also protects the memory contents from perturbation by irrelevant inputs (**Figures 7** and **8**). A sigmoid activation function generates the input-gate values, mapping the information to the range between 0 and 1. Mathematically, the input gate is:

$$i_t = \sigma\left(W^{(i)}\left(p_t, h_{t-1}\right) + b^{(i)}\right) = \frac{1}{1 + e^{-\left(W^{(i)}\left(p_t, h_{t-1}\right) + b^{(i)}\right)}}\tag{12}$$

where *W*<sup>(i)</sup> and *b*<sup>(i)</sup> are the weight matrix and bias vector, *p<sub>t</sub>* is the input at the current time step, and *h<sub>t-1</sub>* is the hidden state from the previous time step. As in the forget gate, the parameters of the input gate are learned from the training data. At each time step, given the new information *p<sub>t</sub>*, we can compute a candidate cell state.

Next, a vector of new candidate values, *C̃<sub>t</sub>*, is created. The computation of the new candidate is similar to that of (11) and (12) but uses a hyperbolic tangent (tanh) activation function with a value range of (−1, 1). This leads to the following Eq. (13) at time *t*.

$$\tilde{C}_t = \tanh\left(W^{(c)}\left(p_t, h_{t-1}\right) + b^{(c)}\right) = \frac{e^{\left(W^{(c)}\left(p_t, h_{t-1}\right) + b^{(c)}\right)} - e^{-\left(W^{(c)}\left(p_t, h_{t-1}\right) + b^{(c)}\right)}}{e^{\left(W^{(c)}\left(p_t, h_{t-1}\right) + b^{(c)}\right)} + e^{-\left(W^{(c)}\left(p_t, h_{t-1}\right) + b^{(c)}\right)}}\tag{13}$$

**Figure 7.**
*The input-update gate decides what new information should be stored in the cell state; it has two parts: a sigmoid layer and a hyperbolic tangent (tanh) layer. The sigmoid layer is called the "input gate layer" because it decides which values should be updated. The tanh layer produces a vector of new candidate values, C̃<sub>t</sub>, that could be added to the cell state.*
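
The sketch below illustrates Eqs. (12) and (13) in NumPy under the same assumptions as before (made-up dimensions, randomly initialised stand-ins for the learned parameters): the input gate *i<sub>t</sub>* bounds how much is written, while the tanh candidate *C̃<sub>t</sub>* provides the values to write.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)

# Separate learned parameters for the input gate and the candidate.
W_i = rng.normal(size=(hidden_size, input_size + hidden_size))
b_i = np.zeros(hidden_size)
W_c = rng.normal(size=(hidden_size, input_size + hidden_size))
b_c = np.zeros(hidden_size)

p_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)
z = np.concatenate([p_t, h_prev])

i_t = sigmoid(W_i @ z + b_i)      # Eq. (12): in (0, 1), how much to write
c_tilde = np.tanh(W_c @ z + b_c)  # Eq. (13): in (-1, 1), what to write
print(i_t, c_tilde)
```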

In the next step, the values of the input gate and the cell candidate are combined to create and update the cell state, as given in (14). A linear combination of the input gate and the forget gate is used for updating the previous cell state (*C<sub>t-1</sub>*) into the current cell state (*C<sub>t</sub>*). Once again, the input gate (*i<sub>t</sub>*) governs how much new data is taken into account via the candidate (*C̃<sub>t</sub>*), while the forget gate (*f<sub>t</sub>*) determines how much of the old memory cell content (*C<sub>t-1</sub>*) is retained. Using pointwise multiplication (⊙, the Hadamard product), we arrive at the following update equation:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{14}$$

**Figure 8.**
*Memory update is done using old memory via the forget gate and new memory via the input gate.*
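
A minimal numeric sketch of Eq. (14) follows; the gate and candidate values below are made up for illustration (in a network they come from Eqs. (11)–(13)), and the Hadamard product is NumPy's elementwise `*`.

```python
import numpy as np

f_t = np.array([0.9, 0.1, 0.5])        # forget gate
i_t = np.array([0.2, 0.8, 0.5])        # input gate
c_prev = np.array([1.0, -2.0, 0.5])    # old cell state C_{t-1}
c_tilde = np.array([0.5, 1.0, -1.0])   # candidate values

# Eq. (14): elementwise (Hadamard) products blend old and new memory.
c_t = f_t * c_prev + i_t * c_tilde
print(c_t)
```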


#### *4.1.4 Output gate*

The *Output Gate* (*o<sub>t</sub>*) controls which information from the updated cell state (*C<sub>t</sub>*) is revealed to the output at a single time step. In other words, the output gate determines what the value of the next hidden state should be at each time step. As depicted in **Figure 9**, the hidden state comprises information on previous inputs. Moreover, the calculated value of the hidden state for the given time step is used for the prediction ($\hat{y}_t = \mathrm{softmax}(\cdot)$). Here, softmax stands for a nonlinear activation function (e.g., sigmoid or hyperbolic tangent).

**Figure 9.**
*The output gate decides what information will be output, using a sigmoid (σ) layer and a tanh layer (to push the values to be between −1 and 1).*

$$o_t = \sigma\left(W^{(o)}\left(h_{t-1}, p_t\right) + b^{(o)}\right) = \frac{1}{1 + e^{-\left(W^{(o)}\left(h_{t-1}, p_t\right) + b^{(o)}\right)}}\tag{15}$$

$$h_t = o_t \odot \tanh\left(C_t\right) = o_t \cdot \frac{e^{\left(f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\right)} - e^{-\left(f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\right)}}{e^{\left(f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\right)} + e^{-\left(f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\right)}}\tag{16}$$

$$\hat{y}_t = \sigma\left(W^{(y)} h_t + b^{(y)}\right) = \frac{1}{1 + e^{-\left(W^{(y)} h_t + b^{(y)}\right)}}\tag{17}$$

First, the previous hidden state (*h<sub>t-1</sub>*) and the current input are passed into a sigmoid function. Next, the newly updated cell state is passed through the tanh function [15, 18]. Finally, the tanh output is multiplied by the sigmoid output to determine what information the hidden state should carry (16). The final product of the output gate is an updated hidden state, which is used for the prediction at time step *t*. The aim of this gate is therefore to separate the updated cell state (the updated memory) from the hidden state. The updated cell state (*C<sub>t</sub>*) contains a lot of information that does not necessarily need to be kept in the updated hidden state; however, this information is critical, as the updated hidden state at each time step is used in all gates of an LSTM block. Thus, the output gate assesses which parts of the cell state (*C<sub>t</sub>*) are presented in the hidden state (*h<sub>t</sub>*). The new cell and hidden states are then passed to the next time step (**Figure 9**).
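
The following NumPy sketch of Eqs. (15) and (16), again with assumed dimensions and random stand-ins for the learned parameters *W*<sup>(o)</sup> and *b*<sup>(o)</sup>, filters the squashed cell state through the output gate to produce the hidden state:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(2)

W_o = rng.normal(size=(hidden_size, input_size + hidden_size))
b_o = np.zeros(hidden_size)

p_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)
c_t = rng.normal(size=hidden_size)   # updated cell state from Eq. (14)

# Eq. (15): decide which parts of the memory to reveal.
o_t = sigmoid(W_o @ np.concatenate([h_prev, p_t]) + b_o)

# Eq. (16): squash the memory to (-1, 1), then filter it with o_t.
h_t = o_t * np.tanh(c_t)
print(h_t)  # passed to the next time step and used for the prediction
```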

*Summary of forward pass*


1. *Forget gate*: Controls what information to throw away and decides how much of the past should be remembered: $f_t = \sigma\left(W^{(f)}\left(p_t, h_{t-1}\right) + b^{(f)}\right)$.

2. *Input-update gate*: Controls what information to add to the cell state from the current input and decides how much should be added: $i_t = \sigma\left(W^{(i)}\left(p_t, h_{t-1}\right) + b^{(i)}\right)$, $\tilde{C}_t = \tanh\left(W^{(c)}\left(p_t, h_{t-1}\right) + b^{(c)}\right)$.

3. *Output gate*: Determines the part of the current cell state that makes it to the output: $o_t = \sigma\left(W^{(o)}\left(h_{t-1}, p_t\right) + b^{(o)}\right)$.

4. *Current cell state*: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$.

5. *Current hidden state*: $h_t = o_t \odot \tanh\left(C_t\right) \Rightarrow h_t = \mathrm{LSTM}\left(p_t, h_{t-1}\right)$.

6. *LSTM block prediction*: $\hat{y}_t = \sigma\left(W^{(y)} h_t + b^{(y)}\right)$.

7. *LSTM block error for the time step*: $E_t\left(y_t, \hat{y}_t\right) = -y_t \log \hat{y}_t$.
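
As a compact illustration of the full forward pass, the sketch below strings Eqs. (11)–(17) together in NumPy. It is a minimal, self-contained sketch: the dimensions, names, and randomly initialised weights are all assumptions for illustration, standing in for the trained parameters of an actual model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(p_t, h_prev, c_prev, params):
    """One LSTM forward step following Eqs. (11)-(16)."""
    # The chapter writes (h_{t-1}, p_t) for o_t; with learned weights the
    # ordering of the concatenation is immaterial, so one z is reused.
    z = np.concatenate([p_t, h_prev])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate, Eq. (11)
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate, Eq. (12)
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate, Eq. (13)
    c_t = f_t * c_prev + i_t * c_tilde                    # cell update, Eq. (14)
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate, Eq. (15)
    h_t = o_t * np.tanh(c_t)                              # hidden state, Eq. (16)
    return h_t, c_t

# Randomly initialised parameters stand in for trained weights.
hidden, inp = 4, 3
rng = np.random.default_rng(3)
params = {f"W_{g}": rng.normal(size=(hidden, inp + hidden)) for g in "fico"}
params.update({f"b_{g}": np.zeros(hidden) for g in "fico"})
W_y, b_y = rng.normal(size=(1, hidden)), np.zeros(1)

# Run a short random sequence through the block.
h, c = np.zeros(hidden), np.zeros(hidden)
for p_t in rng.normal(size=(5, inp)):
    h, c = lstm_step(p_t, h, c, params)

y_hat = sigmoid(W_y @ h + b_y)        # prediction, Eq. (17)
y_true = np.array([1.0])
E_t = -(y_true * np.log(y_hat))       # block error for the time step
print(y_hat, E_t)
```

Carrying (*h<sub>t</sub>*, *C<sub>t</sub>*) forward through the loop is exactly the hand-off to the next time step described above: the cell state acts as the conveyor belt, while the hidden state is what each step exposes to the gates and to the prediction.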
