**4.2 Backward pass**

Like the RNN, an LSTM network generates an output $\hat{y}_t$ at each time step, and this output is used to train the network via gradient descent (**Figure 10**). During the backward pass, the network parameters are updated at each epoch (iteration). The back-propagation algorithm of the LSTM differs from that of the RNN only in a minor modification. Here, the error term calculated at each time step is $E_t = -y_t \log \hat{y}_t$. As in the RNN, the total error is obtained by summing the errors over all time steps, $E = -\sum_t y_t \log \hat{y}_t$.

Similarly, the gradient $\frac{\partial E_t}{\partial W}$ is calculated at each time step and the gradients are then summed over all time steps, $\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}$. Remember that the predicted value $\hat{y}_t$ is a function of the hidden state ($\hat{y}_t = \sigma(W^{(y)} h_t + b^{(y)})$) and the hidden state $h_t$ is a function of the cell state ($h_t = o_t \odot \tanh(C_t)$). Both of these enter the chain rule. Hence, the derivative of an individual error term with respect to the network parameters is:

$$\frac{\partial E\_t}{\partial W} = \frac{\partial E\_t}{\partial \hat{y}\_t} \frac{\partial \hat{y}\_t}{\partial h\_t} \frac{\partial h\_t}{\partial c\_t} \frac{\partial c\_t}{\partial c\_{t-1}} \frac{\partial c\_{t-1}}{\partial c\_{t-2}} \frac{\partial c\_{t-2}}{\partial c\_{t-3}} \dots \frac{\partial c\_0}{\partial W} \tag{18}$$

The overall error gradient then follows from the chain rule of differentiation:

$$\frac{\partial E}{\partial W} = \sum\_{t} \frac{\partial E\_t}{\partial \hat{y}\_t} \frac{\partial \hat{y}\_t}{\partial h\_t} \frac{\partial h\_t}{\partial c\_t} \frac{\partial c\_t}{\partial c\_{t-1}} \frac{\partial c\_{t-1}}{\partial c\_{t-2}} \frac{\partial c\_{t-2}}{\partial c\_{t-3}} \dots \frac{\partial c\_0}{\partial W} \tag{19}$$
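As a quick numerical illustration of this summation, the toy sketch below (a minimal example assuming a single scalar output weight in place of $W^{(y)}$, with the hidden states $h_t$ treated as given constants rather than produced by an LSTM) checks that the analytic sum of per-step gradients $\sum_t \frac{\partial E_t}{\partial W}$ agrees with a finite-difference estimate of the total gradient.

```python
# Toy check that the total gradient is the sum of per-time-step gradients,
# dE/dW = sum_t dE_t/dW. The scalar weight w and the given hidden states h_t
# are illustrative assumptions, not the chapter's full LSTM.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
T = 5
h = rng.normal(size=T)            # hidden states h_t, assumed given
y = rng.integers(0, 2, size=T)    # binary targets y_t
w = 0.7                           # scalar stand-in for the output weight W^(y)

y_hat = sigmoid(w * h)            # predictions y_hat_t = sigma(w * h_t)
E_t = -y * np.log(y_hat)          # per-step error E_t = -y_t * log(y_hat_t)
dEt_dw = -y * (1.0 - y_hat) * h   # analytic per-step gradient dE_t/dw

E = E_t.sum()                     # total error E = sum_t E_t
grad_sum = dEt_dw.sum()           # summed gradient, as in Eq. (19)

# Finite-difference check of the summed gradient.
eps = 1e-6
E_eps = (-y * np.log(sigmoid((w + eps) * h))).sum()
print(grad_sum, (E_eps - E) / eps)   # the two values should agree closely
```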

As Eq. (19) illustrates, when an LSTM is trained with the backpropagation algorithm the gradient involves a chain of $\partial c_t$ terms, whereas the gradient equation for a basic RNN involves a chain of $\partial h_t$ terms. Therefore, the Jacobian matrix of the cell state for an LSTM is [20]:

$$
\begin{bmatrix}
\frac{\partial c\_{j}}{\partial c\_{j-1,1}}, \frac{\partial c\_{j}}{\partial c\_{j-1,2}} \dots \frac{\partial c\_{j}}{\partial c\_{j-1,\varsigma}}
\end{bmatrix} = \begin{bmatrix}
\frac{\partial c\_{j,1}}{\partial c\_{j-1,1}} & \cdots & \frac{\partial c\_{j,1}}{\partial c\_{j-1,\varsigma}} \\
\vdots & \ddots & \vdots \\
\frac{\partial c\_{j,\varsigma}}{\partial c\_{j-1,1}} & \cdots & \frac{\partial c\_{j,\varsigma}}{\partial c\_{j-1,\varsigma}}
\end{bmatrix} \tag{20}
$$
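The sketch below (with illustrative dimensions, random weights and a single standard LSTM step, not a trained network) assembles this Jacobian by finite differences. With $h_{t-1}$ held fixed, only the direct dependence of $c_t$ on $c_{t-1}$ remains, so the matrix comes out diagonal with the forget gate $f_t$ on the diagonal; the indirect paths through $h_{t-1}$ are exactly the extra terms expanded in Eq. (21) below.

```python
# Finite-difference construction of the cell-state Jacobian dc_t/dc_{t-1} of Eq. (20)
# for one standard LSTM step. Dimensions, inputs and weights are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(p_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps the concatenation [p_t, h_{t-1}] to the four gate pre-activations."""
    z = W @ np.concatenate([p_t, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c_t = f * c_prev + i * np.tanh(g)   # Eq. (14): C_t = f_t * C_{t-1} + i_t * C~_t
    h_t = o * np.tanh(c_t)              # h_t = o_t * tanh(C_t)
    return h_t, c_t, f

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.5, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
p_t, h_prev, c_prev = rng.normal(size=n_in), rng.normal(size=n_hid), rng.normal(size=n_hid)

eps = 1e-6
_, c_base, f_gate = lstm_step(p_t, h_prev, c_prev, W, b)
J = np.zeros((n_hid, n_hid))
for j in range(n_hid):                  # column j holds dc_t / dc_{t-1, j}
    c_pert = c_prev.copy()
    c_pert[j] += eps
    _, c_plus, _ = lstm_step(p_t, h_prev, c_pert, W, b)
    J[:, j] = (c_plus - c_base) / eps

print(np.round(J, 4))                   # approximately diag(f_t) when h_{t-1} is held fixed
print(np.round(f_gate, 4))
```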

#### *The problem of gradient vanishing*

Recall Eq. (14) for the cell state, $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$. When we consider Eq. (19), the value of the gradient in the LSTM is controlled by the chain of derivatives starting from the term $\frac{\partial c_t}{\partial c_{t-1}}$.


#### **Figure 10.**


*Illustration of (A) an LSTM unit over 3 time steps with input data (demographic and clinical data). The LSTM network takes the inputs of the current time step to update the hidden state, $h_t = LSTM(p_t, h_{t-1})$, with relevant information. The "x" in the circles denotes point-wise operators; $\sigma$ and tanh are the sigmoid ($\frac{1}{1+e^{-x}}$, with outputs between 0 and 1) and hyperbolic tangent ($\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, with outputs between −1 and 1) activation functions. (B) An RNN over 3 time steps; it has only a hyperbolic tangent, $\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, activation function in the block.*

Expanding $\frac{\partial c_t}{\partial c_{t-1}}$ using the expression $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ gives:

$$\begin{split} \frac{\partial c\_{t}}{\partial c\_{t-1}} &= \frac{\partial \left( f\_{t} \bigodot \mathcal{C}\_{t-1} + i\_{t} \bigodot \tilde{\mathcal{C}}\_{t} \right)}{\partial \big( f\_{t-1} \bigodot \mathcal{C}\_{t-2} + i\_{t-1} \bigodot \tilde{\mathcal{C}}\_{t-1} \big)} \\ &= \frac{\partial c\_{t}}{\partial f} \frac{\partial f}{\partial h\_{t-1}} \frac{\partial h\_{t-1}}{\partial c\_{t-1}} + \frac{\partial c\_{t}}{\partial i} \frac{\partial i}{\partial h\_{t-1}} \frac{\partial h\_{t-1}}{\partial c\_{t-1}} + \frac{\partial c\_{t}}{\partial \tilde{C}\_{t}} \frac{\partial \tilde{\mathcal{C}}\_{t}}{\partial h\_{t-1}} \frac{\partial h\_{t-1}}{\partial c\_{t-1}} + \frac{\partial c\_{t}}{\partial c\_{t-1}} \end{split} \tag{21}$$

Note that the term $\frac{\partial c_t}{\partial c_{t-1}}$ does not have a fixed pattern and can take any positive value in the LSTM, while the corresponding term $\frac{\partial h_t}{\partial h_{t-1}}$ in the standard RNN ends up consistently greater than 1 or consistently less than 1 after a certain number of time steps. Thus, for an LSTM, the term will neither converge to 0 nor diverge completely, even for a very large number of time steps. If the gradient starts converging towards zero, the weights of the gates can be adjusted so that the term is brought closer to 1.
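A simplified scalar illustration of this behaviour is sketched below (illustrative assumptions: only the direct per-step factors are multiplied, the additional terms of Eq. (21) are dropped, and the gate and weight values are fixed rather than learned). The RNN-style factors contain a $\tanh$ derivative multiplied by a recurrent weight and their product collapses over many steps, whereas forget gates kept close to 1 leave the product along the LSTM cell path essentially intact.

```python
# Compare the product of per-step factors along an RNN hidden-state path with the
# product of forget gates along an LSTM cell path (scalar, illustrative values).
import numpy as np

rng = np.random.default_rng(42)
T = 100

# RNN-like factors: dh_t/dh_{t-1} ~ tanh'(a_t) * w, with tanh' <= 1 and |w| < 1 here.
w = 0.9
pre_activations = rng.normal(size=T)
rnn_factors = (1.0 - np.tanh(pre_activations) ** 2) * w

# LSTM-like factors: the direct term dc_t/dc_{t-1} is the forget gate f_t, which the
# network can push towards 1 (modelled here with a large gate pre-activation).
forget_gates = 1.0 / (1.0 + np.exp(-rng.normal(loc=3.0, scale=0.5, size=T)))

print("RNN-path product over 100 steps: ", np.prod(rnn_factors))
print("LSTM-path product over 100 steps:", np.prod(forget_gates))
```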

#### **4.3 Other types of LSTMs**

Several modifications to the original LSTM architecture have been proposed over the years. Surprisingly, over more than 20 years the original has continued to perform well, showing predictive ability similar to, and often better than, its variants.

#### *4.3.1 Peephole connections*

This is a variant of the LSTM obtained by adding "*peephole connections*" to the standard LSTM network. The idea is that peephole connections are needed to capture information inherent in time lags; in other words, with *peephole connections* the information conveyed by the time intervals between sub-patterns of a sequence is made available to the recurrent network. Thus, *peephole connections* concatenate the previous cell state ($C_{t-1}$) to the inputs of the forget, input and output gates. This configuration was proposed to improve the ability of LSTMs to count and to measure time distances between rare events [21]. With peephole connections, the expressions for these gates become:

$$\begin{aligned} \mathbf{f}\_t &= \sigma\left(\mathcal{W}^{(f)}\left(\mathbf{p}\_t, \mathbf{h}\_{t-1}, \mathbf{C}\_{t-1}\right) + b^{(f)}\right) \\\\ i\_t &= \sigma\left(\mathcal{W}^{(i)}\left(\mathbf{p}\_t, \mathbf{h}\_{t-1}, \mathbf{C}\_{t-1}\right) + b^{(i)}\right) \\\\ o\_t &= \sigma\left(\mathcal{W}^{(o)}\left(\mathbf{h}\_{t-1}, p\_t\right) + b^{(o)}\right) \end{aligned} \tag{22}$$
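A minimal sketch of one forward step of such a peephole cell, written directly from Eq. (22), is given below. The weight shapes, the random parameter values and the candidate-state form $\tilde{C}_t = \tanh(W^{(c)}(p_t, h_{t-1}) + b^{(c)})$ are illustrative assumptions; the output gate is kept exactly as written in Eq. (22), although many peephole formulations also feed a cell state into it.

```python
# One forward step of a peephole LSTM cell following Eq. (22): the previous cell
# state C_{t-1} is appended to the inputs of the forget and input gates.
# Dimensions and random weights are illustrative, not a trained network.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def peephole_lstm_step(p_t, h_prev, c_prev, params):
    Wf, bf, Wi, bi, Wo, bo, Wc, bc = params
    zfi = np.concatenate([p_t, h_prev, c_prev])   # [p_t, h_{t-1}, C_{t-1}]
    zo  = np.concatenate([p_t, h_prev])           # [p_t, h_{t-1}]
    f_t = sigmoid(Wf @ zfi + bf)                  # forget gate with peephole, Eq. (22)
    i_t = sigmoid(Wi @ zfi + bi)                  # input gate with peephole, Eq. (22)
    o_t = sigmoid(Wo @ zo + bo)                   # output gate as written in Eq. (22)
    c_tilde = np.tanh(Wc @ zo + bc)               # candidate cell state (assumed form)
    c_t = f_t * c_prev + i_t * c_tilde            # Eq. (14)
    h_t = o_t * np.tanh(c_t)                      # Eq. (15)
    return h_t, c_t

rng = np.random.default_rng(7)
n_in, n_hid = 3, 4
params = (
    rng.normal(scale=0.3, size=(n_hid, n_in + 2 * n_hid)), np.zeros(n_hid),  # Wf, bf
    rng.normal(scale=0.3, size=(n_hid, n_in + 2 * n_hid)), np.zeros(n_hid),  # Wi, bi
    rng.normal(scale=0.3, size=(n_hid, n_in + n_hid)),     np.zeros(n_hid),  # Wo, bo
    rng.normal(scale=0.3, size=(n_hid, n_in + n_hid)),     np.zeros(n_hid),  # Wc, bc
)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                                # run a few illustrative steps
    h, c = peephole_lstm_step(rng.normal(size=n_in), h, c, params)
print(h, c)
```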

This example trains an LSTM network to forecast the number of positively tested cases given the number of cases in previous days. The training data contains a single time series, with time steps corresponding to days and values corresponding to the number of cases. The training data is the rate (number of positive tests/number of tests) for each day between 01/22/2020 and 12/22/2020. The data set was taken from the publicly available web site https://covidtracking.com/data/national, where data are updated each day between about 6 pm and 7:30 pm Eastern Time. The initiative relies upon publicly available data from multiple sources. States in the USA are not consistent in how and when they release and update their data, and some may even retroactively change the numbers they report; this can affect the predictions presented in these data visualizations (**Figure 11a-d**). The steps for Example 1 are summarized in **Table 1** (MATLAB 2020b) and the results are illustrated in **Figure 11a-d**. The LSTM network was trained on the first 90% of the sequence and tested on the last 10%.

To make predictions on a new sequence, reset the network state using the "*resetState*" command in MATLAB. Resetting the network state prevents previous predictions from affecting the predictions on the new data. Reset the network state, and then initialize it by predicting on the training data (MATLAB, 2020b). The solid red line in **Figure 11a** and **c** indicates the number of cases predicted for the last 30 days; the results therefore illustrate the forecast of the positive cases over the held-out test period.
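The sketch below mirrors, in Python/NumPy for illustration, the state-handling pattern this MATLAB workflow describes: reset the hidden and cell states, re-initialize the state by running over the training portion, then forecast the test portion in closed loop with each prediction fed back as the next input. The synthetic series and the random, untrained LSTM weights are placeholders for the real data and the trained network, so the numbers themselves are not meaningful.

```python
# Python/NumPy analogy of the MATLAB forecasting workflow (synthetic data,
# random untrained weights standing in for the trained LSTM).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    z = W @ np.concatenate([x, h]) + b
    f, i, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
series = rng.poisson(lam=50_000, size=336).astype(float)  # synthetic stand-in for the daily counts

n_train = int(0.9 * len(series))                  # first 90% for training,
train, test = series[:n_train], series[n_train:]  # last 10% for testing

mu, sd = train.mean(), train.std()                # standardize with training statistics
train_s = (train - mu) / sd

n_hid = 8
W = rng.normal(scale=0.3, size=(4 * n_hid, 1 + n_hid))  # would come from training
b = np.zeros(4 * n_hid)
W_out = rng.normal(scale=0.3, size=n_hid)                # output layer (placeholder)
b_out = 0.0

# "resetState" analogue: zero the states, then initialize them by predicting
# over the training data, as the MATLAB example does.
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in train_s:
    h, c = lstm_step(np.array([x]), h, c, W, b)

# Closed-loop forecasting over the test horizon: feed each prediction back in.
x = W_out @ h + b_out
forecast = []
for _ in range(len(test)):
    forecast.append(x * sd + mu)                  # undo the standardization
    h, c = lstm_step(np.array([x]), h, c, W, b)
    x = W_out @ h + b_out

print(len(forecast), forecast[:3])
```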

**Figure 11.**
*Total daily number of positively tested COVID-19 cases and the rate (positively tested/number of tests) in the USA. (a) Plot of the training time series of the number of positively tested COVID-19 cases together with the forecasted values; (b) comparison of the forecasted number of positively tested cases with the test data set. This graph shows the total daily number of virus tests conducted in each state and, of those tests, how many were positive each day. (c) Plot of the training time series of the rate of positively tested COVID-19 cases; (d) comparison of the forecasted rate of positively tested cases with the rates in the test data set. The trend line in blue shows the actual number of positive cases and the trend line in red shows the number predicted for the last 38 days.*


