*Deep Learning Applications*

*Deep Learning for Subtyping and Prediction of Diseases: Long-Short Term Memory DOI: http://dx.doi.org/10.5772/intechopen.96180*

The feedforward MLP network is evaluated in two stages. First, in the feedforward stage, information comes from the left and each unit evaluates its activation function *f*. The results (outputs) are transmitted to the units connected to the right. The second stage involves the backpropagation (BP) step and is used for training the neural network with a gradient descent algorithm, in which the network parameters are moved along the negative of the gradient of the performance function. The process consists of running the whole network backward and adjusting the weights (and error) in the hidden layer. The feedforward and backward steps are repeated several times, called epochs. The *algorithm stops when the value of the loss (error) function has become sufficiently small*.

#### **Figure 1.**

*… of a simple feedforward MLP where P identifies the input layer, then one or more hidden layers, which are followed by the output layer containing the fitted values.*

**3. Recurrent neural networks**

Because the two stages, outlined in **Figure 1** above, are used in all neural networks, we can extend feedforward MLP networks to sequential or time-series data. To do so, we must address time sequences. Recurrent neural networks (RNNs) address this issue by conducting multiple time steps that unroll the network, add new layers, and recalculate the prediction error, resulting in a very deep network. First, the connections between nodes in the hidden layer(s) form a directed graph along a temporal sequence, allowing information to persist. Through this mechanism, the concept of time creates the RNN's memory. Here, the architecture receives information from multiple previous layers of the network. **Figure 2** outlines the hidden layer of an RNN and demonstrates the nonlinear function of the previous layers and the current input (*p*). Here, the hyperbolic tangent activation function is used to generate the hidden state. The model has memory since the bias term is based on the "past". As a consequence, the outputs from the previous step are fed as input to the current step. Another way to think about RNNs is that a recurrent neural network has multiple copies of the same network, each passing a message to a successor (**Figure 3**). Thus, the output value of the last time point is transmitted back to the neural network, so that the parameter estimation (weight calculation) of each time point is related to the content of the previous time point.

#### **Figure 2.**

*A typical RNN that has a hyperbolic tangent activation function $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ to generate the hidden state. Because of the hidden state, RNNs have a "memory" that captures the information calculated so far. The information in the hidden state is passed further to a second activation function $\frac{1}{1 + e^{-x}}$ to generate the predicted (output) values. In RNNs, the weight (*W*) calculation at each time point of the network model is related to the content of the previous time point. We can process a sequence of input vectors (*p*) by applying a recurrence formula at every time step.*

#### **Figure 3.**

*An unrolled RNN with a hidden state carries pertinent information from one input item in the series to the others. The blue and red arrows in the figure indicate the forward and the backward pass of the network, respectively. With the backward pass, we sum up the contributions of each time step to the gradient. In other words, because W is used in every step up to the output, we need to backpropagate gradients from* t *= 4 through the network all the way to* t *= 0.*

**3.1 Training recurrent neural networks**

Similar to feedforward MLP networks, RNNs have two stages, a forward and a backward stage, which work together during the training of the network. However, their structures and calculation patterns differ. Let us first consider the forward pass.

*Stage 1: Forward pass.*

The forward pass can be summarized in five steps:

1. Summation step. In this step, two different sources of information are combined before the nonlinear activation function takes place. The sources are the weighted input $W^{(p)}p_t$ and the weighted previous hidden state with bias, $h_{t-1}W^{(h)} + b^{(h)}$. Here, $p_t$ and $W^{(p)}$ are the input vector and the input weight matrix, $h_{t-1}$ is the value of the previous hidden state, $W^{(h)}$ is the hidden-state weight matrix relating the previous hidden state to the current one, and $b^{(h)}$ is the bias. Since the previous hidden state and the current input are measured as vectors, each element in each vector is placed in a different orthogonal dimension. The weights of the previous hidden state and the current input are placed in trainable weight matrices. Multiplication of the previous hidden state vector with the hidden-state weights ($h_{t-1}W^{(h)}$) and multiplication of the current input vector with the current input weights ($W^{(p)}p_t$) produce the parameterized state vector and input vector.


2. The hyperbolic tangent activation function is applied to the sum of the two parameterized vectors, $W^{(p)}p_t + h_{t-1}W^{(h)} + b^{(h)}$, to push the output between −1 and 1 (**Figure 2**).

$$f(a_h(t)) = h_t = \frac{e^{\left(W^{(p)}p_t + h_{t-1}W^{(h)} + b^{(h)}\right)} - e^{-\left(W^{(p)}p_t + h_{t-1}W^{(h)} + b^{(h)}\right)}}{e^{\left(W^{(p)}p_t + h_{t-1}W^{(h)} + b^{(h)}\right)} + e^{-\left(W^{(p)}p_t + h_{t-1}W^{(h)} + b^{(h)}\right)}} \tag{2}$$

3. The network input to the output unit at time *t* is obtained by multiplying the output weights with the updated (current) hidden state, $h_t W^{(y)}$. Therefore, the value before the softmax activation function takes place is $a_o(t) = h_t W^{(y)} + b^{(y)}$. Here $W^{(y)}$ and $b^{(y)}$ are the weight and bias of the output layer.

4. The output of the network at time *t* is calculated. The activation function applied to the output layer depends on the type of target (dependent) variable and the values coming from the hidden units. Again, a second activation function (mostly the sigmoid) is applied to the value generated by the hidden node. The predicted value of an RNN block with the sigmoid is:

$$\hat{y}_t = \frac{1}{1 + e^{-\left(W^{(y)}h_t + b^{(y)}\right)}} \tag{3}$$

During the training of the forward pass of the RNN, the network outputs a predicted value ($\hat{y}_i$, $i = t-1, t, t+1, \ldots, t+s$) at each time step. We can imagine the unfolding (unrolling) of the RNN given in **Figure 3**. That is, for each time step, an RNN can be imagined as multiple copies of the same network, one copy for each element of the sequence. For example, if the sequence is a sentence of four words, such as "*I have kidney problem*", then the RNN would be unrolled into a 4-layer neural network, one layer for each word. The output given in (3) is used to train the network using gradient descent after calculation of the error in (4).

5. Then the error (loss function, cost function) at each time step is calculated to start the "*backward pass*":

$$E_t(y_t, \hat{y}_t) = -y_t \log \hat{y}_t \tag{4}$$

Here $y_t$ and $\hat{y}_t$ are the actual and predicted outcomes, respectively. After calculation of the error at each time step, this calculated error is injected backwards into the network to update the network weights at each epoch (iteration). As there are many training algorithms based on some modification of standard backpropagation, the chosen error measure can differ and depends on the selected algorithm. For example, the error $E_t(y) = E_t(y_t, \hat{y}_t)$ given in (4) has an additional term, $E_t(w)$, in the Bayesian regularized neural network (BRANN) training algorithm that penalizes large weights in anticipation of achieving smoother mapping. Both $E_t(y)$ and $E_t(w)$ have coefficients, *β* and *α* respectively (also referred to as regularization parameters or hyper-parameters), that need to be estimated adaptively [1, 16].
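The five steps above can be sketched as a minimal NumPy forward pass. This is an illustrative sketch, not the chapter's implementation: the function names and array shapes are assumptions chosen to match Eqs. (2)–(4).

```python
import numpy as np

def rnn_forward(p_seq, W_p, W_h, b_h, W_y, b_y):
    """Sketch of the 5-step forward pass for one input sequence.

    p_seq : array of shape (T, n_in), the inputs p_t.
    Hidden state: h_t = tanh(p_t W_p + h_{t-1} W_h + b_h)   -- Eq. (2)
    Output:   y_hat_t = sigmoid(h_t W_y + b_y)              -- Eq. (3)
    """
    T = p_seq.shape[0]
    n_hidden = W_h.shape[0]
    h = np.zeros(n_hidden)                      # initial hidden state h_0
    hs, y_hats = [], []
    for t in range(T):
        a_h = p_seq[t] @ W_p + h @ W_h + b_h    # step 1: summation
        h = np.tanh(a_h)                        # step 2: tanh squashes to (-1, 1)
        a_o = h @ W_y + b_y                     # step 3: output pre-activation
        y_hat = 1.0 / (1.0 + np.exp(-a_o))      # step 4: sigmoid output
        hs.append(h)
        y_hats.append(y_hat)
    return np.array(hs), np.array(y_hats)

def cross_entropy_error(y, y_hat):
    """Step 5: error of Eq. (4), summed over all time steps as in Eq. (5)."""
    return float(np.sum(-y * np.log(y_hat)))
```

Note that the same weight matrices `W_p`, `W_h`, `W_y` are reused at every time step, which is exactly the unrolling intuition of **Figure 3**.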


*Stage 2: Backward pass.*


After the forward pass of the RNN, the calculated error (loss function, cost function) at each time step is injected backwards into the network to update the network weights at each iteration. The idea of RNN unfolding in **Figure 3** plays a major part in how the backward pass of RNNs is implemented. Like standard backpropagation in feedforward MLPs, the backward pass consists of a repeated application of the chain rule. For this reason, the type of backpropagation algorithm used for an RNN to update the network parameters is called backpropagation through time (BPTT). In BPTT, the RNN is unfolded in time to construct a feedforward MLP neural network. Then, the generalized delta rule is applied to update the weights $W^{(p)}$, $W^{(h)}$ and $W^{(y)}$ and the biases $b^{(h)}$ and $b^{(y)}$. Remember, the goal of backpropagation is to minimize the error by computing its gradients with respect to the network parameter space ($W^{(p)}$, $W^{(h)}$ and $W^{(y)}$ and the biases $b^{(h)}$ and $b^{(y)}$) and then updating the parameters using stochastic gradient descent. The following equation is used to update the parameters for minimizing the error function:

$$W\_t = W\_{t-1} - a \frac{\partial E}{\partial W}.$$

Here, *a* is the learning rate and $\frac{\partial E}{\partial W}$ is the derivative of the error function with respect to the parameter space. The same rule is applied to all weights and biases in the network. The error at each time step is

$$E_t(y_t, \hat{y}_t) = -y_t \log \hat{y}_t$$

The total error is calculated by the summation of the error from all time steps as:

$$E = \sum_t -y_t \log \hat{y}_t \tag{5}$$

The value of the gradient $\frac{\partial E}{\partial W}$ at each time step is calculated as follows (the same rule is applied to all parameters of the network):

$$\frac{\partial E}{\partial W} = \sum\_{t} \frac{\partial E\_t}{\partial W} \tag{6}$$

To calculate the error gradient given in Eq. (6):

$$\frac{\partial E_t}{\partial W} = \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial h_{t-2}} \frac{\partial h_{t-2}}{\partial h_{t-3}} \cdots \frac{\partial h_0}{\partial W}$$

To calculate the overall error gradient, the chain rule of differentiation given in (7) is used.

$$\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial h_{t-2}} \frac{\partial h_{t-2}}{\partial h_{t-3}} \cdots \frac{\partial h_0}{\partial W} \tag{7}$$

Then the network weights can be updated as follows (moving each weight against its gradient, as in the update rule above):

$$\begin{aligned} W_{t+1}^{(h)} &= W_t^{(h)} - a \frac{\partial E}{\partial W^{(h)}}, \\ W_{t+1}^{(p)} &= W_t^{(p)} - a \frac{\partial E}{\partial W^{(p)}}, \\ W_{t+1}^{(y)} &= W_t^{(y)} - a \frac{\partial E}{\partial W^{(y)}}. \end{aligned}$$
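The gradient-descent update can be sketched in code. This is an illustrative sketch, not the chapter's implementation: it estimates $\frac{\partial E}{\partial W}$ by finite differences purely to make the update rule concrete (BPTT obtains the same gradients analytically via the chain rule in (7)); all function names here are hypothetical.

```python
import numpy as np

def numerical_grad(loss_fn, W, eps=1e-6):
    """Central finite-difference estimate of dE/dW, entry by entry.

    loss_fn is a zero-argument callable that evaluates the error E
    using the current contents of W (mutated in place, then restored).
    """
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W[idx] += eps
        e_plus = loss_fn()
        W[idx] -= 2 * eps
        e_minus = loss_fn()
        W[idx] += eps                         # restore original value
        grad[idx] = (e_plus - e_minus) / (2 * eps)
    return grad

def sgd_step(W, grad, a=0.01):
    """One update: W_{t+1} = W_t - a * dE/dW (same rule for every
    weight matrix and bias vector in the network)."""
    return W - a * grad
```

Applying `sgd_step` repeatedly to each of $W^{(p)}$, $W^{(h)}$, $W^{(y)}$, $b^{(h)}$ and $b^{(y)}$ is exactly the parameter-update loop described above.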

Note that, as given in (2), the current state $h_t = \tanh\left(W^{(p)}p_t + h_{t-1}W^{(h)} + b^{(h)}\right)$ depends on the value of the previous state ($h_{t-1}$) and on the other parameters. Therefore, the derivative of $h_t$ with respect to $h_j$ (here $j = 0, 1, \ldots, t-1$) given in (7) is the derivative of a hidden state that stores memory at time *t*.

The Jacobian for any single time step is $\left(\frac{\partial h_j}{\partial h_{j-1}}\right)$, and for the entire time span it will be:

$$\frac{\partial h_t}{\partial h_j} = \frac{\partial h_t}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial h_{t-2}} \frac{\partial h_{t-2}}{\partial h_{t-3}} \cdots \frac{\partial h_{j+1}}{\partial h_j} = \prod_{k=j+1}^{t} \frac{\partial h_k}{\partial h_{k-1}} \tag{8}$$

while the Jacobian matrix for the hidden state is given by:

$$\frac{\partial h_j}{\partial h_{j-1}} = \left[\frac{\partial h_j}{\partial h_{j-1,1}}, \frac{\partial h_j}{\partial h_{j-1,2}}, \ldots, \frac{\partial h_j}{\partial h_{j-1,\ell}}\right] = \begin{bmatrix} \frac{\partial h_{j,1}}{\partial h_{j-1,1}} & \cdots & \frac{\partial h_{j,1}}{\partial h_{j-1,\ell}} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_{j,\ell}}{\partial h_{j-1,1}} & \cdots & \frac{\partial h_{j,\ell}}{\partial h_{j-1,\ell}} \end{bmatrix} \tag{9}$$

Putting Eqs. (7) and (8) together, we have the following relationship:

$$\frac{\partial E}{\partial W} = \sum_t \sum_j \frac{\partial E_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \left( \prod_{k=j+1}^{t} \frac{\partial h_k}{\partial h_{k-1}} \right) \frac{\partial h_j}{\partial W}. \tag{10}$$

In other words, because the network parameters are used in every step up to the output, we need to backpropagate gradients from the last time step through the network all the way to $t = 0$. Each Jacobian in (10), $\left(\frac{\partial h_j}{\partial h_{j-1}}\right)$, can be written as $W^{(h)T}\mathrm{diag}\left(f'(h_{j-1})\right)$, whose eigen decomposition yields the eigenvalues and eigenvectors. Here $W^{(h)T}$ is the transpose of the hidden-state weight matrix. Consequently, if the largest eigenvalue is greater than 1, the RNN suffers from the exploding gradient problem; if it is smaller than 1, the RNN suffers from the vanishing gradient problem (see **Figure 3**).

Because the derivatives of the tanh (or sigmoid) activation function approach 0 at both ends of their range, saturated units produce near-zero gradients, and zero gradients drive other gradients in previous layers towards 0. Thus, with small values in the Jacobian matrix and multiple matrix multiplications ($t-j$, in particular), the gradient values are shrunk exponentially fast, eventually vanishing completely after a few time steps. As a result, the RNN ends up not learning long-range dependencies. As in RNNs, the vanishing gradient problem is also an important issue for deep feedforward MLPs when multiple hidden layers (with multiple neurons within each) are placed between the input and output layers.

The long short-term memory networks (LSTMs) are a special type of RNN that can overcome the vanishing gradient problem and can learn long-term dependencies. LSTM introduces a memory unit and a gate mechanism to enable capture of the long dependencies in a sequence. The term "long short-term memory" originates from the following intuition. Simple RNN networks have long-term memory in the form of weights. The weights change gradually during the training of the network, encoding general knowledge about the training data. They also have short-term memory in the form of ephemeral activations, which flow from each node to successive nodes [17, 18].

**4.1 The architecture of LSTM**

The neural network architecture for an LSTM block given in **Figure 4** demonstrates that the LSTM network extends the RNN's memory and can selectively remember or forget information through structures called the cell state and three gates. Thus, in addition to the hidden state of an RNN, an LSTM block typically has four more layers. These layers are called the cell state ($C_t$), an input gate ($i_t$), an output gate ($o_t$), and a forget gate ($f_t$). The layers interact with one another in a very special way to generate information from the training data.

#### **Figure 4.**

*Illustration of the long short-term memory block structure. The operator "⨀" denotes element-wise multiplication. $C_{t-1}$, $C_t$, $h_t$ and $h_{t-1}$ are the previous cell state, current cell state, current hidden state and previous hidden state, respectively. $f_t$, $i_t$, $o_t$ are the values of the forget, input and output gates, respectively. $\tilde{C}_t$ is the candidate value for the cell state; $W^{(f)}$, $W^{(i)}$, $W^{(c)}$ and $W^{(o)}$ are weight matrices consisting of the forget gate, input gate, cell state and output gate weights, and $b^{(f)}$, $b^{(i)}$, $b^{(c)}$ and $b^{(o)}$ are the bias vectors associated with them.*

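The eigenvalue condition on the repeated Jacobian products of Eq. (8), which is the vanishing/exploding-gradient behavior motivating the LSTM design, can be illustrated numerically. This is an illustrative sketch only; the matrix size, spectral radii, and number of steps are arbitrary choices, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 8
M = rng.normal(size=(n, n))          # stand-in for a Jacobian dh_k/dh_{k-1}

def scale_to_radius(M, rho):
    """Rescale M so that its largest absolute eigenvalue equals rho."""
    return M * (rho / np.max(np.abs(np.linalg.eigvals(M))))

def product_norm(M, steps):
    """Norm of the repeated product M @ M @ ... @ M (steps factors),
    mimicking the (t - j)-fold Jacobian product of Eq. (8)."""
    P = np.eye(M.shape[0])
    for _ in range(steps):
        P = P @ M
    return float(np.linalg.norm(P))

small = scale_to_radius(M, 0.8)      # largest eigenvalue < 1 -> vanishing
large = scale_to_radius(M, 1.2)      # largest eigenvalue > 1 -> exploding

print(product_norm(small, 50))       # shrinks toward 0
print(product_norm(large, 50))       # grows without bound
```

After 50 multiplications the two products differ by many orders of magnitude, which is why gradients through a plain RNN either vanish or explode over long sequences.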