**4. Long short-term memory**

As mentioned before, the output of an RNN depends on its previous state, that is, on the circumstances of the previous N time steps. Conventional RNNs face difficulty in learning and maintaining long-range dependencies. Consider the unfolded RNN shown in **Figure 3**: each time step requires a new copy of the network, and the gradient must flow backward through every copy, so each factor $\frac{\partial h_j}{\partial h_{j-1}}$ is itself part of a chain rule. For **Figure 3**, for example, $\frac{\partial h_4}{\partial h_0} = \frac{\partial h_4}{\partial h_3}\,\frac{\partial h_3}{\partial h_2}\,\frac{\partial h_2}{\partial h_1}\,\frac{\partial h_1}{\partial h_0}$. Now imagine unrolling the RNN a thousand times: every activation of the neurons inside the network is replicated a thousand times, which means, especially for larger networks, that thousands or even millions of weights need to be updated. Since the Jacobian matrix plays a role in updating these weights, and since the tanh activation $f(a) = h_t = \tanh\left(W^{(y)} h_t + b^{(y)}\right)$ given in Eq. (2) bounds its output to $(-1, 1)$, the entries of the Jacobian matrix will also range between $-1$ and $1$. It can be easily imagined that multiplying thousands of such factors, each smaller than one in magnitude, drives the product toward zero, so the gradient contribution of distant time steps vanishes.
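
To make this concrete, the following is a minimal NumPy sketch, not taken from the chapter: the state size, number of steps, weight scale, and the illustrative recurrence $h_t = \tanh(W h_{t-1} + b)$ are all assumptions. It multiplies the per-step Jacobians $\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(1 - h_t^2)\,W$ across an unrolled sequence and prints the norm of the accumulated product:

```python
import numpy as np

# Illustrative setup (assumed, not from the chapter): a small hidden state,
# 50 unrolled time steps, and the recurrence h_t = tanh(W @ h_{t-1} + b).
rng = np.random.default_rng(0)
hidden, T = 8, 50
W = rng.normal(scale=0.25, size=(hidden, hidden))
b = np.zeros(hidden)

h = rng.normal(size=hidden)
jacobian_product = np.eye(hidden)  # accumulates dh_t/dh_0 = prod_k dh_k/dh_{k-1}

for t in range(1, T + 1):
    h = np.tanh(W @ h + b)
    # Per-step Jacobian dh_t/dh_{t-1} = diag(1 - tanh(.)^2) @ W;
    # the tanh derivative keeps every diagonal entry in (0, 1].
    step_jacobian = np.diag(1.0 - h ** 2) @ W
    jacobian_product = step_jacobian @ jacobian_product
    if t % 10 == 0:
        print(f"t = {t:3d}   ||dh_t/dh_0|| = {np.linalg.norm(jacobian_product):.3e}")
```

With modest weights as chosen here, the printed norm falls by many orders of magnitude over the 50 steps, which is the vanishing-gradient behaviour described above; with much larger weights the same repeated product would instead explode.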
