#### 2.4 The training method of DBN

As with the training process proposed in [10], the training of the DBN is performed in two steps. The first step, pre-training, applies the learning rules of the RBM, i.e., Eqs. (4)–(6), to each RBM independently. The second step is a fine-tuning process that uses the pre-trained parameters of the RBMs and the BP algorithm. These processes are shown in Figure 4 and Eqs. (11)–(13).

$$
\Delta w_{ji}^{L} = -\varepsilon \left( \sum_{k} \frac{\partial E}{\partial w_{kj}^{L+1}} w_{kj}^{L+1} \right) \left( 1 - h_j^{L} \right) v_i^{L} \tag{11}
$$

$$
\Delta b_j^L = -\varepsilon \left( \sum_{k} \frac{\partial E}{\partial w_{kj}^{L+1}} w_{kj}^{L+1} \right) \left( 1 - h_j^L \right) \tag{12}
$$

$$E = \frac{1}{2} \sum_{t=1}^{T} \left( y_t - \tilde{y}_t \right)^2 \tag{13}$$
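A rough sketch (not the authors' implementation) of how the fine-tuning update of Eqs. (11) and (12) could be applied to one layer in NumPy; all array names and shapes here are assumptions made for illustration:

```python
import numpy as np

def finetune_layer(w, b, v, h, grad_next, w_next, eps=0.05):
    """One BP fine-tuning step for layer L, sketching Eqs. (11)-(12).

    w (n_hidden, n_visible), b (n_hidden,) : parameters of layer L
    v (n_visible,), h (n_hidden,)          : input and sigmoid output of layer L
    grad_next, w_next (n_next, n_hidden)   : dE/dw^{L+1} and weights of layer L+1
    """
    # back-propagated term for hidden unit j: sum_k (dE/dw_kj^{L+1}) * w_kj^{L+1}
    back = np.sum(grad_next * w_next, axis=0)
    dw = -eps * np.outer(back * (1.0 - h), v)   # Eq. (11)
    db = -eps * back * (1.0 - h)                # Eq. (12)
    return w + dw, b + db
```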

In the case of reinforcement learning (RL), the output is decided by a probability distribution, e.g., the Gaussian distribution $y \sim \pi(\mu, \sigma^2)$. So the output units are the mean μ and the standard deviation σ instead of a single unit y.

$$
\mu = \sum_{j} w_{\mu j} z_j \tag{14}
$$

$$\sigma = \frac{1}{1 + \exp\left(-\sum\_{j} w\_{\sigma j} z\_{j}\right)}\tag{15}$$

$$\pi(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) \tag{16}$$
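A minimal sketch (not the authors' code) of the stochastic output defined by Eqs. (14)–(16), assuming `z` holds the hidden-layer outputs and `w_mu`, `w_sigma` are the corresponding output weight vectors:

```python
import numpy as np

def gaussian_policy(z, w_mu, w_sigma, rng=None):
    """Sample a prediction y from the Gaussian policy of Eqs. (14)-(16)."""
    rng = rng or np.random.default_rng()
    mu = float(w_mu @ z)                                # Eq. (14): linear output unit
    sigma = 1.0 / (1.0 + np.exp(-float(w_sigma @ z)))   # Eq. (15): sigmoid output unit
    y = rng.normal(mu, sigma)                           # Eq. (16): y ~ N(mu, sigma^2)
    return y, mu, sigma
```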

The learning algorithm of stochastic gradient ascent (SGA) [7] is as follows.

Step 1. Observe an input $\mathbf{x}_t = (x_t, x_{t-1}, \ldots, x_{t-n+1})$.

Step 2. Predict a future data $y_t = x_{t+1}$ according to a probability $y_t \sim \pi(\mathbf{x}_t, \mathbf{w})$ with the ANN model constructed by the parameters $\mathbf{w} = (w_{\mu j}, w_{\sigma j}, w_{ij}, v_{ji})$.

Step 3. Receive a scalar reward/punishment $r_t$ by calculating the prediction error:

Figure 4. The training of the DBN by the BP method.


$$
r_t = \begin{cases}
1 & \text{if } \left(y_t - \tilde{y}_t\right)^2 \le \zeta \\
-1 & \text{else}
\end{cases}
\tag{17}
$$


where ζ is an evaluation constant greater than or equal to zero.

Step 4. Calculate the characteristic eligibility $e_i(t)$ and the eligibility trace $\overline{D}_i(t)$:

$$
e_i(t) = \frac{\partial}{\partial w_i} \ln \left\{ \pi(\mathbf{x}_t, \mathbf{w}) \right\} \tag{18}
$$

$$
\overline{D}_i(t) = e_i(t) + \gamma \overline{D}_i(t-1) \tag{19}
$$

where $0 \le \gamma < 1$ is a discount factor and $w_i$ denotes the $i$th internal variable of the DBN.

Step 5. Calculate the modification $\Delta w_i(t)$:

$$
\Delta w\_i(t) = (r\_t - b)\overline{D}\_i(t) \tag{20}
$$

where $b \ge 0$ denotes the reinforcement baseline (it can be set to zero).

Step 6. Improve the policy of Eq. (16) by renewing its internal variable $w_i$ according to Eq. (21):

$$
w_i \leftarrow w_i + \varepsilon \Delta w_i \tag{21}
$$

where $0 \le \varepsilon \le 1$ is a learning rate.

Step 7. For the next time step $t+1$, return to Step 1.

The characteristic eligibility $e_i(t)$, shown in Eq. (18), expresses how a change of the policy function relates to a change of the internal variable vector of the system. In fact, the algorithm uses the reward/punishment to modify the stochastic policy by renewing its internal variables in Step 4 and Step 5.
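As an illustration only, the whole SGA loop (Steps 1–7) can be sketched as a single update function; the functions `predict` (Eqs. (14)–(16)) and `eligibility` (Eq. (18), computed via Eqs. (22)–(26)) are caller-supplied, and all names and default values here are assumptions rather than the authors' code:

```python
def sga_step(predict, eligibility, w, D, x_t, x_next,
             gamma=0.9, eps=1e-3, zeta=0.1, b=0.0):
    """One SGA iteration over the parameter vector w; x_t is the input of Step 1."""
    y_t, mu, sigma = predict(x_t, w)                       # Step 2: y_t ~ pi(x_t, w)
    r_t = 1.0 if (y_t - x_next) ** 2 <= zeta else -1.0     # Step 3: reward of Eq. (17)
    e = eligibility(x_t, y_t, mu, sigma, w)                # Step 4: Eq. (18)
    D = e + gamma * D                                      # Step 4: trace, Eq. (19)
    dw = (r_t - b) * D                                     # Step 5: Eq. (20)
    w = w + eps * dw                                       # Step 6: Eq. (21)
    return w, D                                            # Step 7: move on to t + 1
```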

The calculation of $e_{w_{\mu j}}(t)$, $e_{w_{\sigma j}}(t)$, and $e_{v_{ij}}(t)$ in the MLP part of the DBN is derived as follows:

$$
e_{w_{\mu j}}(t) = \frac{y_t - \mu_t}{\sigma_t^2} z_j(t) \tag{22}
$$

$$
e_{w_{\sigma j}}(t) = \frac{(y_t - \mu_t)^2 - \sigma_t^2}{\sigma_t^2} \left(1 - \sigma_t\right) z_j(t) \tag{23}
$$

Figure 5. The learning errors given by different learning rates.


$$
e_{v_{ij}}(t) = \left( e_{w_{\mu j}}(t)\, w_{\mu j} + e_{w_{\sigma j}}(t)\, w_{\sigma j} \right) \left(1 - z_j(t)\right) x_i(t) \tag{24}
$$

The $e_i(t)$ of the RBM in the $L$th layer of the DBN is given as follows:

$$
e_{w_{ji}}^{L}(t) = \left( \sum_{k} e_{w_{kj}}^{L+1}(t)\, w_{kj}^{L+1} \right) \left(1 - h_j^{L}\right) v_i^{L} \tag{25}
$$

$$
e_{b_j}^{L}(t) = \left( \sum_{k} e_{w_{kj}}^{L+1}(t)\, w_{kj}^{L+1} \right) \left(1 - h_j^{L}\right) \tag{26}
$$
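A sketch of the MLP-part eligibilities of Eqs. (22)–(24), with illustrative array names (`z` is the hidden-layer output for the input `x`); the RBM-layer terms of Eqs. (25) and (26) follow the same back-propagated pattern layer by layer:

```python
import numpy as np

def mlp_eligibility(y, mu, sigma, z, x, w_mu, w_sigma):
    """Characteristic eligibilities of the MLP part, Eqs. (22)-(24)."""
    e_w_mu = (y - mu) / sigma**2 * z                                     # Eq. (22)
    e_w_sigma = ((y - mu)**2 - sigma**2) / sigma**2 * (1 - sigma) * z    # Eq. (23)
    e_v = np.outer((e_w_mu * w_mu + e_w_sigma * w_sigma) * (1 - z), x)   # Eq. (24)
    return e_w_mu, e_w_sigma, e_v
```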

The learning rate ε in Eq. (21) affects the learning performance of the fine-tuning of the DBN. Different values result in different training errors (mean squared error, MSE), as shown in Figure 5. An adaptive learning rate, defined as a linear function of the learning error, is proposed in Eq. (27):

$$
\varepsilon = \beta \text{MSE}(t-1) \tag{27}
$$

where $\beta \ge 0$ is a constant.
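For example, with β = 1.5 and a previous training error MSE(t − 1) = 0.02, the next update of Eq. (21) uses ε = 1.5 × 0.02 = 0.03.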


#### 2.5 Optimization of meta-parameters

The number of RBMs that constitute the DBN and the number of neurons in each layer seriously affect the prediction performance. In [9], the particle swarm optimization (PSO) method is used to decide the structure of the DBN, and in [13] it is suggested that the random search method [16] is more efficient. In the time series forecasting experiments with DBN and SGA shown in this chapter, these meta-parameters were decided by random search, and the exploration limits are as follows:

• The number of RBMs: [0–3]

• The number of units in each layer of the DBN: [2–20]

• Learning rate of each RBM in Eqs. (4)–(6): [$10^{-5}$–$10^{-1}$]

• Fixed learning rate of SGA in Eq. (21): [$10^{-5}$–$10^{-1}$]

• Discount factor γ in Eq. (19): [$10^{-5}$–$10^{-1}$]

• Coefficient β in Eq. (27): [0.5–2.0]


The algorithm for optimizing these meta-parameters by the random search method is as follows:


Step 1. Set random values of the meta-parameters within the exploration limits.

Step 2. Predict a future data $y_t \approx x_{t+1}$ by the MLP or DBN using the current weighted connections.

Step 3. If the error between $y_t$ and $x_{t+1}$ is reduced enough, store the values of the meta-parameters; or else, if the error is not changed, stop the exploration; else return to Step 1.
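A minimal sketch of this random search under the exploration limits listed above; the evaluation routine `train_and_evaluate` (which trains the DBN with SGA for a given setting and returns the prediction error) and the stopping tolerance are hypothetical:

```python
import numpy as np

def random_search(train_and_evaluate, n_trials=100, tol=1e-6, seed=0):
    """Random search over the DBN/SGA meta-parameters (Steps 1-3)."""
    rng = np.random.default_rng(seed)
    best_error, best_params = np.inf, None
    for _ in range(n_trials):
        # Step 1: draw meta-parameters within the exploration limits
        params = {
            "n_rbms": int(rng.integers(0, 4)),        # number of RBMs: [0-3]
            "n_units": int(rng.integers(2, 21)),      # units per layer: [2-20]
            "rbm_lr": 10.0 ** rng.uniform(-5, -1),    # RBM learning rate: [1e-5, 1e-1]
            "sga_lr": 10.0 ** rng.uniform(-5, -1),    # SGA learning rate: [1e-5, 1e-1]
            "gamma": 10.0 ** rng.uniform(-5, -1),     # discount factor: [1e-5, 1e-1]
            "beta": rng.uniform(0.5, 2.0),            # coefficient of Eq. (27): [0.5-2.0]
        }
        error = train_and_evaluate(params)            # Step 2: train, predict, measure error
        if error < best_error - tol:                  # Step 3: error reduced enough -> store
            best_error, best_params = error, params
        elif abs(error - best_error) <= tol:          # error not changed -> stop exploration
            break
    return best_params, best_error
```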
