4. Reinforcement learning and long short-term memory neural network

In order to teach the LSTM to navigate an unstructured terrain, RL was implemented as described in Ref. [7]. In this approach, an LSTM approximates the value function V of the RL algorithm, which teaches a robotic agent how to navigate a T-shaped maze environment.

This problem is a partially observable Markov decision process, in which the agent is unaware of the full state of the environment and must infer this information using current observations. In this study, these observations are the same feature vectors of the environment that were used for previous LEARCH experiments, and these vectors are the input for the LSTM.

The LSTM outputs represent the advantage values A(s, a) of each action, where a is the action taken in state s. They are used to compute the value of the state, $V(\mathbf{s}) = \max_a A(\mathbf{s}, a)$, i.e. the advantage value of the best available action.

To perform weight updates, truncated backpropagation through time was implemented with RL. A function approximator's prediction error at time step t, $E^{\rm TD}(t)$, is computed using Eq. (13) and is propagated one step back in time through all the units of the network, except for the CECs, for which the error is backpropagated for an indefinite amount of time [7]. Thus,

$$E^{\rm TD}(t) = V(\mathbf{s}(t)) + \frac{r(t) + \gamma V(\mathbf{s}(t+1)) - V(\mathbf{s}(t))}{k} - A(\mathbf{s}(t), a(t)) \tag{13}$$

where r is the immediate reward, γ is a discount factor in the [0, 1] range and k scales the difference between the values of the optimal and suboptimal actions. It is worth mentioning that only the output associated with the executed action receives the error signal.
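For concreteness, the following sketch shows how the error of Eq. (13) can be computed from the LSTM's advantage outputs; the variable names and the values of γ and k are illustrative assumptions rather than the settings used in this chapter.

```python
import numpy as np

def td_error(adv_t, adv_t1, action, reward, gamma=0.95, k=0.1):
    """Advantage-learning error of Eq. (13).

    adv_t, adv_t1 : advantage values A(s(t), .) and A(s(t+1), .)
                    produced by the LSTM for the current and next observation.
    action        : index of the action executed at time t.
    Only the output associated with the executed action receives this error.
    """
    v_t  = np.max(adv_t)    # V(s(t))  = max_a A(s(t), a)
    v_t1 = np.max(adv_t1)   # V(s(t+1))
    target = v_t + (reward + gamma * v_t1 - v_t) / k
    return target - adv_t[action]

# Example with two hypothetical advantage vectors for a 4-action agent
e = td_error(np.array([0.2, 0.5, 0.1, 0.0]),
             np.array([0.3, 0.4, 0.6, 0.1]),
             action=1, reward=-0.2)
```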

During the learning process, the agent can explore the environment using the state values; however, directed exploration (i.e. exploration in which a predictor is used to guide the exploration stage, so as to avoid clueless exploration of the entire state space) is important in order to learn complex terrain navigation. When undirected exploration is conducted, RL tries every action in the same way over all states; in unstructured terrain, however, some states provide ambiguous information about the environment, making it difficult for the agent to determine the state of the environment. Other states provide clear information; therefore, the agent must direct its exploration towards discovering the ambiguous states. To this end, an MLP was implemented for directed exploration. The MLP received the same input as the LSTM, and its objective was to predict the absolute value of the current temporal difference error, $|E^{\rm TD}(t)|$, which indicates which observations are associated with a larger error. The desired MLP output was obtained using Eq. (14), and backpropagation was employed to train the MLP:

$$y_d^v(t) = \left| E^{\rm TD}(t) \right| + \beta y^v(t+1) \tag{14}$$
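A minimal sketch of this target computation follows; β, like the variable names, is an assumed placeholder value, and y_v_next stands for the MLP's own prediction for the next observation.

```python
def exploration_target(td_error_t, y_v_next, beta=0.9):
    """Desired output of the exploration MLP (Eq. (14)): the absolute
    TD error plus a discounted estimate of the future prediction error."""
    return abs(td_error_t) + beta * y_v_next
```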

The MLP output $y^v(t)$ is used as the temperature of the Boltzmann action selection rule, which has the form

$$\frac{e^{A(\mathbf{s},a)/y^v(t)}}{\sum_{b=1}^{n} e^{A(\mathbf{s},b)/y^v(t)}} \tag{15}$$

where n is the number of actions available to the agent.
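The selection rule of Eq. (15) amounts to a softmax over the advantage values whose temperature is the exploration MLP's predicted error; a minimal sketch follows, in which the numerical-stability shift and the minimum temperature are implementation assumptions.

```python
import numpy as np

def boltzmann_select(advantages, temperature, rng=np.random.default_rng()):
    """Sample an action with the Boltzmann rule of Eq. (15).
    A large predicted TD error (high temperature) flattens the distribution,
    directing exploration towards poorly understood observations."""
    z = np.asarray(advantages, dtype=float) / max(temperature, 1e-6)
    z -= z.max()                       # subtract the maximum for stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)
```

When the predicted error is small, the temperature is low and the rule approaches greedy selection of the highest-advantage action; when it is large, actions are chosen nearly uniformly.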


The complete learning process and the manner in which the LEARCH and RL-LSTM systems are connected is shown in Figure 7. The entire process occurs offline. First, the LEARCH algorithm iterates until the required cost map M is obtained. Then, the RL-LSTM algorithm begins the process of training the LSTM using the costs converted into rewards r. The feature map F is obtained from the robotic agent and used by both systems.

Figure 7. LEARCH-RL-LSTM system showing the manner in which the two systems are connected to train the LSTM. The entire process occurs offline. First, the LEARCH algorithm iterates until the required cost map M is obtained. Then, the RL-LSTM algorithm begins the process of training the LSTM using the costs converted into rewards r. The feature map F is obtained from the robotic agent and used by both systems.
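As a rough sketch of this offline pipeline, the fragment below negates and normalises the LEARCH costs to obtain rewards and outlines the two training stages; the cost-to-reward mapping and the stand-in function names (learch_train, rl_lstm_train) are assumptions for illustration, not the chapter's actual implementation.

```python
import numpy as np

def costs_to_rewards(cost_map):
    """Convert a LEARCH cost map M into rewards r for the RL stage.
    Here a cell's reward is simply its negated, normalised cost; the
    mapping actually used in the chapter may differ."""
    c = np.asarray(cost_map, dtype=float)
    return -(c - c.min()) / (c.max() - c.min() + 1e-9)

# Offline skeleton: LEARCH runs first, then RL-LSTM trains on the rewards.
# feature_map (F) comes from the robotic agent and feeds both systems.
# cost_map = learch_train(feature_map, demonstrations)   # cost map M
# rewards  = costs_to_rewards(cost_map)                   # rewards r
# policy   = rl_lstm_train(feature_map, rewards)          # trained LSTM
```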

The LEARCH algorithm builds a cost function; however, as noted in Section 2, the cost function's capability for generalization is limited and decays as the number of training paths grows. As the motivation for employing a cost function is to obtain the cost of traversing a patch of terrain so that the path planning system can compute the optimal path with the minimal traversal cost, we propose the extraction of terrain patches having descriptive characteristics of rough terrain for navigation. Hence, the traversal costs for these environment features can be determined using LEARCH, and the costs can be transformed into rewards for an RL algorithm [10].

In order to prove the generalization and long-term memory capabilities of LSTM, training was performed using patches of terrain containing representative features of rough terrain; that is, an entire map is not used to train the LSTM (Figure 8). In this way, an efficient training phase is achieved by taking advantage of the above-mentioned capabilities. In the next section, we show the results of the experiments conducted to confirm these capabilities. In addition, we prove the efficacy of the LSTM for mapping tasks that require inference of hidden states, i.e. smoothing or noise recognition.
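One simple way to realise this patch-based training is to slice fixed-size windows around representative cells of the feature map F; the patch size and the choice of centre cells below are illustrative assumptions.

```python
import numpy as np

def extract_patches(feature_map, centres, size=8):
    """Cut size x size training patches from the feature map F around
    hand-picked cells holding representative rough-terrain features."""
    half = size // 2
    return [feature_map[r - half:r + half, c - half:c + half]
            for r, c in centres]

# Example: three hypothetical patch centres on a 64 x 64 map with 5 features
F = np.random.rand(64, 64, 5)
patches = extract_patches(F, centres=[(10, 12), (30, 40), (50, 20)])
```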


Figure 8. (a) Example of a real environment modelled as a grid map. (b) Patches of terrain used for training are marked with an orange box.

The lower right corner of Figure 9 shows the color code used to illustrate the manner in which the environments were designed. Each environment differed by 5% from the previous one, i.e. 20% of the states in map 5 differed from those of map 1. These maps are shown in Figure 9.

Figure 9. Maps used in experiments. Lower right corner of the second row: colour code used to represent environment features (plain terrain, grass, low slope, gravel, trees, water, high slope).

Table 1 lists the results of experiments conducted using the LEARCH system alone to learn the navigation policies and cost functions of the five maps. In order to test the capability of LEARCH to reuse knowledge learned in previous navigation episodes, the following process was employed. Once LEARCH had learned the navigation policies and cost function of map 1, this knowledge was used as initial knowledge to start navigation episodes involving the remaining maps. As the first row of Table 2 shows, it was not necessary to retrain the LEARCH system to learn the demonstrated behavior and cost function of map 2; in other words, the LEARCH system could apply the knowledge learned from map 1 to map 2. However, this behavior did not occur for the other maps. For maps 3, 4 and 5, when attempting to reuse the knowledge learned from map 1, it was necessary to retrain the LEARCH system. In these learning episodes, an increased number of iterations was necessary in order to acquire the new knowledge; as is apparent when Table 1 is compared with the first row of Table 2, the previous knowledge learned using map 1 is even detrimental to the system performance when new maps are presented.
