2. Learning to Search

This section presents an overview of the LEARCH algorithm, along with the results of generalization tests conducted using LEARCH. The objective is to explain the need for memory for this algorithm in order to improve its performance.

Previous works have constructed cost maps for the planning systems of the robot using functions that map features of the environment to scalar values that represent the traversability of the terrain [1, 4, 5]. These works try to resolve the task of autonomous robot navigation in unstructured terrain; however, these approaches do not address the issue as an integrated system.

In most cases, a human expert heuristically provides the information used to construct the cost maps. In other cases, this information is used to construct a cost function that automatically maps features to costs; note, however, that this cost function is still established heuristically. The problem with this methodology is that the traversability of a given feature is difficult for a robot to quantify. In contrast, humans can determine traversal trajectories relatively easily.

An alternative to establishing cost functions heuristically is to use an algorithm that automatically constructs and tunes a cost function. Learning to Search (LEARCH) [1] is an algorithm that uses learning from demonstration to construct a cost function. In this approach, a human expert exhibits a desirable behavior (a sample path) over certain terrain; LEARCH then adjusts a cost function to match the behavior exhibited by the expert.

LEARCH has the advantage that the cost function to be constructed can be chosen from among linear functions, parametric functions, neural networks and decision trees, to name a few [1]. In particular, neural networks and other learning machines such as support vector machines (SVMs) have exhibited considerable generalization capability in different scenarios [6].

However, as will be demonstrated in this paper, the LEARCH generalization capability decreases as the number of sample paths increases, because the error decays over time during training.

In this study, this problem with the LEARCH algorithm is addressed using a long short-term memory (LSTM) neural network and reinforcement learning (RL). Recurrent neural networks are capable of finding hidden states, as shown in [7]. Therefore, we propose a learning system that allows a navigation agent to learn navigation policies and determine complex traversability cost functions. Furthermore, this system can retain the knowledge learned in past navigation episodes in memory and generalize this knowledge for use in new episodes.

The LEARCH algorithm is based on the concept of inverse optimal control [8], which addresses the problem of finding a cost map such that a known trajectory through an environment is navigated optimally using this map. In addition, non-linear maximum margin planning [1] with a support vector regression machine [9] is used to learn behavior from an expert, that is, a human expert.

Let S be a state space operated on by a path planner, and let F be a feature space defined over S. Then, for every x ∈ S, a corresponding feature vector Fx ∈ F exists. The Fx vectors are the inputs for the cost function C, which maps F to scalar values. C is defined as a weighted sum of functions Ri ∈ R, where R is a space of functions of limited complexity that map from the feature space to a scalar [1].

We define a path P as a sequence of states x ∈ S that leads from the start point s to the goal point g. The cost of each state is C(Fx); thus, the cost of the entire path is defined as

$$\mathbb{C}(P) = \sum\_{\mathbf{x} \in P} \mathbb{C}(F\_{\mathbf{x}}) \tag{1}$$
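As a concrete illustration of Eq. (1), the path cost can be computed by accumulating the cost of each visited state. The sketch below assumes a grid environment whose feature map stores one feature vector Fx per cell; the names `path_cost`, `feature_map` and `cost_fn` are illustrative, not from the chapter:

```python
import numpy as np

def path_cost(path, feature_map, cost_fn):
    """Eq. (1): the cost of a path is the sum of C(F_x) over its states.

    path        -- sequence of (row, col) grid states
    feature_map -- H x W x d array; feature_map[r, c] is the feature vector F_x
    cost_fn     -- the cost function C, mapping a feature vector to a scalar
    """
    return sum(cost_fn(feature_map[r, c]) for r, c in path)

# Example: a toy cost function over a random 10 x 10 map with 4 feature channels.
rng = np.random.default_rng(0)
features = rng.random((10, 10, 4))
print(path_cost([(0, 0), (0, 1), (1, 1)], features, lambda f: float(f.sum())))
```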

Consider a path provided by an expert, i.e. a sample path Pe, which runs from a start state se to a goal state ge. In order to learn from the expert demonstration, a cost function such that Pe is the optimal path from se to ge is required. This task can be expressed as the following optimization problem [1]:

$$\min\_{\mathbb{C}} O[\mathbb{C}] = \lambda \text{REG}(\mathbb{C}) + \sum\_{\mathbf{x} \in P\_e} \mathbb{C}(F\_{\mathbf{x}}) - \min\_{\hat{P}} \left[ \sum\_{\mathbf{x} \in \hat{P}} \left( \mathbb{C}(F\_{\mathbf{x}}) - L\_e(\mathbf{x}) \right) \right] \tag{2}$$

where λ is a term that scales the regularization term REG(C), P̂ is a path computed by a planner over the cost space, and Le is a loss function that encodes the similarity between paths. The latter is defined as

$$L\_e(\mathbf{x}) = \begin{cases} -1 & \text{if } \mathbf{x} \in P\_e \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
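For illustration, the objective of Eqs. (2) and (3) can be evaluated directly when the planner can be run on loss-augmented costs. The sketch below omits the regularization term and uses illustrative names; `plan` is assumed to return the minimum-cost path over a given cost map:

```python
import numpy as np

def margin_objective(cost_of, feature_map, sample_path, plan, se, ge):
    """Eq. (2) with REG omitted: the sample path's cost minus the minimal
    loss-augmented path cost. Per Eq. (3), L_e is -1 on sample-path states,
    so subtracting it adds +1 to their cost in the inner minimization."""
    h, w = feature_map.shape[:2]
    on_sample = set(sample_path)
    # Loss-augmented cost map: C(F_x) - L_e(x).
    aug = np.array([[cost_of(feature_map[r, c]) + ((r, c) in on_sample)
                     for c in range(w)] for r in range(h)])
    p_hat = plan(aug, se, ge)                       # inner min over paths
    sample_cost = sum(cost_of(feature_map[r, c]) for r, c in sample_path)
    return sample_cost - sum(aug[r, c] for r, c in p_hat)
```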

The sub-gradient is used to minimize O[C]. In the cost function space, the sub-gradient is

$$\nabla O\_F[\mathbb{C}] = \lambda \nabla \text{REG}\_F[\mathbb{C}] + \sum\_{\mathbf{x} \in P\_e} \delta\_F(F\_{\mathbf{x}}) - \sum\_{\mathbf{x} \in P\_\*} \delta\_F(F\_{\mathbf{x}}), \tag{4}$$

where δ is the Dirac delta and P<sup>∗</sup> is the optimal path for the current cost map.

In order to avoid overfitting, a restricted cost function space is considered. C is now defined as the space of weighted sums of functions Ri ∈ R, where R is a space of functions of limited complexity that map from the feature space to a scalar. The possible choices for R include linear functions, parametric functions, neural networks and decision trees. Thus,

$$\mathcal{C} = \left\{ \mathbb{C} | \mathbb{C} = \sum\_{i} \eta\_{i} R\_{i}(F), R\_{i} \in \mathcal{R}, \,\eta\_{i} \in \mathbb{R} \right\} \tag{5}$$

$$\mathcal{R} = \left\{ R \mid R : \mathcal{F} \to \mathbb{R} \,\land\, \text{REG}(R) < \nu \right\}$$

The functional gradient is projected onto this direction set by finding the element R<sup>∗</sup> ∈ R that maximizes the inner product 〈−∇OF[C], R〉. This maximization can be regarded as a learning problem. Here,

$$\begin{aligned} R\_\* &= \arg\max\_{R} \langle -\nabla O\_F[\mathbb{C}], R \rangle \\ &= \arg\max\_{R} \sum\_{\mathbf{x} \in P\_e \cup P\_\*} \alpha\_{\mathbf{x}} y\_{\mathbf{x}} R(F\_{\mathbf{x}}) \end{aligned} \tag{6}$$

where

$$\alpha\_{\mathbf{x}} = \left| \nabla O\_{F\_{\mathbf{x}}}[\mathbb{C}] \right|, \qquad y\_{\mathbf{x}} = -\operatorname{sgn}\left( \nabla O\_{F\_{\mathbf{x}}}[\mathbb{C}] \right)$$

As in [1], the projection of the functional gradient can be regarded as a weighted classification problem. The regression targets yx are positive in regions of the feature space that the planned path visits more than the sample path, and negative in the opposite case. Here, this approach is viewed as minimizing the error induced by visiting states that are not in the sample path. The visitation count U is the cumulative count of the states x ∈ P such that Fx = F. The visitation counts can be split into positive and negative components, depending on whether they correspond to the current planned path or the sample path:

$$\begin{aligned} \mathcal{U}\_{+}(F) &= \sum\_{\mathbf{x} \in P\_\*} \delta\_F(F\_{\mathbf{x}}) \\ \mathcal{U}\_{-}(F) &= \sum\_{\mathbf{x} \in P\_e} \delta\_F(F\_{\mathbf{x}}) \end{aligned} \tag{7}$$
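For illustration, the visitation counts of Eq. (7) and the weighted examples of Eq. (6) can be accumulated in a single pass over the two paths. The sketch below assumes the same grid representation as before; the function names are illustrative:

```python
from collections import defaultdict

def visitation_counts(planned_path, sample_path, feature_map):
    """Eq. (7): U+ counts feature visits along the planned path P*,
    U- counts feature visits along the sample path Pe."""
    u_plus, u_minus = defaultdict(int), defaultdict(int)
    for r, c in planned_path:
        u_plus[tuple(feature_map[r, c])] += 1
    for r, c in sample_path:
        u_minus[tuple(feature_map[r, c])] += 1
    return u_plus, u_minus

def weighted_examples(u_plus, u_minus):
    """Eq. (6): for each visited feature, the weight alpha_x is the gradient
    magnitude and the target y_x is its negated sign (+1 where the planned
    path over-visits, -1 where the sample path over-visits)."""
    examples = []
    for f in set(u_plus) | set(u_minus):
        grad = u_minus[f] - u_plus[f]   # per-feature gradient of Eq. (4), REG ignored
        if grad != 0:
            alpha, y = abs(grad), (-1 if grad > 0 else 1)
            examples.append((f, y, alpha))
    return examples
```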

Ignoring the regularization term of Eq. (4), the regression targets and weights can be computed as functions of the visitation counts. For this purpose, the vector U indicating the visitation counts combines the two components:

$$\mathcal{U}(F) = \mathcal{U}\_{+}(F) - \mathcal{U}\_{-}(F) = \sum\_{\mathbf{x} \in P\_\*} \delta\_F(F\_{\mathbf{x}}) - \sum\_{\mathbf{x} \in P\_e} \delta\_F(F\_{\mathbf{x}}) \tag{8}$$

Then, the regressor targets can be obtained from these visitation counts. Further, with this regressor, the cost function can be expressed as

$$\mathbb{C}\_j = \mathbb{C}\_{j-1} \, e^{\eta R\_j} \tag{9}$$

where j = {1, 2, 3, …, n}, with n being the number of iterations; R is the regressor; and η is the learning rate.

With this information, a support vector regressor (SVR) was used to learn the cost function Ri of Eq. (5). Note that, after LEARCH is executed, the trained SVR can map the features of the terrain directly into traversal costs. Figure 2 shows a diagram of the algorithm used for training.

Figure 2. LEARCH diagram. F is the feature map, M is the cost map, se and ge represent the start and the goal points, respectively, P\* is the optimal path, Pe is the sample path, T is the vector of regressor target values for training a regressor R, C(F) is the cost function and j = {1, 2, 3, …, n}, where n is the number of iterations needed to train the cost function.

The procedure is explained as follows:

• A cost map M is constructed using a feature map and a cost function (C(F)).

• The path planner D<sup>∗</sup> computes an optimal path P<sup>∗</sup> from the start point se to the goal point ge over M.

• Using the sample path from the expert Pe and P<sup>∗</sup>, the vector U indicating the visitation counts is constructed as shown in Eq. (8).

• Using U, the regressor targets are computed, and the regressor is trained.

• The cost function is updated using Eq. (9).

• This process is repeated until Pe and P<sup>∗</sup> are equal.
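Putting Eqs. (6)-(9) and the steps above together, the training procedure can be sketched as an alternation between planning and regression. The following is a minimal sketch rather than the authors' implementation; it assumes a grid planner `plan` (such as D*) and a scikit-learn-style regressor (e.g. `sklearn.svm.SVR`), and it applies the exponentiated update of Eq. (9) with unit example weights:

```python
import numpy as np

def learch(feature_map, sample_path, se, ge, plan, make_regressor,
           eta=0.1, n_iters=50):
    """Schematic LEARCH training loop (see Figure 2 and the steps above).

    plan(M, se, ge)  -- returns an optimal path over cost map M (e.g. D*)
    make_regressor() -- returns a fresh regressor R_j with fit(X, y) and
                        predict(X), e.g. sklearn.svm.SVR()
    """
    h, w, d = feature_map.shape
    all_features = feature_map.reshape(-1, d)
    log_cost = np.zeros(h * w)            # log C accumulates eta * R_j, Eq. (9)
    regressors = []
    for _ in range(n_iters):
        cost_map = np.exp(log_cost).reshape(h, w)
        p_star = list(plan(cost_map, se, ge))   # the planner's optimal path P*
        if p_star == list(sample_path):
            break                               # P* reproduces Pe: training is done
        # Regression targets: raise costs along P*, lower them along Pe.
        states = p_star + list(sample_path)
        X = np.array([feature_map[r, c] for r, c in states])
        y = np.array([1.0] * len(p_star) + [-1.0] * len(sample_path))
        regressor = make_regressor()
        regressor.fit(X, y)
        regressors.append(regressor)
        log_cost += eta * regressor.predict(all_features)  # multiplicative update
    return np.exp(log_cost).reshape(h, w), regressors
```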

2.1. Learning to Search experiments

This section describes the experiments conducted to test the LEARCH generalization capabilities. Satellite-like images were selected for feature extraction. The patches of terrain were divided into grid cells, and a feature vector was created for each cell. Each dimension of these vectors held a value for one of the following features of the environment: vegetation density, slope, the presence of gravel or rocks and the presence of water. Each scalar value represents an abstraction of the feature; for example, the scalar value for vegetation represents its density, so a patch of terrain with grass would be represented with a low value and a patch of terrain with a tree would be represented with a high value. In these experiments, vectors of dimension 4 were used; for example, the patch of terrain with grass would be represented by the vector [0, 2, 0, 0]. A human expert traced sample paths over the terrain, as shown in Figure 1.

Figure 1. Left: Satellite-like image. Right: Grid cells and lines with different colours representing sample paths.
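For concreteness, such feature vectors can be assembled as one abstract scalar per channel. In the snippet below, the channel ordering is an assumption chosen so that the grass cell matches the [0, 2, 0, 0] example quoted above, and the other two vectors are purely hypothetical illustrations:

```python
import numpy as np

# One abstract scalar per feature channel (vegetation density, slope,
# gravel/rocks, water). The channel order is an assumption picked so that the
# grass cell matches the [0, 2, 0, 0] example in the text.
grass = np.array([0, 2, 0, 0])   # low vegetation value: easy to traverse
tree  = np.array([0, 9, 0, 0])   # high vegetation value: costly to traverse
creek = np.array([0, 0, 2, 7])   # hypothetical cell with gravel and water
feature_map = np.stack([grass, tree, creek]).reshape(1, 3, 4)  # a 1 x 3 grid
```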


Figure 3. Example of a cost map corresponding to the real map shown in Figure 1.

The traversal costs can be color coded for demonstration purposes, as shown in Figure 3, where values near 20 represent the patches with the highest crossing difficulty and those near zero represent terrain that is easy to traverse.

In this experiment, in order to test the generalization capabilities of LEARCH (i.e. its ability to use policies learned in past navigation episodes during new episodes), we employed the following methodology. First, we trained the LEARCH system described in Figure 2 using an initial map and one sample path. Then, we incrementally added to this learned knowledge (using the same initial map) by incorporating more paths to be learned by LEARCH, one by one. The results of these experiments are shown in Figure 4: image (a) shows the cost map obtained with a cost function trained using one sample path, image (b) shows the results obtained by adding one more path to the training, and so on until five sample paths are used. Image (f) of Figure 4 shows the cost map obtained with a cost function trained using eight sample paths.

Comparing cost map (e) of Figure 4 with cost map (f), it is apparent that information from the environment is missing after the cost function is trained with more sample paths. That is, some states are no longer recognized as states with a high traversal cost. It is important to note that the costs most affected are those of states furthest from the sample paths, in comparison with the costs of the corresponding states on the original paths; therefore, the generalization capability of the LEARCH system is very poor. This problem renders the task of finding the optimal path difficult, and the path planner could compute a path that traverses dangerous terrain. Further, note that using only a few sample paths does not solve the problem of obtaining a system with knowledge of a greater number of area costs than those attached to the sample paths, because such sample paths cannot contain all the information necessary for a good and complete representation of the environment.

Figure 4. Cost maps obtained using different numbers of sample paths over the terrain.


Figure 5. Left: cost map computed with five representative paths. Right: cost map computed with five non-representative paths.

In this study, other experiments were performed to demonstrate the limitations of LEARCH, in which we trained the system using nonrepresentative environment paths. That is, the paths taught by the expert traversed many cells of the environment that did not contain sufficiently representative features of the environment, or cells that did not have significant differences in cost. Figure 5 shows examples of these paths, which caused the LEARCH system to acquire nonrepresentative knowledge that was then generalized over the cost map. The cost map at the left of Figure 5 is less generalized compared with the more descriptive costs shown on the map at the right of Figure 5.

Therefore, in order to address the problems with the LEARCH system, we propose the use of an LSTM as part of the system. Inclusion of an LSTM allows the navigation agent to learn navigation policies and complex traversability cost functions and, furthermore, to retain memory of the knowledge learned in past navigation episodes for reuse during new episodes. The latter capability allows expensive retraining to be avoided when the navigation environment is similar to those already explored by the agent, and it allows hidden states of the extremely large state space represented by unstructured or rough terrain to be recognized. We present the LSTM in the next section.

3. Long short-term memory

The forget gate allows the constant error carousel (CEC) to reset its state when it holds information that is no longer useful. A combination of a CEC and its input, output and forget gates is called a memory cell (Figure 6).

Figure 6. Graphic representation of a memory cell.

The activation updates at each time step t in this type of neural network are computed as follows. For the hidden unit activation yh, the output unit activation yk, the input gate activation yin, the output gate activation yout and the forget gate activation yφ, we have

$$y\_i(t) = f\_i\!\left( \sum\_m w\_{im} y\_m(t-1) \right) \tag{10}$$

where wim is the weight of the connection from unit m to unit i. For the activation function fi, the standard logistic sigmoid function is chosen for all units, except for the output units, for which it is the identity function [7]. The CEC activation, also known as the memory cell state, is calculated using

$$s\_{c\_j}(t) = y\_{\varphi\_j}(t) \, s\_{c\_j}(t-1) + y\_{in\_j}(t) \, g\!\left( \sum\_m w\_{c\_j m} y\_m(t-1) \right) \tag{11}$$

where g is a logistic sigmoid function scaled to the [-2, 2] range and scj(0) = 0. Finally, the activation update for the memory cell output is calculated from the memory cell state and the output gate activation.
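To make Eqs. (10) and (11) concrete, one update step of a single memory cell can be written directly from the definitions. This is a sketch under the stated assumptions (logistic gates, g scaled to [-2, 2]); the weight-vector names are illustrative, and the cell's final output combines the returned state and output gate activation as described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell_step(y_prev, s_prev, w_in, w_out, w_phi, w_c):
    """One time step of a memory cell following Eqs. (10) and (11).

    y_prev -- vector of all unit activations y_m(t - 1)
    s_prev -- previous CEC state s_c(t - 1), with s_c(0) = 0
    w_*    -- incoming weight vectors of the three gates and the cell input
              (illustrative names, one weight per source unit m)
    """
    y_in = sigmoid(w_in @ y_prev)           # input gate activation, Eq. (10)
    y_out = sigmoid(w_out @ y_prev)         # output gate activation, Eq. (10)
    y_phi = sigmoid(w_phi @ y_prev)         # forget gate activation, Eq. (10)
    g = 4.0 * sigmoid(w_c @ y_prev) - 2.0   # logistic sigmoid scaled to [-2, 2]
    s = y_phi * s_prev + y_in * g           # CEC state update, Eq. (11)
    return s, y_out                         # cell output combines s and y_out
```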
