**4.1 Backpropagation through time (BPTT) algorithm for FRNN**

Recurrent Neural Network with Human Simulator Based Virtual Reality 105

The Backpropagation Through Time (BPTT) algorithm performs an exact computation of the gradient of the error measure for use in weight adaptation. In this section the BPTT algorithm is derived for a (type 1) FRNN using a Sum Squared Error measure (Omlin, 1996).

#### **Methods of derivation of the algorithm**

There are two different methods of deriving the BPTT algorithm; both are shown in this report.


The BPTT algorithm can be summarized as follows:

1. Set the initial time *n* = *n0*.
2. Calculate the N neuron output values for time *n* using the network.
3. Recursively calculate ei(*m*) and then δi(*m*), working backwards in time from *m* = *n* to *m* = *n0*.
4. Calculate the weight updates Δ*wij* for all *i*, *j*.
5. Update the weights *wij*.
6. Increase time *n* to *n*+1 and go back to step 2.
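Assuming the fully recurrent tanh network defined in eqs. (2)-(4) below, these steps can be sketched in Python. This is a minimal illustration, not the chapter's implementation: the function name `bptt_epoch`, the learning rate `mu`, and the choice of a target only at the final step are all assumptions.

```python
import numpy as np

def bptt_epoch(W, xs, d, mu=0.01):
    """One epochwise-BPTT pass for a fully recurrent net y(t+1) = tanh(W [x; y]).

    W  : (n, m + n) weight matrix (input columns first, recurrent columns last)
    xs : list of input vectors x(t) for t = n0 .. n-1
    d  : target vector for the final step (targets elsewhere are undefined)
    """
    n = W.shape[0]
    y = np.zeros(n)
    zs, ys = [], []
    # Steps 1-2: run forward over the whole series, storing activations.
    for x in xs:
        z = np.concatenate([x, y])
        y = np.tanh(W @ z)
        zs.append(z)
        ys.append(y)
    # Step 3: work backwards in time; e(t) is nonzero only where a target exists.
    dW = np.zeros_like(W)
    delta_next = np.zeros(n)
    for t in reversed(range(len(xs))):
        e = (d - ys[t]) if t == len(xs) - 1 else np.zeros(n)
        Wrec = W[:, -n:]  # recurrent part of W
        # Error injected at t plus error fed back through the recurrent weights.
        delta = (1 - ys[t] ** 2) * (e + Wrec.T @ delta_next)
        dW += np.outer(delta, zs[t])  # Step 4: accumulate the weight updates.
        delta_next = delta
    return W + mu * dW  # Step 5: update the weights.
```

Because the gradient is exact, a single pass with a small enough learning rate reduces the sum squared error on the same series.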


#### **4.2 Real-time recurrent learning (RTRL) algorithm for FRNN**

A real-time training algorithm for recurrent networks, known as Real-Time Recurrent Learning (RTRL), was derived by several authors (Williams et al., 1995). This algorithm can be summarized as follows.

In deriving a gradient-based update rule for recurrent networks, we now leave the network connectivity essentially unconstrained. We simply suppose that we have a set of input units, *I* = {*xk(t), 0<k<m*}, and a set of other units, *U* = {*yk(t), 0<k<n*}, which can be hidden or output units. To index an arbitrary unit in the network we can use

$$z\_k(t) = \begin{cases} x\_k(t) & \text{if } \quad k \in I \\ y\_k(t) & \text{if } \quad k \in U \end{cases} \tag{2}$$

Let **W** be the weight matrix with *n* rows and *n+m* columns, where *wi,j* is the weight to unit *i* (which is in *U*) from unit *j* (which is in *I* or *U*). Units compute their activations in the now familiar way, by first computing the weighted sum of their inputs:

$$net\_k(t) = \sum\_{l \in U \cup I} w\_{kl} z\_l(t) \tag{3}$$

where the only new element in the formula is the introduction of the temporal index *t*. Units then compute some non-linear function of their net input

$$y\_k(t+1) = f\_k(net\_k(t))\tag{4}$$
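As a minimal sketch, eqs. (2)-(4) amount to the following single-step update; tanh is an assumed choice for the nonlinearity *fk*:

```python
import numpy as np

def step(W, x, y):
    """One time step of the fully recurrent network.

    z concatenates external inputs and unit outputs, so W has one column
    per element of I followed by one column per element of U.
    """
    z = np.concatenate([x, y])  # eq. (2): z_k = x_k for k in I, y_k for k in U
    net = W @ z                 # eq. (3): net_k(t) = sum over l of w_kl z_l(t)
    return np.tanh(net)         # eq. (4): y_k(t+1) = f_k(net_k(t))
```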

Usually, both hidden and output units will have non-linear activation functions. Note that external input at time *t* does not influence the output of any unit until time *t+1*. The network is thus a discrete dynamical system. Some of the units in *U* are output units, for


which a target is defined. A target may not be defined for every single input, however. For example, if we are presenting a string to the network to be classified as either grammatical or ungrammatical, we may provide a target only for the last symbol in the string. In defining an error over the outputs, therefore, we need to make the error time-dependent too, so that it can be undefined (or 0) for an output unit for which no target exists at present. Let *T(t)* be the set of indices *k* in *U* for which there exists a target value *dk(t)* at time *t*. We are forced to use the notation *dk* rather than *tk* for targets here, as *t* now refers to time. Let the error at the output units be

$$e\_k(t) = \begin{cases} d\_k(t) - y\_k(t) & \text{if } \quad k \in T(t) \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

and define our error function for a single time step as the sum squared error

$$E(t) = \frac{1}{2}\sum\_{k \in U} \left[e\_k(t)\right]^2$$
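The masked error of eq. (5) can be sketched as follows; the names `step_error` and `has_target` (a boolean mask standing in for *T(t)*) are illustrative:

```python
import numpy as np

def step_error(d, y, has_target):
    """e_k(t) from eq. (5): target minus output where a target exists, else 0."""
    e = np.where(has_target, d - y, 0.0)
    E = 0.5 * np.sum(e ** 2)  # sum squared error for this single time step
    return e, E
```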

The error function we wish to minimize is the sum of this error over all past steps of the network

$$E\_{\text{total}}(t\_0, t\_1) = \sum\_{\tau=t\_0+1}^{t\_1} E(\tau) \tag{6}$$

Now, because the total error is the sum of all previous errors and the error at this time step, the gradient of the total error is likewise the sum of the gradient for this time step and the gradient for previous steps

$$
\nabla\_w E\_{\text{total}}(t\_0, t+1) = \nabla\_w E\_{\text{total}}(t\_0, t) + \nabla\_w E(t+1) \tag{7}
$$

As a time series is presented to the network, we can accumulate the values of the gradient, or equivalently, of the weight changes. We thus keep track of the value

$$
\Delta w\_{ij}(t) = -\mu \frac{\partial E(t)}{\partial w\_{ij}} \tag{8}
$$

After the network has been presented with the whole series, we alter each weight *wij* by

$$\sum\_{t=t\_0+1}^{t\_1} \Delta w\_{ij}(t) \tag{9}$$

We therefore need an algorithm that computes

$$-\frac{\partial E(t)}{\partial w\_{ij}} = -\sum\_{k \in U} \frac{\partial E(t)}{\partial y\_k(t)} \frac{\partial y\_k(t)}{\partial w\_{ij}} = \sum\_{k \in U} e\_k(t) \frac{\partial y\_k(t)}{\partial w\_{ij}}\tag{10}$$

at each time step *t*. Since we know *ek(t)* at all times (the difference between our targets and outputs), we only need to find a way to compute the second factor

$$\frac{\partial y\_k(t)}{\partial w\_{ij}}\tag{11}$$


Fig. 13. a. Configurations of joint coordinates of the manipulator arm with 6-DOF. b. Structure of the humanoid manipulator.

Σ*0:* Reference coordinate. Σ*1:* shoulder yaw coordinate. Σ*2:* shoulder roll coordinate. Σ*3:* shoulder pitch coordinate. Σ*4:* elbow pitch coordinate. Σ*5:* wrist pitch coordinate. Σ*6:* wrist roll coordinate. Σ*h:* End-effector coordinate (at the end of the middle finger).

| Axis | Humanoid manipulator (deg) | Human (deg) |
| --- | --- | --- |
| Neck (roll and pitch) | -90 ~ 90 | -90 ~ 90 |
| Shoulder (pitch) right & left | -180 ~ 120 | -180 ~ 120 |
| Shoulder (roll) right/left | -135 ~ 30 / -30 ~ 135 | -135 ~ 30 / -30 ~ 135 |
| Shoulder (yaw) right/left | -90 ~ 90 / -90 ~ 90 | -90 ~ 90 / -90 ~ 90 |
| Elbow (roll) right/left | 0 ~ 135 / 0 ~ -135 | 0 ~ 135 / 0 ~ -135 |
| Wrist (pitch) right/left | -30 ~ 60 | -30 ~ 60 |
| Wrist (yaw) right/left | -90 ~ 60 | -90 ~ 60 |
| Hip (pitch) right & left | -130 ~ 45 | -130 ~ 45 |
| Hip (roll) right/left | -90 ~ 22 / -22 ~ 90 | -60 ~ 45 / -45 ~ 60 |
| Hip (yaw) right/left | -90 ~ 22 / -22 ~ 90 | -60 ~ 45 / -45 ~ 60 |
| Knee (pitch) right & left | -20 ~ 150 | 0 ~ 150 |
| Ankle (pitch) right & left | -90 ~ 60 | -30 ~ 90 |
| Ankle (roll) right/left | -20 ~ 90 / -90 ~ 20 | -20 ~ 30 / -30 ~ 20 |
| Waist (yaw) | -90 ~ 90 | -45 ~ 45 |

Table 1. Comparison of joint rotation ranges between the humanoid manipulator and a human.

The key to understanding RTRL is to appreciate what this factor expresses. It is essentially a measure of the sensitivity of the output of unit *k* at time *t* to a small change in the value of *wij*, taking into account the effect of such a change over the entire network trajectory from *t0* to *t*. Note that *wij* need not be connected to unit *k*; the algorithm is thus non-local, in that we need to consider the effect of a change at one place in the network on the values computed at an entirely different place. Make sure you understand this before diving into the derivation given next (Baruch, 1999).
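For concreteness, a minimal sketch of how these sensitivities can be propagated forward in time alongside the network (the standard RTRL recursion, assuming the tanh network of eqs. (2)-(4); the names `rtrl_step` and `P` are illustrative, not the chapter's notation):

```python
import numpy as np

def rtrl_step(W, x, y, P, d=None, mu=0.1):
    """One RTRL step for the network y(t+1) = tanh(W [x; y]).

    P holds the sensitivities P[k, i, j] = dy_k/dw_ij (the factor in eq. (11))
    for the whole trajectory so far, with shape (n, n, m + n).
    """
    n = W.shape[0]
    z = np.concatenate([x, y])
    y_new = np.tanh(W @ z)
    fprime = 1 - y_new ** 2
    Wrec = W[:, -n:]  # recurrent columns of W
    # Sensitivity recursion: p_kij(t+1) = f'_k [ sum_l w_kl p_lij(t) + delta_ik z_j(t) ]
    P_new = np.einsum('kl,lij->kij', Wrec, P)
    P_new[np.arange(n), np.arange(n), :] += z   # the delta_ik z_j term
    P_new *= fprime[:, None, None]
    if d is not None:  # a target exists at this time step
        e = d - y_new
        W = W + mu * np.einsum('k,kij->ij', e, P_new)  # gradient step via eq. (10)
    return W, y_new, P_new
```

Note that `P` couples every weight to every unit, which is exactly the non-locality described above: the update for *wij* uses the sensitivities of all outputs *yk*, whether or not *wij* feeds unit *k*.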
