**3. Reinforcement learning**

Taking this extract from [19]:

*"Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond".*

As can be read from the extract, desired behaviours will prevail over undesired ones. This is due to the reward, or feeling of satisfaction, and this is precisely part of what CODA aims to imitate in a computational environment. Since no machine can feel any kind of emotion, the reward will be represented numerically with a function.

**Figure 2.** Block diagram showing the cognition model proposed by the authors with the CODA algorithm.

**Figure 2** shows a block diagram of the learning by demonstration iterative process, taken and modified from [11]. The diagram shows the process of learning a task with the proposed approach and the CODA algorithm, where the policy will be represented by a Q matrix and adjusted directly according to the reward obtained during each try. The consequences of this adjustment in the policy are measured by the reward function to guide future adjustments in the policy. The reward function does not actually tell the algorithm whether the output is correct or not; instead it tells how correct it is, and the stored values of the Q matrix should serve as a repertoire of knowledge in order to recognise certain "*encountered data–output action*" pairs for future reference.
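As a rough illustration of these two points, the sketch below (in Python) stores the policy as a Q matrix over discretised states and actions and uses a graded, similarity-based reward rather than a correct/incorrect flag. The sizes of the spaces and the exact form of the reward are assumptions made for the example, not values taken from the text.

```python
import numpy as np

# Hypothetical sizes for the discretised spaces; the real values depend on
# how the sensor readings and actuator commands are binned.
N_STATES = 64     # discretised "encountered data" (sensor) states
N_ACTIONS = 8     # discretised output actions

# The policy is stored as a Q matrix: one value per state-action pair.
Q = np.zeros((N_STATES, N_ACTIONS))

def reward(produced, desired):
    """Graded reward: not a correct/incorrect flag, but a measure of how
    close the produced data is to the desired (demonstrated) data."""
    error = np.linalg.norm(np.asarray(produced) - np.asarray(desired))
    return 1.0 / (1.0 + error)   # 1.0 for a perfect match, smaller otherwise
```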


The interaction between the entities begins with the human–machine interaction, in which the human expert performs the task or gives an example of it so that the system can record the information as a training example. In this manner, the expert gives direct feedback on how the skill is being modelled. The CODA algorithm will then carry out its own evaluation and correction process in order to acquire the knowledge and reproduce the task.

During this process, a reinforcement learning algorithm is the most important tool. SARSA, which is a modification of Q-learning, was chosen. The main difference between the two algorithms is that Q-learning always attempts to search for the optimal path, which tends to coincide with the shortest one, but the shortest path is not necessarily the best one for every task. In contrast, the SARSA algorithm will not always take the shortest path but the safest one, because it tries to avoid large negative rewards by taking those "dangerous" situations into consideration.

Mathematically, Q-learning and SARSA differ in their expressions as the following two equations state:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \mu \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right],\tag{1}$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \mu \left[ r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right],\tag{2}$$

Eq. (1) shows the Q-learning update equation for the Q matrix and Eq. (2) shows that of the SARSA algorithm. It can be inferred that, mathematically, the main difference between them is that Q-learning is off-policy, since it takes the maximum value of *Q(s′, a′)* instead of computing the next action *a′*. SARSA, on the other hand, achieves its on-policy behaviour through a simple modification: it takes into account not only the current state and action but also the next state, action and reward that will actually be used, as explained previously.
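The two update rules translate almost directly into code. The following sketch assumes the Q matrix is stored as a NumPy array; the learning rate μ and discount factor γ are placeholder values, since the text does not report the ones used. The only difference between the two functions is the bootstrap term, which is exactly the off-policy/on-policy distinction discussed above.

```python
import numpy as np

GAMMA = 0.9   # discount factor (assumed value)
MU = 0.1      # learning rate, the mu of Eqs. (1) and (2) (assumed value)

def q_learning_update(Q, s, a, r_next, s_next):
    """Eq. (1): off-policy update, bootstraps on the greedy action max over a'."""
    target = r_next + GAMMA * np.max(Q[s_next])
    Q[s, a] += MU * (target - Q[s, a])

def sarsa_update(Q, s, a, r_next, s_next, a_next):
    """Eq. (2): on-policy update, bootstraps on the action a' actually taken."""
    target = r_next + GAMMA * Q[s_next, a_next]
    Q[s, a] += MU * (target - Q[s, a])
```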

Since the inputs and outputs of the reinforcement learning algorithm are discretised, it is always a good idea to reduce the size of the action space and the state space whenever possible, because the reinforcement learner is basically a search method. It is therefore worth looking for techniques that allow these spaces to be handled in a proper manner; how this is approached depends on the application and on the expertise of the reader.
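As one possible illustration only (the text deliberately leaves the choice to the application), a simple way to shrink a continuous sensor or actuator range is uniform binning; the ranges and bin counts below are invented for the example.

```python
import numpy as np

def discretise(value, low, high, n_bins):
    """Map a continuous reading onto one of n_bins indices (uniform binning)."""
    value = float(np.clip(value, low, high))
    idx = int((value - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)   # keep value == high inside the last bin

# Example: a tactile sensor reading in the range 0-5 V collapsed into 4 levels,
# so many raw readings share one state and the search space shrinks.
state_bin = discretise(3.2, low=0.0, high=5.0, n_bins=4)   # -> bin 2 of {0, 1, 2, 3}
```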

**Figure 3** describes how the algorithms search for the goal state given a state space, an action space and a reward function. It also shows how the learning agent, in our case the robotic hand, performs an action in state $s_t$ and receives a reward $r_{t+1}$ from the environment; after this, the agent moves to state $s_{t+1}$ as a consequence of the action taken. Basically, the sensors on the hand define the state, because they are the representation of the environment around the robot, telling us whether the hand is touching any surface or not. The actions are how the robot moves its actuators. The reward could be the similarity between the learnt (desired) data and the data produced in the present state: the higher the similarity, the larger the reward, and the lower the similarity, the smaller the reward.
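Putting the pieces together, a minimal sketch of the Figure 3 cycle could look as follows. The `env` object is a hypothetical wrapper around the robotic hand and its sensors, exposing `reset()` and `step(action)`; the exploration rate, learning rate and discount factor are assumed values, and the reward is assumed to come from the similarity measure described above.

```python
import numpy as np

GAMMA, MU, EPSILON = 0.9, 0.1, 0.1   # discount, learning rate, exploration (assumed values)

def epsilon_greedy(Q, s):
    """Mostly greedy action selection, with occasional random exploration."""
    if np.random.rand() < EPSILON:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def run_episode(Q, env, max_steps=100):
    """One pass of the Figure 3 cycle: act in s_t, receive r_{t+1}, move to s_{t+1}.
    `env` is a hypothetical wrapper around the robotic hand exposing
    reset() and step(action) -> (next_state, reward, done)."""
    s = env.reset()
    a = epsilon_greedy(Q, s)
    for _ in range(max_steps):
        s_next, r_next, done = env.step(a)   # reward assumed to measure similarity
        a_next = epsilon_greedy(Q, s_next)
        # SARSA update of Eq. (2), using the action that will actually be taken
        Q[s, a] += MU * (r_next + GAMMA * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next
        if done:
            break
```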

**Figure 3.** Reinforcement learning cycle: the agent performs an action, receives a reward and ends up in a new state.
