Section 2 Applied Intelligence

#### **Chapter 5**

## Velocity Planning via Model-Based Reinforcement Learning: Demonstrating Results on PILCO for One-Dimensional Linear Motion with Bounded Acceleration

*Hsuan-Cheng Liao, Han-Jung Chou and Jing-Sin Liu*

#### **Abstract**

The time-optimal control problem (TOCP) faces new practical challenges, such as those arising from the deployment of agile autonomous vehicles in diverse, uncertain operating conditions without accurate system calibration. In this study, to meet the need for generating feasible speed profiles in the face of uncertainty, we exploit and implement probabilistic inference for learning control (PILCO), an existing sample-efficient model-based reinforcement learning (MBRL) framework for policy search, in a case study of the TOCP for a vehicle modeled as an input-constrained double integrator with uncertain inertia subject to uncertain viscous friction. Our approach integrates learning, planning, and control to construct a generalizable method that requires minimal assumptions (especially regarding external disturbances and the parametric dynamics model of the system) for solving the TOCP approximately, as perturbed solutions close to time-optimality. Within PILCO, a policy parametrized by Gaussian radial basis functions is implemented to generate control-constrained, rest-to-rest, near time-optimal vehicle motion on a linear track from scratch in a direct and data-efficient way. We briefly introduce important applications of PILCO and show in the learning results that PILCO converges toward the analytical solution of this TOCP. Furthermore, we conduct a simulation and a sim2real experiment to validate the suitability of PILCO for the TOCP by comparison with the analytical solution.

**Keywords:** model-based reinforcement learning (MBRL), applied reinforcement learning, time-optimal control problem (TOCP), velocity learning, vehicle control

#### **1. Introduction**

Optimal control–based approaches have played key roles in trajectory planning and replanning and in the optimization of control inputs, with numerous applications such as robotics and autonomous driving, and they have recently been used in autonomous systems. Optimal control formulations typically require accurate knowledge of the dynamics and can account for a more general set of constraints and objectives (performance measures) [1, 2] than other approaches. Many cost functions, such as those for time and energy, are used to define the desired behavior of the controlled system. As part of the effort to enhance working efficiency and productivity by completing tasks as fast as possible, especially tasks involving repetitive state-to-state transfer in trajectory execution, the time-optimal control problem (TOCP), in which the objective function is the terminal time, has been extensively studied. For the TOCP, control bounds, different boundary conditions or paths with different curvature profiles and lengths, and various choices of physical parameters such as mass and friction produce different velocity solutions, and consequently different maximum velocities and travel times. The TOCP was first studied on articulated robot manipulators for industrial and aerospace applications [3–7].

In recent years, the deployment of agile autonomous systems such as autonomous driving vehicles [8, 9], mobile robots [1, 10, 11], and new robot platforms such as humanoid robots [12] and unmanned aerial vehicles (UAVs) has posed new, increasingly important challenges to the TOCP. Furthermore, a crucial aspect of traditional TOCP solutions is that the analytical or learned time-optimal velocity solutions are both platform and path (task) dependent, and therefore model-based and goal-directed. The challenges lie in several aspects, including nonlinear, nonconvex, multi-dimensional state and control spaces, various platform-dependent constraints, and, most importantly, environmental uncertainties in real-world problems that make any model only an approximation of reality. Computational solutions to the TOCP based on an inaccurate model are therefore unreliable and impractical for online applications. By contrast, learning-based control algorithms for dynamical systems learn to generate the desired system behavior without any complicated system formalism or predefined controller parameters a priori, and thereby achieve greater generalization and platform independence. One promising approach in the context of intelligent planning and control is reinforcement learning (RL) [13, 14] for learned behaviors, which can be viewed as a class of optimal control resolutions. The effectiveness and performance of RL are task-instance and platform specific, i.e., a function of the transition and reward functions induced by the evaluated policy and the system dynamics.
Therefore, to address the practical challenges of applying RL algorithms to a wide range of real-world decision-making problems, such as autonomous vehicles in diverse driving scenarios, it is generally believed that only concrete, application-specific case studies can demonstrate the related issues and show which algorithm works well for a specific task instance [15].

With this aim, in this paper the learning goal is to recover a near time-optimal rest-to-rest one-dimensional linear motion of a double integrator with embedded uncertainties (friction of motion) and a constant control constraint. We assume no prior knowledge of any parametric system dynamics model for deriving the optimality conditions. In addition, the characteristics of the learning task are that the vehicle mass is uncertain and the environment characteristics, such as friction, are unknown; both parameters affect the maximum speed at each position along the path. It is worth mentioning that any single-input controllable second-order system is feedback equivalent to a double integrator. Thus, we use a simplified but still general and sufficiently precise vehicle model, the damped double integrator, as the foundation and demonstration of the TOCP for more complicated, high-dimensional nonlinear vehicle models. The analytical solution of the TOCP for a double integrator subject to a constant

*Velocity Planning via Model-Based Reinforcement Learning: Demonstrating Results on PILCO… DOI: http://dx.doi.org/10.5772/intechopen.103690*

acceleration bound is known; it is obtained as the solution of a second-order ODE under the assumption of bang-bang control.
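This analytical baseline can be sketched numerically. The snippet below is a minimal illustration of our own (not the chapter's code, and the function names are ours): for the frictionless double integrator $\ddot{x} = u$, $|u| \le a_{\max}$, moving a distance $d$ rest-to-rest, the bang-bang solution accelerates at $+a_{\max}$ for half the time and decelerates at $-a_{\max}$ for the other half.

```python
import math

def min_time_double_integrator(d, a_max):
    """Rest-to-rest minimum time for x'' = u, |u| <= a_max over distance d:
    full acceleration for half the time, full braking for the rest."""
    return 2.0 * math.sqrt(d / a_max)

def bang_bang_state(t, d, a_max):
    """Position and velocity at time t along the time-optimal trajectory."""
    T = min_time_double_integrator(d, a_max)
    ts = T / 2.0                      # switching time (midpoint of the track)
    if t <= ts:
        return 0.5 * a_max * t ** 2, a_max * t
    tau = t - ts
    x_s, v_s = 0.5 * a_max * ts ** 2, a_max * ts
    return x_s + v_s * tau - 0.5 * a_max * tau ** 2, v_s - a_max * tau
```

For $d = 2$ and $a_{\max} = 1$, this gives $T = 2\sqrt{2}$, with the switch occurring at the halfway point of the track.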

Within the scope of this paper, our MBRL-based approach to recovering the time-optimal motion of a double integrator features a two-stage process integrating learning, planning, and control. In the first stage, we employ a model-based reinforcement learning (MBRL) framework, probabilistic inference for learning control (PILCO) [16], to generate a control-constrained rest-to-rest near time-optimal motion from scratch. The analytical solution is used as a baseline for effectively assessing the learning results. Because Monte-Carlo updates of the parametrized policy make it difficult to incorporate velocity limits within an instance of locomotion, in the second stage we apply rescaling to produce a speed profile that respects the additional velocity limit.

The outcome is a time-optimal velocity profile under both velocity and control constraints, with no prior knowledge or replanning required. Both simulation and sim-to-real experiments are conducted and confirm that our approach is applicable.

Our main contributions include:

- the application of PILCO, a sample-efficient MBRL framework, to the TOCP for an input-constrained double integrator with uncertain mass and friction, requiring no parametric dynamics model a priori;
- a comparison of the learned velocity profiles against the analytical time-optimal solution as a verification baseline;
- a rescaling stage that adapts the learned profile to an additional velocity limit; and
- validation through both simulation and a sim2real experiment on a low-cost car.

The remainder of the article is structured as follows. In Section 2, we summarize approaches to the TOCP, specifically conventional and RL approaches. In Section 3, we outline the key elements of the PILCO algorithm: dynamics modeling, trajectory prediction, policy evaluation, and policy improvement. In Section 4, we present simulation results on time-optimal velocity learning for an autonomous vehicle with double integrator dynamics, whose analytical solution to the TOCP is derived in Appendix A as the verification baseline. Section 5 provides a sim2real experimental validation on a low-cost car, along with discussions. Finally, Section 6 provides conclusions.

#### **2. Related work**

The aim of time-optimal vehicle control is to control a vehicle such that it reaches a target state as quickly as possible (e.g., in racing or emergencies). The minimal-time velocity profile along a prespecified curve, a subclass of the TOCP subject to hard control constraints resulting from input saturation, state constraints, and external disturbances, is nowadays applied to a variety of modern autonomous systems such as autonomous driving, UAVs, and robotics. The existence of time-optimal trajectories is guaranteed by the Pontryagin Maximum Principle (PMP) [20] for a vehicle with an explicit dynamics model in state-space form [21]. The derivations of the optimal control and state trajectories are generally computationally expensive; computationally cheaper yet accurate methods are required, especially considering the need for rapid (even real-time) computation in most real-world industrial or engineering systems in response to changes in operating conditions and the environment. In this section, we briefly present the most common numerical approaches for systems with known dynamics models developed in robotics and autonomous vehicles; we then discuss RL-based approaches that handle uncertainty.

#### **2.1 Approaches to the TOCP with known dynamics**

Solutions to the TOCP can be categorized as complete or decoupled approaches. In the complete approach, which aims to solve challenging problems with general vehicle dynamics and constraints in their entirety, the optimal state and input trajectories are determined simultaneously; direct and indirect transcription methods have been developed within this approach for trajectory optimization [2, 22, 23] and play an important role in the numerical performance of the resulting trajectories. By contrast, in the decoupled approach (e.g., the path-velocity decomposition approach and the path-constrained trajectory generation approach), trajectory generation is decomposed into two subproblems so that the path geometry is decoupled from the velocity along the geometric path. The first step is to plan a geometric path connecting two states (configurations or poses) in adherence to geometric constraints, such as obstacle avoidance or smoothness requirements; the second step is to design a time-scaling function (representing either timing information or the velocity and acceleration) along the planned state-to-state-transfer geometric path. This approach results in a one-parameter family of velocity profiles, i.e., a parametrization of the vehicle-position-dependent velocity along the path as a function of a single path parameter, the arc length. The velocity and acceleration of the vehicle at each position on the path can be altered by the design of the time-scaling function, respecting smoothness requirements (such as small jerk), fixed boundary conditions (such as precisely specified initial and target positions and velocities), and the kinodynamic constraints. A fair amount of literature addresses maximizing the speed along the path under acceleration, torque, and jerk (or torque/acceleration derivative) constraints [3, 8, 10, 11].
In general, a model predictive control (MPC) framework can be used in the decoupled approach to generate a safe velocity profile and the input commands for following a given planned geometric path in terms of known system dynamics [24]. The following three methods are commonly used for the TOCP.

#### *2.1.1 Hamilton-Jacobi-Bellman (HJB) equation*

A popular approach to obtaining time-optimal motion for a system with a known dynamics model and fixed boundary conditions under the safety and kinodynamic constraints of a vehicle is via optimal control or model predictive control formulations. This approach requires the derivation of optimality conditions for the state trajectories and control policies based on the PMP or the Dynamic Programming Principle (DPP) [20]. This yields a two-point boundary value problem for the HJB partial differential equation, with an initial condition on the state and a final condition on the costate, for time-optimization of trajectories. The advantage of generality, whereby more general state and input constraints and objective functions can be taken into account, comes at the

*Velocity Planning via Model-Based Reinforcement Learning: Demonstrating Results on PILCO… DOI: http://dx.doi.org/10.5772/intechopen.103690*

cost of a heavy numerical burden. For example, time-optimality can be traded off against energy to yield less aggressive control that steers the vehicle more slowly but more smoothly. Additionally, the HJB equation approach is practically useful in that many numerical solvers for HJB equations are available.

#### *2.1.2 Convex optimization (CO)*

The Hamiltonian of the TOCP for a robotic manipulator is convex with respect to the control input. The TOCP is transformed into a convex optimization problem with a single state through a nonlinear change of variables [4], where the acceleration and velocity at discretized locations on the path are the optimization variables. This work was further extended in [5] to meet speed-dependent requirements. The approach is simple and robust thanks to existing convex optimization libraries, yet only convex objective functions can be considered. Moreover, the convex optimization program contains a large number of variables and inequality constraints, making it slow and less suitable for real-time applications.

#### *2.1.3 Numerical integration (NI)*

Since the vehicle velocity strongly depends on the path to be followed, the universally applicable decoupled approach splits the motion planning problem into two subproblems, finding a geometric path and planning the velocity at each position of the vehicle along the path, to manage the computational complexity of generating a suboptimal motion trajectory. This results in a one-parameter family of velocity profiles: the velocity (bound) along the path, depending on the vehicle position on the path, is parametrized as a function of a single path parameter (a scalar curvilinear abscissa coordinate) $s$, usually the arc length. By describing the dynamics and constraints along the path to be followed on the $(s, \dot{s})$ phase plane, this method generates the velocity limit curve on the phase plane from the velocity and acceleration bounds. The travel time is determined by the path velocity $\dot{s}$ along the path, or by the time-scaling function $s(t)$, which is solved by optimization tools to meet the imposed constraints. The generation of the minimum-time velocity profile along the given path is thereby greatly simplified to the determination of the switching structure in the phase plane [12]. Essentially, NI searches for switching points on the phase plane and establishes the velocity profile by integrating forward with the acceleration limits and backward with the deceleration limits, pivoting at those points.
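The forward/backward integration idea can be sketched as follows. This is our own simplified illustration for a straight track with symmetric acceleration bounds and a generic velocity limit curve, not a general NI implementation with curvature-dependent limits or switching-point search; the function name is hypothetical.

```python
import numpy as np

def ni_velocity_profile(L, a_max, v_limit, n=1000):
    """NI-style profile on the (s, s_dot) phase plane for a straight track:
    forward-integrate from rest with the acceleration limit (v dv/ds = a_max),
    backward-integrate from rest at the goal with the deceleration limit,
    then take the pointwise minimum with the velocity limit curve."""
    s = np.linspace(0.0, L, n)
    ds = s[1] - s[0]
    vf = np.zeros(n)                      # forward pass
    for i in range(1, n):
        vf[i] = min(np.sqrt(vf[i - 1] ** 2 + 2.0 * a_max * ds), v_limit(s[i]))
    vb = np.zeros(n)                      # backward pass
    for i in range(n - 2, -1, -1):
        vb[i] = min(np.sqrt(vb[i + 1] ** 2 + 2.0 * a_max * ds), v_limit(s[i]))
    return s, np.minimum(vf, vb)
```

With a non-binding velocity limit, the resulting profile is the triangular bang-bang profile whose peak speed $\sqrt{a_{\max} L}$ occurs at the midpoint, matching the analytical solution.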

#### **2.2 Reinforcement learning (RL)**

RL refers to the learning of a policy, defined as a mapping from the state space to the action space, by maximizing a reward the agent receives from the environment it operates in and interacts with. When the system dynamics are unknown, with the dynamical system modeled as a reward-maximizing RL agent and the desired behavior expressed as a reward function, the system can be trained to automatically execute an optimal sequence of actions (trajectories) under the present environmental conditions to complete a given mission. The RL framework reformulates the problem as a Markov decision process for the autonomous agent, which maximizes the long-term reward and does not necessarily need the transition dynamics beforehand. RL offers a diverse set of model-based and model-free algorithms to improve the performance of

the RL agent based on the reward it receives. The mission may involve integrated planning and control with time and computational resource budgets in real-time applications; a system model is therefore a good informative basis for predicting behaviors and enhancing performance, instead of tuning behaviors frequently and manually in practical situations. However, it is often infeasible to derive a precise analytical model that is provably correct for the actual system within the regions and time horizon in which the model is valid, since parametric uncertainties, unmodeled dynamics, and external disturbances in perception and agent–environment interaction are inevitable. To ameliorate the problems arising from these interwoven factors that affect planning and control performance, researchers in modeling and control have become increasingly interested in model building or model learning in a data-driven setting based on nonparametric and probabilistic models [25, 26] as a means to enhance the performance of the underlying controlled system. Model-based RL (MBRL), complementary to control approaches such as robust and adaptive control, model predictive control, and fuzzy control, is an attractive intelligent model-based control approach that integrates learning, planning, and control: it learns a dynamics model, and the derived characteristics of the learned model are then exploited for generating trajectories and learning the policy. While model-free RL attracts the most scientific interest, in MBRL algorithms that employ derived or learned system dynamics models, the state evolution calculated by a predictive model under a given policy can be used during policy evaluation to estimate the impact of the policy on the reward.
Therefore, MBRL is more data-efficient in the context of diverse goal-directed planning tasks, since fewer interactions between the agent and the environment are required to learn a good policy quickly, in contrast to some model-free approaches. A surging number of studies have demonstrated that learning, planning, and control of autonomous systems such as robots and self-driving vehicles can be cast as MBRL tasks, owing to the use of an accurate, reliable learned model of the agent–environment interactions as an internal simulation during task execution and as the basis for optimization and real-time control to achieve highly effective control performance [13, 14, 27]. Despite its faster convergence compared with model-free frameworks, MBRL suffers from model bias and accumulated model prediction errors, which greatly affect control policy learning and rewards through the model-generated trajectory characteristics. To improve model learning performance, and thus policy learning performance (the controller parameters are learned in light of the currently learned model), system transition modeling or model fitting techniques have been developed, ranging from deterministic methods, such as physics-based (first-principles) formulations, to stochastic methods [15]. Among these are nonparametric regression models such as the Gaussian process (GP), which extracts information from the sampled data with high data efficiency to make accurate predictions in PILCO [16]. In contrast to other probabilistic models that maintain a distribution over random variables, a GP builds one over the underlying functions that generate the data. It therefore makes no prior assumption on the function mapping current states and actions to future states.
The flexibility and expressiveness that a GP offers in refining the uncertainty estimate make it an effective approximator for modeling unknown system dynamics (the transition function from input data to output observations or measurements) that continuously evolve over time (i.e., trajectories), and it is employed in this study. Some popular GP implementations are tabulated in [28] for practitioners, and the recent open-source MBRL-Lib has been released to reduce the coding burden [29].


#### **3. MBRL for time-optimal vehicle motion**

The imperfect modeling of system dynamics and the perception of the environment under significant noise make machine learning a viable approach to practical, near minimum-time velocity planning for autonomous systems without relying on heavy dynamics-model-based computations using identification techniques. To find the time-optimal control policy for a vehicle dynamics model with uncertainties along a predefined path, in the RL setting the action (sequence) to be applied at each possible state confined to the path is learned from limited trial driving experiences of the uncertain system (i.e., the system dynamics along the path), with the received rewards as feedback, aiming for higher trajectory reward, to determine the next observed state. There are a number of possible choices of RL algorithm. An earlier work [17] used Q-learning for car-like vehicle motion planning. Another study [18] analyzed transfer learning of obstacle avoidance behaviors in similar environments with similar obstacle patterns, where the state of the environment is represented by the obstacle pattern. Recently, a model-free actor-critic RL algorithm was applied to time-optimal velocity planning along an arbitrary path in [19]; that study demonstrated that incorporating velocity computation through the exploitation of a vehicle dynamics model is practically feasible for improving learning outcomes. These works, among others, show that learning vehicle driving skills (encoded as sets of trajectories) via RL is promising for real autonomous driving. Along this line of work, the learning task demonstrated in the present study is a one-dimensional vehicle maneuvering task.
A data-driven state feedback control approach was designed through learning of the dynamics model; in this control scheme, the optimal time-scaling function (i.e., one that makes the vehicle reach the target as quickly as possible) is recovered or approximated through a set of sampled trajectories of the unknown vehicle model under the physical constraints imposed by the vehicle. The simulated vehicle is represented by a damped double integrator, whose solution to the TOCP is known if no uncertainties exist (see Appendix A); for a completely known double integrator with given boundary states and bounded acceleration, the numerical solution is obtained in, e.g., [23] as trajectory optimization via barrier functions. We exploited and implemented PILCO [16], a data-efficient MBRL framework, in a simulation and then in a real-world experiment involving a toy car. PILCO has achieved high performance in benchmark tasks involving low-dimensional state spaces, such as inverted pendulum control and cart-pole swing-up; specifically, it has demonstrated unprecedented performance in modeling uncertain system dynamics and optimizing a control policy accordingly. The paper [30] contains an accessible introduction to PILCO and some of its extensions and modifications. As summarized in Algorithm 1, PILCO employs a nonparametric GP to learn the unknown dynamics and the corresponding uncertainty estimates in a probabilistic dynamics model. PILCO finds the optimal policy parameters that minimize the expected episodic trajectory cost based on the learned probabilistic model. The core elements of the PILCO framework, namely dynamics modeling, trajectory prediction, policy evaluation, and policy optimization, are briefly described in this section.

#### **Algorithm 1: PILCO** [16]

1: *Define* parametrized policy $\pi: x_t \times \theta \to u_t$
2: *Initialize* policy parameters $\theta \sim \mathcal{N}(0, I)$ randomly
3: *Execute* actual system and record initial data
4: **repeat**
5: &emsp;*Learn* system dynamics model with GP
6: &emsp;**repeat**
7: &emsp;&emsp;*Predict* system trajectories
8: &emsp;&emsp;*Evaluate* policy: $J^{\pi}(\theta) = \sum_{t=0}^{T} \gamma^t E_x[\text{cost}(x_t)|\theta]$
9: &emsp;&emsp;*Policy improvement*: update parameter $\theta$ by analytic gradient $dJ^{\pi}(\theta)/d\theta$
&emsp;**until** convergence to $\theta^*$, obtaining $\pi^* = \pi(\theta^*)$
10: *Execute* $\pi^*$ on actual system and record data
11: **until** task completed
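The structure of Algorithm 1 can be illustrated with a deliberately simplified, runnable toy of our own: a linear least-squares delta-state model stands in for the GP, and a finite-difference search with a crude line search stands in for PILCO's analytic gradient. All constants (friction 0.3, time step, gains) are assumed values for illustration only, not the chapter's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, a_max, goal = 0.05, 1.0, 1.0

def true_step(x, u):
    # dynamics hidden from the learner: damped double integrator,
    # viscous friction coefficient 0.3 (an assumed value)
    p, v = x
    a = np.clip(u, -a_max, a_max) - 0.3 * v
    return np.array([p + v * dt, v + a * dt])

def policy(x, th):
    # linear state feedback toward the goal, saturated at the control bound
    return np.clip(th[0] * (goal - x[0]) - th[1] * x[1], -a_max, a_max)

def rollout(step, th, T=80):
    # episodic cost of a deterministic rollout under a saturating per-step cost
    x, J = np.zeros(2), 0.0
    for _ in range(T):
        x = step(x, policy(x, th))
        J += 1.0 - np.exp(-0.5 * ((x[0] - goal) / 0.25) ** 2)
    return J

# 1) interact with the real system under random controls, record transitions
X, Y, x = [], [], np.zeros(2)
for _ in range(200):
    u = rng.uniform(-a_max, a_max)
    xn = true_step(x, u)
    X.append([x[0], x[1], u]); Y.append(xn - x)
    x = xn if abs(xn[0]) < 5.0 else np.zeros(2)

# 2) learn a one-step delta-state model (least-squares stand-in for the GP)
A, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
model_step = lambda x, u: x + np.array([x[0], x[1], u]) @ A

# 3) improve the policy on the learned model (finite differences stand in
#    for the analytic gradient), accepting only cost-decreasing steps
th0 = np.array([1.0, 1.0])
th = th0.copy()
for _ in range(30):
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = 1e-3
        g[i] = (rollout(model_step, th + e) - rollout(model_step, th - e)) / 2e-3
    for lr in (0.1, 0.03, 0.01, 0.003):
        cand = th - lr * g
        if rollout(model_step, cand) < rollout(model_step, th):
            th = cand
            break
```

The point of the sketch is the loop structure of steps 3–10: real-system data feeds the model, and all policy evaluation and improvement happens on the learned model before the improved policy is executed again.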

#### **3.1 Dynamics modeling**

In the real world, model uncertainties and model errors are inevitable when modeling a dynamic system. Various methods have been formulated for modeling and learning unknown system dynamics [13, 25, 26]. In a data-driven setting, the system dynamics (instead of representations by differential or difference equations) or vector fields containing uncertainties, nonlinearities, and disturbances are represented by a set of trajectories obtained from the iterated performance of a mission. The model learning algorithm must therefore cope with the uncertainty and noise in the collected trajectory data. PILCO adopts GP probabilistic modeling and inference to learn the transition dynamics of a real-world agent, representing the true system dynamics by a probability distribution over a space of transition functions for planning (computing the desired state and control trajectories) and control learning. PILCO thus effectively handles uncertainties and reduces the effect of model errors or simplifications relative to system dynamics derived through nontrivial mathematical and physical equations; in this respect, PILCO has eliminated a common drawback of model-based frameworks to some extent. Consider an unknown system described by

$$\boldsymbol{x}\_{t+1} = \boldsymbol{f}(\boldsymbol{x}\_t, \boldsymbol{u}\_t) \text{ with } \boldsymbol{x}\_t \in \boldsymbol{R}^D, \boldsymbol{u}\_t \in \boldsymbol{R}^F \tag{1}$$

In PILCO, a GP is used to model the unknown transition function (1). The training inputs are data in the form of state–action pairs $(x, u)$ generated by the unknown transition function:

$$
\tilde{\mathbf{x}}\_t = \begin{bmatrix} \mathbf{x}\_t \\ \mathbf{u}\_t \end{bmatrix} \in \mathbf{R}^{D+F} \tag{2}
$$

where $u_t = \pi(x_t, \theta)$, with $\theta$ the policy parameters, depends on a policy $\pi: R^D \to R^F$ mapping the perceived state to an action. The training target for model learning is chosen as the delta state (the difference between consecutive states), predicting the difference between the current and next state given the action:

*Velocity Planning via Model-Based Reinforcement Learning: Demonstrating Results on PILCO… DOI: http://dx.doi.org/10.5772/intechopen.103690*

$$
\Delta\_t = \mathbf{x}\_{t+1} - \mathbf{x}\_t \in \mathbf{R}^D \tag{3}
$$

In this paper, the mean function and covariance function of the GP prior on $f$ are chosen to be zero and the squared exponential kernel, respectively,

$$k\left(\tilde{\mathbf{x}}\_i, \tilde{\mathbf{x}}\_j\right) = \sigma\_f^2 \exp\left(-\frac{1}{2} \left(\tilde{\mathbf{x}}\_i - \tilde{\mathbf{x}}\_j\right)^T \boldsymbol{\Lambda}^{-1} \left(\tilde{\mathbf{x}}\_i - \tilde{\mathbf{x}}\_j\right)\right) \tag{4}$$

where the signal variance $\sigma_f^2$ and $\Lambda := \mathrm{diag}\left(l_1^2, l_2^2, \ldots, l_{D+F}^2\right)$, which depends on the length scales $l_i$, are the hyperparameters. Given $n$ training samples $\tilde{X} := [\tilde{x}_1, \ldots, \tilde{x}_n]$ and $y := [\Delta_1, \ldots, \Delta_n]$, the GP hyperparameters are learned through evidence maximization, and the posterior describes a one-step prediction model of $x_{t+1}$ for state trajectory generation from $\Delta_t$ and $x_t$ as follows.

posterior state distribution

$$p(\mathbf{x}\_{t+1}|\mathbf{x}\_t, \mathbf{u}\_t) = \mathcal{N}(\mathbf{x}\_{t+1}|\boldsymbol{\mu}\_{t+1}, \boldsymbol{\Sigma}\_{t+1})\tag{5}$$

mean

$$
\mu\_{t+1} = \mathbf{x}\_t + E\_f[\Delta\_t] \tag{6}
$$

variance

$$\Sigma\_{t+1} = \mathrm{Var}\_f(\Delta\_t) \tag{7}$$

where *E <sup>f</sup>* Δ*<sup>t</sup>* ½ � and Var*f*ð Þ Δ*<sup>t</sup>* can be calculated from prior distribution. In practice, computationally tractable means and variances of GP are used as the most likely estimate for the training data and the confidence in the prediction, respectively, for further decision-making.

#### **3.2 Deterministic trajectory prediction**

For the subsequent step of policy evaluation, PILCO first predicts long-term system trajectories with the learned transition dynamics, given a policy. The distribution of the state $x_t$ at time $t$ is assumed to be Gaussian with mean $\mu_t$ and covariance $\Sigma_t$, $p(x_t) \sim \mathcal{N}(\mu_t, \Sigma_t)$. In order to predict the next state $x_{t+1}$, the distributions $p(\tilde{x}_t)$ and $p(u_t)$ are needed. This is done by assuming that $p(u_t)$ is Gaussian and by approximating the state–control distribution $p(\tilde{x}_t)$ by a Gaussian with the correct mean and variance. The mean and covariance of the predictive control distribution $p(u_t)$ are obtained by integrating out the state from $u_t = \pi(x_t, \theta)$:

$$p(u\_t) = \int p(u\_t|\mathbf{x}\_t) p(\mathbf{x}\_t) d\mathbf{x}\_t \tag{8}$$

The distribution of the change in state Δ*<sup>t</sup>*

$$p(\Delta\_t) = \iint p(f(\tilde{\mathbf{x}}\_t)|\tilde{\mathbf{x}}\_t) p(\tilde{\mathbf{x}}\_t) \, df \, d\tilde{\mathbf{x}}\_t \tag{9}$$

is subsequently approximated by a Gaussian distribution with mean $\mu_\Delta$ and variance $\Sigma_\Delta$, computed as the posterior mean and covariance of $p(\Delta_t)$ by moment matching or linearization [16]. The posterior state distribution in (5–7) can then be approximated by $p(x_{t+1}) \sim \mathcal{N}(\mu_{t+1}, \Sigma_{t+1})$ with

$$
\mu\_{t+1} = \mu\_t + \mu\_\Delta \tag{10}
$$

$$
\Sigma\_{t+1} = \Sigma\_t + \Sigma\_\Delta + \text{Cov}[\mathbf{x}\_t, \Delta\_t] + \text{Cov}[\Delta\_t, \mathbf{x}\_t] \tag{11}
$$

$$\text{Cov}[\mathbf{x}\_{t}, \Delta\_{t}] = \text{Cov}[\mathbf{x}\_{t}, u\_{t}] \Sigma\_{u}^{-1} \text{Cov}[u\_{t}, \Delta\_{t}] \tag{12}$$
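The linearization variant of this moment propagation can be sketched as follows. This is our own simplified illustration: it propagates a Gaussian through a single deterministic one-step map by linearizing around the mean, omitting the model's predictive variance $\Sigma_\Delta$ and the input and cross-covariance bookkeeping of (11)–(12) that full PILCO performs.

```python
import numpy as np

def propagate_linearized(mu, Sigma, f, eps=1e-5):
    """Propagate N(mu, Sigma) through a one-step map f by linearization:
    mu' = f(mu), Sigma' = J Sigma J^T, with the Jacobian J of f at mu
    estimated by central differences."""
    mu = np.asarray(mu, dtype=float)
    f_mu = np.asarray(f(mu), dtype=float)
    J = np.zeros((f_mu.size, mu.size))
    for i in range(mu.size):
        e = np.zeros(mu.size); e[i] = eps
        J[:, i] = (np.asarray(f(mu + e)) - np.asarray(f(mu - e))) / (2.0 * eps)
    return f_mu, J @ np.asarray(Sigma, dtype=float) @ J.T
```

For a linear map the propagation is exact; for a nonlinear learned model it is the first-order alternative to the exact moment matching cited in [16].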

#### **3.3 Policy evaluation (reward function)**

The policy is evaluated via the expected return. To this end, the predictive trajectories $\{p(x_t)\}$, $t = 0, 1, \ldots, T$, are retrieved, given a policy, to compute the expected cumulative cost (13, 14) for policy evaluation.

$$J(\theta) = \sum\_{t=0}^{T} \gamma^t E\_{\mathbf{x}}[\text{cost}(\mathbf{x}\_t) | \theta] \tag{13}$$

$$E\_{\mathbf{x}}[\text{cost}(\mathbf{x}\_{t})|\theta] = \int \text{cost}(\mathbf{x}\_{t}) \mathcal{N}(\mathbf{x}\_{t}|\boldsymbol{\mu}\_{t}, \boldsymbol{\Sigma}\_{t}) d\mathbf{x}\_{t} \tag{14}$$

where $\gamma$ is the discount factor, which determines the importance of future costs in the return and quantifies the time after which costs have less influence. Note that a large $\gamma$ causes the accumulated cost (13), evaluated at the end of the episode, to decay more slowly as the time index $t \to T$; a small $\gamma$ means the current cost matters more than future costs, which favors uniform convergence. Using the currently learned model for policy evaluation is key to the data-efficiency of PILCO.

**\*Remark**. Since the learning cost (13, 14) does not include a control regularization term (such as an $L_1$, $L_2$, or mixed $L_1$-$L_2$ norm regularization as an additional sparsity-inducing cost for (13, 14) [9]), it remains smooth and allows gradient computation for the model-based optimization described in the following subsection.

#### **3.4 Cost function design**

In general, quite a number of cost functions are possible as the reward of RL acting as the control for an uncertain system, yet the effectiveness of a specific RL algorithm ultimately depends on the application. The objective function can be multimodal to allow different skills to be learned. For example, a parametric cost function [31] can be used to switch from a quadratic cost to a time-optimal cost during learning to generate different feedback control schemes in response to different events. Kabzan et al. [32] used a progress-maximizing cost function, defined so that the learning agent drove as far as possible within every time step. For the TOCP we consider, an appropriate per-step cost at $\mathbf{x}\_t = \mathbf{x}(t)$ at sampling time $t$ is given in (15)

$$\text{cost}(\mathbf{x}(t)) = 1 - \exp\left(-\frac{1}{2\sigma\_c^2} \left\|\mathbf{x}(t) - \mathbf{x}\_{\text{target}}\right\|\_2^2\right) \in [0, 1] \tag{15}$$

*Velocity Planning via Model-Based Reinforcement Learning: Demonstrating Results on PILCO… DOI: http://dx.doi.org/10.5772/intechopen.103690*

where $\mathbf{x}\_{\text{target}}$ is the target state and the cost width can be tuned with $\sigma\_c$. It measures how fast the vehicle progresses on the track to reach the target (subject to tolerance, i.e., a neighborhood of the target state) in terms of an exponential function of the pairwise Euclidean distance within the episode horizon. The data-driven control (18, 19) for (15) minimizes the accumulated trajectory cost (13, 14). The expected cost depends on the state, and thus on the control parameter $\theta$ through the mean and covariance of $u\_t = \pi(\mathbf{x}\_t, \theta)$, which allows for analytic integration [16].
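For the saturating cost (15), the expectation in (14) under a Gaussian state distribution has a closed form, which is what makes the policy evaluation analytic. A Python sketch of this computation (function names are ours; the formula follows [16]):

```python
import math
import numpy as np

def expected_saturating_cost(mu, Sigma, target, sigma_c):
    """Closed-form E[1 - exp(-||x - x_target||^2 / (2 sigma_c^2))] for
    x ~ N(mu, Sigma): the analytic expectation (14) of the cost (15)."""
    d = mu.shape[0]
    W = np.eye(d) / sigma_c**2
    S1 = np.eye(d) + Sigma @ W
    diff = mu - target
    quad = diff @ (W @ np.linalg.solve(S1, diff))
    return 1.0 - math.exp(-0.5 * quad) / math.sqrt(np.linalg.det(S1))

def expected_return(means, covs, target, sigma_c, gamma=1.0):
    """Discounted accumulation (13) over a predicted Gaussian trajectory."""
    return sum(gamma**t * expected_saturating_cost(m, S, target, sigma_c)
               for t, (m, S) in enumerate(zip(means, covs)))
```

In one dimension this reduces to $1 - \frac{\sigma\_c}{\sqrt{\sigma\_c^2 + \Sigma}} \exp\!\big(-\frac{(\mu - x\_{\text{target}})^2}{2(\sigma\_c^2 + \Sigma)}\big)$, which can be checked by direct integration.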

#### **3.5 Policy optimization**

With the model uncertainty handled by the GP, PILCO employs model-based policy search for planning and uses the analytic gradient of the cost function to optimize the policy parameters. The policy is improved episodically through the gradient information of (13): each policy update step moves in the gradient direction toward the high-reward region of the action space to search directly for the optimal policy (best action sequence). Since the cumulative reward function and the transition function are differentiable with respect to the policy parameter $\theta$, the analytic gradient $dJ(\pi\_\theta)/d\theta$, which depends on the policy parametrization and involves several applications of the chain rule [16], is available for many interesting control parametrizations. Finally, an advantage of PILCO is that, through the analytic expression of the cost as a function of the policy parameter, any standard gradient-based optimization method can be used to search directly in the policy space for the optimal policy parameter $\theta$, which may be of high (say, thousands of) dimensions, minimizing the total cost $J(\pi\_\theta)$ so as to obtain the desired state trajectory with higher reward.
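A minimal illustration of gradient-based policy search on the vehicle task, assuming a hypothetical saturated PD policy and finite-difference gradients in place of PILCO's analytic gradients through the GP model:

```python
import math

def rollout_cost(theta, L=5.0, m=0.5, c=0.1, u_max=4.0, dt=0.1, N=40, sigma_c=1.0):
    """Accumulated saturating cost (13)-(15) with gamma = 1 for a deterministic
    rollout of the Euler-discretized vehicle under a saturated PD policy
    (a simple stand-in for the RBF policy; the gains k_p, k_d are hypothetical)."""
    k_p, k_d = theta
    x, v, total = 0.0, 0.0, 0.0
    for _ in range(N):
        u = max(-u_max, min(u_max, k_p * (L - x) - k_d * v))
        x, v = x + dt * v, v + dt * (u - c * v) / m
        total += 1.0 - math.exp(-((x - L) ** 2 + v ** 2) / (2.0 * sigma_c ** 2))
    return total

def improve_policy(theta0, iters=100, lr=0.05, eps=1e-4):
    """Gradient descent on J(theta) with finite-difference gradients; PILCO
    instead computes this gradient analytically via the chain rule, which is
    what makes its policy search efficient. Returns the best theta seen."""
    best, best_cost = list(theta0), rollout_cost(theta0)
    theta = list(theta0)
    for _ in range(iters):
        grad = []
        for i in range(len(theta)):
            tp, tm = theta[:], theta[:]
            tp[i] += eps
            tm[i] -= eps
            grad.append((rollout_cost(tp) - rollout_cost(tm)) / (2 * eps))
        theta = [t - lr * g for t, g in zip(theta, grad)]
        cost = rollout_cost(theta)
        if cost < best_cost:
            best, best_cost = theta, cost
    return best, best_cost
```

The descent direction and step size are the only design choices here; any standard gradient-based optimizer could replace the inner loop, as noted above.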

#### **4. Simulation results and discussions**

#### **4.1 Simulation scenario and settings**

For the simulation, we consider a simple aggressive driving task whose action space is one-dimensional acceleration in a given interval, and whose state space is a two-dimensional space of position and velocity, with the position confined to a linear track to constrain the exploration for sample efficiency. The driving scenario is visualized in **Figure 1**, in which the autonomous car is represented by a black box. We used a double integrator with an unknown but constant point mass to simulate the behavior of a real vehicle traveling along a straight line on flat ground with unknown but constant viscous friction. Let $x(t)$ denote the distance traveled by a point mass $m$ on a frictional ground with viscous friction coefficient $c$, controlled by an applied control subject to the symmetric constraint $u \in U\_{ad} = [-u\_{\max}, u\_{\max}]$, where the action space $U\_{ad}$ is an admissible control set (a convex polytope) that overcomes static friction and avoids slipping. Defining $\mathbf{x} = [x, \dot{x}]^T$ as the state vector of the simulated vehicle, the state equations for the vehicle dynamics can be written as

$$
\dot{\mathbf{x}} = \mathbf{A}\mathbf{x} + \mathbf{b}u, \quad \mathbf{x}(0) = \mathbf{0} \tag{16}
$$

where

$$\mathbf{A} = \begin{bmatrix} 0 & 1 \\ 0 & -\frac{c}{m} \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 0 \\ \frac{1}{m} \end{bmatrix} \tag{17}$$

Since the vehicle (16) is controllable, the existence of a time-optimal control input with at most one switching that steers the vehicle from the rest state $\mathbf{x}(0) = \mathbf{0}$ at time zero to a neighborhood of another state $\mathbf{x}(t)$ in a given time $t$ is ensured by controllability and the Pontryagin Maximum Principle (PMP). In Appendix A, we provide the analytical solution of the state-to-state transfer TOCP for (16) with explicit parameter dependence. The insight offered by the analytical solution for learning and its performance is that, given the symmetric control bound, the control trajectory and switching time are determined by $c/m$ and $c$ (the natural frequency and damping ratio), while the state (position and velocity) trajectory depends on $c/m$ and $m$ explicitly and on $c$ implicitly via the control.

**Settings**. For concreteness, the learning scenario is illustrated in **Figure 1**, in which the vehicle is represented by a black box. The vehicle had a horizontal length of 30 cm and a mass of 0.5 kg. In addition, to mimic a real vehicle on the road, we set a symmetric bound of $\pm 4$ m/s² on the acceleration control and a friction coefficient of 0.1. The motion began from rest at the origin and proceeded along a linear track for a fixed duration $T\_{\text{terminal}}$ per an acceleration policy. It is desired that the target state $\mathbf{x}(T) = \mathbf{x}\_{\text{target}} = [5, 0]^T$ for a prescribed distance $L = 5$ be reached with a $T \le T\_{\text{terminal}}$ value that is as small as possible (i.e., the policy determines the maximum $\dot{x}$ for each point $x$ on the pre-specified path and a travel time $T$ for task completion), where $T\_{\text{terminal}}$ is the duration of learning. Crucially, the agent itself has no prior knowledge of the mass or friction coefficient. The simulation was conducted on an Intel Core i7-8700k with 16 GB RAM using MATLAB.

To facilitate the episodic policy search in PILCO, we set each episode to $T\_{\text{terminal}} = 4$ s, and each episode was further discretized into 40 time steps (i.e., $\Delta t = 0.1$ s per step). We ran 16 episodes for the agent to learn the task, and the first episode was randomly initialized. The code is available at https://github.com/brianhcliao/PILCO\_AlphaBot. Our choice of learning time $T\_{\text{terminal}} = 4$ s (for one episode) in the simulations was a trade-off between computational cost (which increases if more

#### **Figure 1.**

*Setup of one-dimensional state-to-state transfer task. The black box depicts the car, which is modeled using a point-mass double integrator. The car begins moving from the origin (green star) at rest along a straight line to reach the target (red star) along a rough plane. The task involved the execution of different acceleration control policies on the double integrator with embedded uncertainties from the same rest state to reach a target state. The resulting state–input pair and cost at each sampling time point were recorded.*


data are collected) and performance, where overly small and overly large values of $T\_{\text{terminal}}$ result in aggressive and relaxed learning, respectively. In each episode, the vehicle was returned to the same initial state (rest state at $t = 0$) before a policy in the parametric form of (18, 19) was applied to the simulated damped double integrator model (16). In this task, both the state and the input could be observed for data collection. In one control cycle and for each maneuver policy, we collected a batch of trajectory data generated by (16) (the trajectory did not violate the acceleration limits) at the sampling time points $t = 0.1, 0.2, \dots, 3.9, 4.0$ (in seconds) over the time horizon $[0, 4]$. Therefore, a total of 16 episodes (including an initial random-policy round) were executed, each of which comprised $N = 40$ samples of state–control pairs. Subsequently, constant-size data sets $D^i$ ($i = 1, 2, \dots, 16$) were constructed, each composed of a sequence of precisely 40 state–action pairs, recording the states visited by the vehicle and the corresponding control inputs

$$D^i = \left( \left( \mathbf{x}\_1^{(i)}, \mathbf{u}\_1^{(i)} \right), \dots, \left( \mathbf{x}\_{N\_{\text{target}}}^{(i)}, \mathbf{u}\_{N\_{\text{target}}}^{(i)} \right), \dots, \left( \mathbf{x}\_{40}^{(i)}, \mathbf{u}\_{40}^{(i)} \right) \right) \tag{18}$$

$$\mathbf{u}\_1^{(i)} = \pi\left(\mathbf{x}\_1^{(i)}, \boldsymbol{\theta}^{(i-1)}\right), \dots, \mathbf{u}\_{40}^{(i)} = \pi\left(\mathbf{x}\_{40}^{(i)}, \boldsymbol{\theta}^{(i-1)}\right) \tag{19}$$

where $N\_{\text{target}} \le N = 40$ denotes the first sampling time at which the car passes through the target region. These 40 state–control pairs collected at different sampling times are correlated via the state transition, which maps the current state and acceleration to the next state. The discrete-time system serving as the state transition function is obtained by applying the Euler method to (16), which also yields the controllability matrix

$$\mathbf{x}\_{n+1}^{(i)} = \mathbf{x}\_{n}^{(i)} + \Delta t \mathbf{A} \mathbf{x}\_{n}^{(i)} + \Delta t \mathbf{b} u\_{n}^{(i)} = [\mathbf{I} + \Delta t \mathbf{A}] \mathbf{x}\_{n}^{(i)} + \Delta t \mathbf{b} u\_{n}^{(i)} \tag{20}$$

$$\mathbf{C} = \mathbf{I} + \Delta t \mathbf{A} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 - \Delta t \frac{c}{m} \end{bmatrix} = \text{controllability matrix} \tag{21}$$

where $n = 0, 1, 2, \dots, 39$; $\mathbf{x}\_0^{(i)} = \mathbf{x}\_0$ and $u\_0^{(i)}$ are given, with $u\_0^{(i)}$ an initial input for arbitrary initial exploration. Since $\mathbf{C}$ is nonsingular for any $m$ and $c$, the system considered is ensemble controllable, guaranteeing the existence of an appropriate input for the steering task even when the system parameters are unknown. The data set $D^i$ in each trial recorded how the vehicle modified its input at each sampled point on the path in accordance with the stage cost at each sampling time point, which was the only feedback information the vehicle received during the vehicle–environment interaction. The $i$-th episode of learning data served as a demonstration: in the next, $(i+1)$-th, learning cycle the reward was used to revise the policy through an optimization over action sequences that minimizes the distance between the predicted future state and the target with large progress; this revision was conducted through an online estimation of the vehicle model based on the batch of data collected over the whole completed episode.
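The episodic data collection of (18)–(20) can be sketched as follows; `step` implements the Euler transition (20) with the nominal parameters, and `collect_episode` records one episode's state–action pairs, with the `policy` argument standing in for (19):

```python
def step(x, u, dt=0.1, m=0.5, c=0.1):
    """Euler transition (20) for the nominal parameters; x = (position, velocity)."""
    pos, vel = x
    return (pos + dt * vel, vel + dt * (u - c * vel) / m)

def collect_episode(policy, x0=(0.0, 0.0), N=40):
    """Roll out one episode and record the state-action pairs forming D^i (18).
    `policy` maps a state to a control input."""
    data, x = [], x0
    for _ in range(N):
        u = policy(x)
        data.append((x, u))
        x = step(x, u)
    return data
```

For example, `collect_episode(lambda x: 4.0)` rolls out 4 seconds of full-throttle data from rest, the kind of batch the GP model is trained on after each trial.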

#### **4.2 Data-driven approximate time-optimal control design**

We posit basis functions $\{\phi\_i\}$, $i = 1, 2, \dots, n\_b$, where $n\_b$ is the total number of basis functions, and represent the policy as a linear combination of the basis functions with a set of parameters [16]:

$$u(\mathbf{x}, \theta) = \sum\_{i=1}^{n\_b} w\_i \phi\_i(\mathbf{x}) \tag{22}$$

We choose $\phi\_i(\mathbf{x})$ in the form of a Gaussian radial basis function (RBF)

$$\phi\_i(\mathbf{x}) = \exp\left(-\frac{1}{2}(\mathbf{x} - \mu\_i)^T \Delta^{-1} (\mathbf{x} - \mu\_i)\right) \tag{23}$$

where $\{\mu\_i\}$, $i = 1, 2, \dots, n\_b$, are support points and $\Delta$ is the covariance matrix. Then $\theta\_i = [w\_i, \mu\_i, \Delta]$ represents the policy parameter vector of the weight, mean, and covariance of each Gaussian RBF. The choice of a parametric controller (18, 19) in the form of a linear combination of Gaussian RBFs rests on its function approximation and good generalization properties, and the controller parameters and the number of basis functions can be optimized using various methods. This implies robustness to vehicle parametric uncertainties and disturbances during learning. Therefore, the class of learning-based parametrized control policies we consider for effectively rejecting external disturbances and compensating for parametric uncertainties is defined by a mapping from the weights $w\_i$ and basis functions $\phi\_i$ to a full (predicted) state (position and velocity) feedback control. Since the optimal control is generally unknown, $n\_b$ is chosen sufficiently large (100 in simulation) to allow an accurate approximation in model prediction and input-constrained control for the specific state-to-state transfer task.
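A sketch of evaluating the RBF policy (22)–(23) at a state, with the output clipped to the admissible set (PILCO itself bounds the preliminary policy through a smooth squashing function; hard clipping is used here only for simplicity):

```python
import math

def rbf_policy(x, centers, weights, inv_cov, u_max=4.0):
    """Policy (22)-(23): weighted sum of Gaussian RBFs over the 2-D state,
    squashed into the admissible set [-u_max, u_max]. inv_cov is the (symmetric)
    inverse covariance Delta^{-1} shared by all basis functions."""
    u = 0.0
    for mu, w in zip(centers, weights):
        d = [x[0] - mu[0], x[1] - mu[1]]
        quad = (d[0] * inv_cov[0][0] * d[0]
                + 2.0 * d[0] * inv_cov[0][1] * d[1]
                + d[1] * inv_cov[1][1] * d[1])
        u += w * math.exp(-0.5 * quad)     # phi_i(x) from (23), weighted by w_i
    return max(-u_max, min(u_max, u))
```

During policy optimization, the weights, centers, and covariance would all be entries of $\theta$; here they are passed explicitly to keep the sketch self-contained.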

#### **4.3 Model learning performance and efficiency**

PILCO uses a GP that relates the policy learning performance to the double integrator with embedded uncertainties [specifically the mass $m$ and friction coefficient $c$ in (16)] through its interactions with the environment for model learning. The GP maps the policy parameters to the reward, which correlates policy learning with one-step predictive model trajectory learning. How accurate the learned model trajectories are can thus be measured by the reduction in travel time, or increase in reward, over episodes, since an accurate dynamics model is required to derive the optimality conditions. In our case, maneuver time minimization is not directly encoded in the transformed cost (13, 14). In fact, by maximizing the expected sum of discounted rewards of progress (or elapsed distance measured along the path) per step, applying an admissible acceleration at each time step over the entire episode horizon $[0, T\_{\text{terminal}}]$ of learning, the maneuver or control policy tends to achieve maximum progress per sampling time. Since the decrease of the distance to the target is directly related to the reduction of travel time, it is worth noting that the model learning performance gradually improves over episodes, as validated in the cost vs. number of episodes (time complexity) plot of **Figure 2**, with reduced cost and, consistently, reduced motion duration. This is because the decay factor $\gamma^t$ in the time-accumulated trajectory cost (13) favors the immediate distance cost (15) by maximizing the distance traveled in the first few samples. This promotes goal-reaching with large progression per step at the beginning of the trajectory subject to the

#### **Figure 2.**

*Graph of total cost against episode. Cost was reduced until (near) convergence was achieved at the second episode and was stable after convergence.*

#### **Figure 3.**

*Converged learning outcome of state (position, velocity) and input (acceleration) trajectories of the vehicle. The velocity profile was triangular, and the acceleration input exhibited bang–bang control characteristics. After reaching the neighborhood of the target, the vehicle attempted to stop and remain at the target. A little steady-state oscillation around the target was present because the vehicle could not decelerate fast enough to come to rest at the target. That is, the required fast motion induces overshooting of the target if the vehicle does not brake in time, after which it reverses as hard as possible to get closer to the target.*

symmetric acceleration bound of the vehicle, thus yielding a faster trajectory so as to fit trajectories with higher rewards.

Two highly different trajectory behaviors can be seen in **Figure 3a**, while **Figure 3b** shows the learning curves in the position–velocity phase plane. The initial motion direction is either aligned with the desired direction toward the target or counter-directional to it. The first trajectory, generated in the first episode, was counterintuitive in that the vehicle drove away from the destination. This unusual behavior was associated with a high trajectory cost due to divergence from the target, causing the trajectories in the second to last episodes to switch their direction of motion so that the vehicle initially traveled toward the target. We observe that in the second episode, a nearly correct state response was generated to fit the highest reward. The learning behavior from the second episode to the final one is nearly identical, with a small steady-state oscillation due to the absence of a terminal cost controlling the terminal state at the end of the horizon. The task-specific cost (15) approaches zero as the final state only reaches a neighborhood of the target (i.e., $\mathbf{x}(T) \to \mathbf{x}\_{\text{target}}$), considering the inaccuracy in reaching the target and the residual vibration present after the vehicle reached the neighborhood of the target. The required fast motion induces overshooting of the target, and when overshoot occurs, a turn-back toward the target is observed during learning. The policy is updated and exploration is terminated to maintain the reward at its highest throughout the remaining learning episodes.

#### **4.4 Policy learning performance-verifying time-optimality**

Theoretically, in the ideal situation of no external disturbances and no model errors, and taking the input constraints into account, the analytical solution can be obtained (see Appendix A) for the damped double integrator: the time-optimal acceleration input is a bang-bang control [20] (i.e., piecewise constant at $\pm u\_{\max}$, on the boundary of the admissible control set $U\_{ad}$ at all times) with one sign change (see also the interpretation in Section 5.2). Accordingly, the effectiveness of the learned policy with respect to the time-optimal motion task is measured against the analytical solution. As shown in **Figure 4b** and **Table 1**, the learned velocity trajectory visually converges to a profile very similar to the analytical solution. We see that the characteristics of the exact solution are learned: the learned velocity profile almost coincides with the time-optimal zigzag profile with exactly one switching point, whose height and corresponding time are, respectively, the maximum allowed velocity and $t\_{sw}$, and the optimal acceleration is applied at each possible state on the linear track under the symmetric acceleration constraint. For different $c/m$ and $c$ parameters, the fastest goal-reaching behavior from the same initial rest state results in different zigzag velocity profiles with different travel times for given $L$ and fixed $u\_{\max}$. **Table 1**
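The analytical switching time can be cross-checked numerically. Interpreting $u$ as a bounded force acting on mass $m$ (consistent with $\mathbf{b} = [0, 1/m]^T$ in (17)), bisection on $t\_{sw}$ reproduces the Table 1 reference value $t\_{sw} \approx 0.79057$ for $c = 0$; this sketch and its parameter defaults are ours:

```python
def simulate_bang_bang(t_sw, m=0.5, c=0.1, u_max=4.0, dt=1e-4):
    """Integrate m*x'' = u - c*x' under +u_max until t_sw, then -u_max until
    the velocity returns to zero. Returns (final_position, total_time T)."""
    x, v, t = 0.0, 0.0, 0.0
    while t < t_sw:                          # acceleration phase
        v += dt * (u_max - c * v) / m
        x += dt * v
        t += dt
    while v > 0.0:                           # braking phase, stop at v = 0
        v += dt * (-u_max - c * v) / m
        x += dt * v
        t += dt
    return x, t

def switching_time(L=5.0, m=0.5, c=0.1, u_max=4.0):
    """Bisect on t_sw: the stopping position grows monotonically with t_sw."""
    lo, hi = 0.0, 5.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        x_end, _ = simulate_bang_bang(mid, m, c, u_max)
        if x_end < L:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For $c = 0$ the closed form is $t\_{sw} = \sqrt{Lm/u\_{\max}} = \sqrt{0.625} \approx 0.79057$ and $T^{\ast} = 2t\_{sw}$, matching Table 1; with $c > 0$ the same routine shows the longer travel time caused by friction.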

#### **Figure 4.**

*Learned response in time and phase plane. Episode 1 features the car reversing. In episode 2, a correct, nearly timeoptimal motion toward the target was produced. (a) Vehicle position at each episode over 40 sampling time steps spanning a time horizon of 4 s. (b) Predicted position–velocity trajectories in the phase plane through application of data-driven control on learned model for the goal-reaching task along a linear track. The state trajectories became more accurate (time-optimal) with respect to the predicted velocity trajectory as the model was updated during learning. The time-optimal velocity trajectory upon convergence with one instance of switching is shown.*



*a m is the mass, a given parameter.*

*b c is the coefficient of friction, also a given parameter.*

*c tsw is the switching time under bang-bang control.*

*d T is the minimum time spent in the whole motion.*

#### **Table 1.**

*This shows tsw and T∗ of the triangle profile in the presence of* ∓4% *deviation of both mass and friction coefficient simultaneously from the accurate c = 0.1, m = 0.5, for given L = 5 and umax = 4. For comparison, tsw = 0.79057 and T∗ = 2tsw = 1.58114 for the case of c = 0. This indicates that friction clearly slows down the fastest motion.*

shows the effect of parameter variations, due to the approximate dynamics model, on the achievable minimum time. Note that $T^{\ast} \neq 2t\_{sw}$ when $c \neq 0$. This is because the vehicle must not only apply the highest acceleration to reach the highest speed at $t\_{sw}$ and decelerate sufficiently fast to meet the null boundary conditions, but is also affected by the viscous friction force, which resists any increase of the vehicle's speed. The acceleration policy that produces this unique zigzag position–velocity profile, corresponding to the bang-bang control with the appropriate switching time to meet the boundary conditions, is what the policy learning based on the learned model attempts to match.

In conclusion, the success of the simulated goal-reaching trajectories along the track (task-relevant states in the reachable set) rendered the time-optimal trajectories on a straight road feasible. As the simulation shows, the objective function decreases the approximation errors introduced by the currently learned GP model and consistently reduces the travel time very effectively during the optimal policy search, fitting trajectories with higher rewards, and the resulting policy approximately encodes the desired spatial and temporal correlations. However, the GP learning results from the collected data set are valid only for the given path and task, whereas the laws governing physical dynamics apply across paths and tasks. Furthermore, there is a discrepancy between the desired time-optimality and the objective of cumulative maximum progress per step that we implemented; we observed that the algorithm, despite the natural exploration–exploitation characteristics of the saturating cost function, was sometimes not guaranteed to be globally optimal. It could get stuck in a local optimum because the optimization problem is not convex [16, 18], but it still yields good results owing to the learning convergence of the generalizable controller as an approximate solution to the TOCP, whose sensitivity with respect to variations of the parameters $m$ and $c$ is small.

#### **5. Sim2Real policy transfer experiment**

In real-world experiments, a relatively general uncertain nonlinear vehicle model is either not readily available or overly complicated for the design of state-to-state steering maneuvers. Therefore, one approach to sim2real control is the transfer of skills learned from a simple simulated model to a similar real system executing a similar task. To test the learned policy in a real-world environment, as detailed in this section, we conducted a sim2real validation experiment with a low-cost Raspberry Pi–controlled small car, called AlphaBot, shown in **Figure 5**. The simple car was equipped with photo-interrupters and ultrasonic sensors for state measurements. It could be controlled to either accelerate or brake. Because the robot unit was inexpensive, the low pulse-width-modulation duty cycles translated from the control signals might be incapable of driving the car, which also makes the task more complex. Before conducting the experiment, we verified that the robot had sufficient power to travel a distance of $L$ along the linear track to reach the target. In contrast to the simulation scenario, the experimental setup had an additional velocity limit for the car, which functioned as a state constraint. The velocity limit stemmed from the voltage constraints of the board and was therefore not directly handled by the control signal.

**Figure 5.**

*Low-cost model car AlphaBot used in the experiment (left photo) and the experimental setup for driving AlphaBot along a linear track.*

#### **5.1 Setup**

A small car was designated to travel from a point 180 cm from the wall to a point 50 cm from the wall. Its heading was fixed at a forward-facing angle of 0° for straight-line motion. The task characteristics (system dynamics along a linear track on a frictional plane) and the learning environment in the simulation and the experiment were very similar. We assumed that the model of the car in the experiment was similar: a second-order dynamics model with slightly changed parameters. This model is universal because it is based on fundamental physical principles. Thus, we could reuse the same RBF controller (18, 19) that was used on the simulated double integrator (16) as a time-optimal feedback control of the actual vehicle. When applied to a real vehicle, the policy learned from simulation runs can thus be viewed as an optimal demonstration by the simulated system (16). This policy provides a good understanding of the time-optimal motion of the vehicle with only an input constraint, under no disturbance and no state constraint. In the experiment, the control signal generated by the RL algorithm ranged from −2 to 2 and was translated into the rate of change of the duty cycles of the motor PWM signal on board.


**Figure 6.**

*Experimental results. (a) Graph of total cost against episodes. Trajectory cost decreased after some transients until convergence at the seventh episode where a travel time of approximately 2 s was achieved. (b) Intermediate costs in selected episodes. (c) Outcome with respect to learning state and control trajectories.*

#### **5.2 Results**

As indicated in **Figure 6**, the sim2real transfer experiment, which applied the same data-driven control learned from simulation, worked well on the low-cost car. The vehicle drove at each point on the path at its highest allowable velocity; this observation was similar to that in the simulation. This validates the slight generalization capability (robustness and stability) of the MBRL with the Gaussian RBF kernel under similar dynamics and task constraints. The learning curve is illustrated in **Figure 6a**. The total cost was 40 at the initial trial of the task for the saturating cost function. The vehicle attained a lowest cost of approximately 16 and a travel time of approximately 2.2 s at the seventh episode, with a total experience time of 28 s. This approach was deemed very efficient because learning succeeded after only a small number of trials. In **Figure 6b**, the immediate cost at every time step (per-step cost) in various episodes is plotted, showing the learning process and indicating that the arrival time decreased over episodes. The decrease was, however, not monotonic; thus, the intermediate trajectories prior to the completion of learning may not be acceptable until a nearly time-optimal control input is finally obtained at convergence, which was the desired control goal. As indicated in **Figure 6c**, by following a nearly time-optimal velocity along the track, the final position of the car, despite not overshooting the target or fluctuating, was off the target by 50 cm, partly because of modeling uncertainties and localization errors from lateral tracking error and sensor inaccuracies. A velocity limit was set because of the electronic voltage constraints on the robot and was therefore not directly handled by the robot control signal.

#### **5.3 Discrepancies between simulation and experiment**

Prior knowledge of a detailed description of the vehicle dynamics, which contain uncertainties, is neither available nor required in either the simulation or the sim2real experiment. Instead, the model is learned through Gaussian processes to reduce the accumulated cost, thus contributing to a decrease in the maneuver time. However, learning in the simulation tends to require fewer episodes to converge, which possibly stems from the following major differences. The first important difference is that the learning experiment is based on more complicated vehicle dynamics. In fact, inherent factors confronted in the real-world experiment cannot be neglected or accounted for in a simple physical model (16). The factors that can affect the control performance include uncertainties caused by lateral drift and wheel sideslip of the vehicle; physical properties such as inertia and complicated, hard-to-model friction characteristics (such as Coulomb friction proportional to $\text{sign}(\dot{x})$ and aerodynamic (quadratic) drag proportional to $\dot{x}^2$, in addition to viscous friction proportional to $\dot{x}$); and unknown external forces inherent in the vehicle–terrain interaction resulting from inaccurate sensor measurements, motor characteristics, torque disturbances, and variations of environment interactions such as uneven ground. These factors make the prediction of the sampled states on the linear track under the input ambiguous and affect the stability of the system in reaching the target. Vehicle hardware considerations also entail an additional velocity limit (due to a limit on how much voltage the hardware can take) that ensures that rapid motion causes no harm or instability to the vehicle in each trial. These factors result in differing state trajectories in the experiment and the simulation under the same control. Owing to the low sensitivity of the TOCP with respect to the system parameters, however, the convergence of learning to the approximate solution is ensured under these factors. Therefore, a new approximate optimal solution is recovered by the learning controller if the system deviates from the initially demonstrated trajectory.

#### **5.4 Interpretations of learning results**

In a scenario of short-distance driving on a linear path, we see that the data-driven control (18, 19), realizing the nearly time-optimal state-to-state steering control on a simulated model, serves as a good approximate time-optimal control of the actual system with similar dynamics and task characteristics. The error in the sim2real experiment between the simulated model learned from the data set and the learned near time-optimal control performance can be split into two parts. The first is the error between the actual state $\mathbf{x}\_a(t)$ and the theoretical time-optimal state $\mathbf{x}\_{\text{opt}}(t)$. The second is the error between the theoretical time-optimal state $\mathbf{x}\_{\text{opt}}(t)$ and the learned simulated state $\mathbf{x}\_t$. Let $(\mathbf{x}\_{\text{opt}}(t),\ u\_{\text{opt}}(t) = u(\mathbf{x}\_{\text{opt}}(t), \theta\_{\text{opt}}))$ be the time-optimal state–input pair of the simulation model (16). For a successful sim2real experiment, we assume that the simulation model (16) can be extended to the actual system (24) in the real experiment

$$\dot{\mathbf{x}}\_{a} = \mathbf{A}\mathbf{x}\_{a} + \mathbf{b}u\_{\text{opt}} + \delta \mathbf{A}\, \mathbf{x}\_{a} \tag{24}$$

$$\mathbf{x}\_{a}(0) = \mathbf{0}, \quad u \in U\_{ad} = [-u\_{\text{max}}, u\_{\text{max}}]$$

where $\mathbf{x}\_a(t)$ is the actual state, and

$$
\delta \mathbf{A} = \begin{bmatrix} 0 & 0 \\ 0 & -\frac{\delta c}{m} \end{bmatrix} \tag{25}
$$

is a bounded additive disturbance due to the error $\delta c$ between the actual friction and the simulated friction setting, together with other a priori unknown factors linearly related to $\mathbf{x}\_a$; $u\_{\max}$ is the maximum control limit (or state constraint) imposed by the hardware setting. Note that (24) applies the same time-optimal control (18, 19) learned from the simulated system. We assume the error is bounded, $|\delta c| \le k\_c$ for some $k\_c > 0$. Since $\mathbf{x}\_{\text{opt}}$ satisfies (16), we then have

$$
\dot{\mathbf{x}}_a - \dot{\mathbf{x}}_{\text{opt}} = \mathbf{A}\left(\mathbf{x}_a - \mathbf{x}_{\text{opt}}\right) + \delta\mathbf{A}\,\mathbf{x}_a \tag{26}
$$

*Velocity Planning via Model-Based Reinforcement Learning: Demonstrating Results on PILCO… DOI: http://dx.doi.org/10.5772/intechopen.103690*

Integrating (26) and rearranging terms, we have

$$\mathbf{x}\_{a} - \mathbf{x}\_{\text{opt}} = e^{\mathbf{A}t} \int\_{0}^{t} e^{-\mathbf{A}s}\,\delta \mathbf{A}\,\mathbf{x}\_{a}(s)\,ds \tag{27}$$

for $0 < t < T_{\text{terminal}}$. Since $T_{\text{terminal}}$ is finite, $\mathbf{x}_a$ is bounded on $[0, T_{\text{terminal}}]$ (note that (24) can be solved analytically), so the difference $|\mathbf{x}_a - \mathbf{x}_{\text{opt}}|$ is proportional to $k_c$ with a constant $C$ independent of $t$. Therefore, the error between the actual state $\mathbf{x}_a(t)$ and the theoretical time-optimal state $\mathbf{x}_{\text{opt}}(t)$ becomes arbitrarily small as $k_c$ tends to 0.
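The linear dependence of the tracking error on the friction mismatch bound $k_c$ can be checked numerically. The sketch below is our illustration, not from the chapter; the mass, friction values, switching time, and horizon are arbitrary assumptions. It integrates the nominal and perturbed plants under the same bang-bang input and confirms that halving the mismatch roughly halves the terminal state error.

```python
def simulate(c, m=1.0, u_max=1.0, t_sw=1.0, T=2.0, dt=1e-4):
    """Euler-integrate x'' = -(c/m) x' + u/m under a fixed bang-bang input
    (u = +u_max before t_sw, -u_max after); returns the terminal state."""
    x, v = 0.0, 0.0
    steps = int(round(T / dt))
    for k in range(steps):
        u = u_max if k * dt < t_sw else -u_max
        x, v = x + dt * v, v + dt * (-(c / m) * v + u / m)
    return x, v

c_nom = 0.5
x_opt, _ = simulate(c_nom)                 # nominal friction
errors = []
for dc in (0.08, 0.04, 0.02):              # shrinking friction mismatch (k_c)
    x_a, _ = simulate(c_nom + dc)          # same control, perturbed plant
    errors.append(abs(x_a - x_opt))

# Halving the mismatch should roughly halve the terminal state error.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(errors, ratios)
```

The ratios come out close to 2, consistent with the first-order bound above.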

On the other hand, the continuous dynamics with its transition function is discretized by the Euler method as

$$\mathbf{x}\_{t+1} = \mathbf{x}\_t + \Delta t \mathbf{A} \mathbf{x}\_t + \Delta t \mathbf{b} u\_t \tag{28}$$

Since this is the dynamic system learned by PILCO, the difference between $\mathbf{x}_t$ and $\mathbf{x}_{\text{opt}}$ is proportional to $(\Delta t)^2$ with a constant $C$ independent of $t$. Therefore, as $k_c, \Delta t \to 0$, we have $\mathbf{x}_a \to \mathbf{x}_{\text{opt}}$.
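Written out for the damped double integrator, the transition (28) is a plain affine map. The sketch below (with illustrative values for $m$, $c$, $\Delta t$, and the input, not taken from the chapter) rolls the discrete model forward under constant full thrust:

```python
import numpy as np

m, c, dt = 1.0, 0.5, 0.05           # illustrative parameters, not the chapter's
A = np.array([[0.0, 1.0],
              [0.0, -c / m]])       # damped double integrator dynamics
b = np.array([0.0, 1.0 / m])

def step(x, u):
    """One Euler transition x_{t+1} = x_t + dt*(A x_t) + dt*(b u_t), Eq. (28)."""
    return x + dt * (A @ x) + dt * b * u

x = np.zeros(2)
for _ in range(20):                 # hold full thrust u = 1 for one second
    x = step(x, 1.0)
print(x)                            # [position, velocity] after 1 s
```

The rollout stays close to the exact response of the linear system for small $\Delta t$, which is the transition PILCO's learned model has to approximate.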

#### **6. Conclusions**

A case study of TOCP, whose solution typically requires a known model with given boundary conditions for applying PMP, is framed as an MBRL task for rest-to-rest steering along a linear path by a vehicle modeled as an uncertain double integrator subject to a constant acceleration bound. An important aspect of this paper is a novel application of PILCO, an existing data-efficient MBRL framework, to approximately solve the TOCP for an uncertain damped double integrator without accurate parameters. Feasibility of learning convergence is empirically supported by the low parametric sensitivity of the exact time-optimal solution. The consistency of the learned velocity profile with that obtained by analytical time-optimal control is shown first in simulation and then in a sim2real experiment. The learned velocity can further be revised to respect a velocity limit through scaling, without replanning or relearning, for performing less aggressive maneuvers. Our case study expands the scope of problems that can be successfully solved by MBRL (specifically, PILCO), serving as a robust adaptive optimal control without a prior parametric model representation, and it demonstrates the capability of compensating for uncertainties and external disturbances that would otherwise cause the state trajectories to deviate from the optimal simulated state trajectory. For the challenging problem of learning a safe velocity for various road topologies and traffic flows, MBRL suffers from compounding error accumulated over long horizons; a comparison with other learning approaches to the solution of optimal control problems is left for future work.

#### **Acknowledgements**

This work was supported by internal funding from the Institute of Information Science, Academia Sinica, Taipei, Taiwan.

#### **Appendix A**

Instead of a data-driven assessment by comparison with state-of-the-art algorithms, an analytical time-optimal solution is provided to validate the learned trajectory. The system of the TOCP for the damped double integrator can be rewritten as

$$\frac{d^2x}{dt^2} = -\frac{c}{m}\frac{dx}{dt} + \frac{1}{m}u(t) \tag{29}$$

where

$$\begin{aligned} x(0) &= 0, \quad x(T) = L \\ \dot{x}(0) &= 0, \quad \dot{x}(T) = 0 \\ |u(t)| &\le u\_{\max} \end{aligned} \tag{30}$$

Integrating Eq. (29), we get

$$
\dot{\boldsymbol{x}}(t) - \dot{\boldsymbol{x}}(0) = -\frac{c}{m}(\boldsymbol{x}(t) - \boldsymbol{x}(0)) + \frac{1}{m}(U(t) - U(0))\tag{31}
$$

where

$$U(t) = U(0) + \int\_0^t u(s)ds\tag{32}$$

Simplifying Eq. (31) by invoking the boundary conditions, we get

$$
\dot{\mathbf{x}}(t) = -\frac{c}{m}\mathbf{x}(t) + \frac{1}{m}(U(t) - U(0))\tag{33}
$$

Thus,

$$x(t) = x(0) + e^{-\frac{c}{m}t} \int\_0^t \frac{e^{\frac{c}{m}s}}{m} \left(U(s) - U(0)\right) ds \tag{34}$$

The state trajectory with fixed initial state can thus be expressed in terms of the control $u(t)$, thereby removing the system dynamics constraint.

From the terminal condition at *T*, we get

$$L = e^{-\frac{c}{m}T} \int\_0^T \frac{e^{\frac{c}{m}s}}{m} \left(U(s) - U(0)\right) ds \tag{35}$$

Therefore,

$$\begin{split} L e^{\frac{c}{m}T} &= \int\_0^T \frac{e^{\frac{c}{m}s}}{m} \left( \int\_0^s u(\bar{s})\, d\bar{s} \right) ds = \int\_0^T \int\_0^s \frac{e^{\frac{c}{m}s}}{m}\, u(\bar{s})\, d\bar{s}\, ds \\ &= \int\_0^T \int\_{\bar{s}}^T \frac{e^{\frac{c}{m}s}}{m}\, u(\bar{s})\, ds\, d\bar{s} = \int\_0^T u(\bar{s}) \left( \int\_{\bar{s}}^T \frac{e^{\frac{c}{m}s}}{m}\, ds \right) d\bar{s} = \frac{1}{c} \int\_0^T u(\bar{s}) \left( e^{\frac{c}{m}T} - e^{\frac{c}{m}\bar{s}} \right) d\bar{s} \\ &= \frac{1}{c} e^{\frac{c}{m}T} \int\_0^T u(s)\, ds - \frac{1}{c} \int\_0^T u(s)\, e^{\frac{c}{m}s}\, ds = \frac{1}{c} e^{\frac{c}{m}T} \left( U(T) - U(0) \right) - \frac{1}{c} \int\_0^T u(s)\, e^{\frac{c}{m}s}\, ds \end{split} \tag{36}$$


Substituting into (33) with the boundary condition $\dot{x}(T) = 0$, we get

$$U(T) = Lc + U(0) \tag{37}$$

Therefore,

$$\int\_{0}^{T} u(s)\, e^{\frac{c}{m}s}\, ds = 0 \tag{38}$$

Thus, the conditions (32), (37) and (38) can be equivalently converted into

$$\begin{cases} \int\_0^T u(s)\, ds = Lc \\\\ \int\_0^T u(s)\, e^{\frac{c}{m}s}\, ds = 0 \end{cases} \tag{39}$$

Together with the constraint $|u(t)| \le u_{\max}$ on $[0, T]$, (39) gives the complete set of constraints on $u(t)$ for the problem. By the bang-bang principle, the solution $u(t)$ takes the values $\pm u_{\max}$ for $T$ to achieve the minimum. Since (29) is a linear second-order system, the bang-bang control has exactly one switching time. Let $t_{sw}$ denote the switching time and $T^*$ the minimum time. Based on this pattern for $u(t)$,

$$u(t) = \begin{cases} u\_{\text{max}} & \text{if } \ t \in [0, t\_{sw}] \\\\ -u\_{\text{max}} & \text{if } \ t \in [t\_{sw}, T^\*] \end{cases} \tag{40}$$

we can define

$$u(t) = c\_1 I\_{[0,\, t\_{sw}]} + c\_2 I\_{[t\_{sw},\, T^\*]} \tag{41}$$

where

$$I\_{[a,b]}(\mathbf{x}) = \begin{cases} \mathbf{1} & \text{if } \mathbf{x} \in [a,b] \\ \mathbf{0} & \text{otherwise} \end{cases} \tag{42}$$

Substituting (41) into (39), we obtain

$$\begin{cases} c\_1 t\_{sw} + c\_2 (T - t\_{sw}) = Lc \\\\ c\_1 \frac{m}{c} \left( e^{\frac{c}{m}t\_{sw}} - 1 \right) + c\_2 \frac{m}{c} \left( e^{\frac{c}{m}T} - e^{\frac{c}{m}t\_{sw}} \right) = 0 \end{cases} \tag{43}$$

Therefore, $c_1$ and $c_2$ are

$$\begin{cases} c\_1 = \frac{Lc \left( e^{\frac{c}{m}T} - e^{\frac{c}{m}t\_{sw}} \right)}{t\_{sw} \left( e^{\frac{c}{m}T} - 1 \right) - T \left( e^{\frac{c}{m}t\_{sw}} - 1 \right)} \\\\ c\_2 = \frac{-Lc \left( e^{\frac{c}{m}t\_{sw}} - 1 \right)}{t\_{sw} \left( e^{\frac{c}{m}T} - 1 \right) - T \left( e^{\frac{c}{m}t\_{sw}} - 1 \right)} \end{cases} \tag{44}$$

According to $|u(t)| \le u_{\max}$ and the bang-bang principle, we have

$$\begin{cases} c\_1 = \frac{Lc \left( e^{\frac{c}{m}T} - e^{\frac{c}{m}t\_{sw}} \right)}{t\_{sw} \left( e^{\frac{c}{m}T} - 1 \right) - T \left( e^{\frac{c}{m}t\_{sw}} - 1 \right)} = u\_{\max} \\\\ c\_2 = \frac{-Lc \left( e^{\frac{c}{m}t\_{sw}} - 1 \right)}{t\_{sw} \left( e^{\frac{c}{m}T} - 1 \right) - T \left( e^{\frac{c}{m}t\_{sw}} - 1 \right)} = -u\_{\max} \end{cases} \tag{45}$$

Thus, $e^{\frac{c}{m}T} - e^{\frac{c}{m}t_{sw}} = e^{\frac{c}{m}t_{sw}} - 1$, and we get $t_{sw} = \frac{m}{c} \ln\left(\frac{e^{\frac{c}{m}T} + 1}{2}\right)$. Now, substituting it into

$$c\_1 = \frac{Lc \left( e^{\frac{c}{m}T} - e^{\frac{c}{m}t\_{sw}} \right)}{t\_{sw} \left( e^{\frac{c}{m}T} - 1 \right) - T \left( e^{\frac{c}{m}t\_{sw}} - 1 \right)} = u\_{\max} \tag{46}$$

we obtain the minimum time $T = T^*$. (This can be computed by numerical methods; see the results in **Table 1**.)
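The numerical computation of $T^*$ can be sketched as follows: substitute $t_{sw}(T)$ into (46) and bisect on $T$, using the fact that the required input level $c_1$ decreases as more time is allowed. The parameter values below are illustrative assumptions, not those of **Table 1**.

```python
import math

def t_switch(T, m, c):
    """Switching time t_sw = (m/c) * ln((e^{cT/m} + 1) / 2)."""
    return (m / c) * math.log((math.exp(c * T / m) + 1.0) / 2.0)

def c1(T, L, m, c):
    """Bang level implied by (44) once t_sw(T) is substituted."""
    tsw = t_switch(T, m, c)
    E, F = math.exp(c * T / m), math.exp(c * tsw / m)
    return L * c * (E - F) / (tsw * (E - 1.0) - T * (F - 1.0))

def min_time(L, m, c, u_max, lo=1e-3, hi=100.0, tol=1e-10):
    """Bisect on c1(T) = u_max; c1 decreases monotonically as T grows."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if c1(mid, L, m, c) > u_max:
            lo = mid          # not enough time yet: required input too large
        else:
            hi = mid
    return 0.5 * (lo + hi)

T_star = min_time(L=1.0, m=1.0, c=0.1, u_max=1.0)
print(T_star, t_switch(T_star, 1.0, 0.1))
```

As a sanity check, for small friction the result approaches the frictionless bang-bang time $T^* = 2\sqrt{mL/u_{\max}}$, and the switching time lies slightly past the midpoint of the interval.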

#### **Author details**

Hsuan-Cheng Liao†, Han-Jung Chou† and Jing-Sin Liu\*† Institute of Information Science, Academia Sinica, Taipei, Taiwan, ROC

\*Address all correspondence to: liu@iis.sinica.edu.tw

† These authors contributed equally.

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] Ostafew CJ, Schoellig AP, Barfoot TD, Collier J. Speed daemon: Experience-based mobile robot speed scheduling. In: Canadian Conference on Computer and Robot Vision. USA: IEEE; 2014. pp. 56-62

[2] Rao AV. Trajectory optimization: A survey. In: Optimization and Optimal Control in Automotive Systems. Cham: Springer; 2014. pp. 3-21

[3] Bobrow J, Dubowsky S, Gibson J. Time-optimal control of robotic manipulators along specified paths. International Journal of Robotics Research. 1985;**4**(3):3-17

[4] Verscheure D, Demeulenaere B, Swevers J, De Schutter J, Diehl M. Time-optimal path tracking for robots: A convex optimization approach. IEEE Transactions on Automatic Control. 2009;**54**(10):2318-2327

[5] Tohid A, Norrlöf M, Löfberg J, Hansson A. Convex optimization approach for time-optimal path tracking of robots with speed dependent constraints. IFAC Proceedings Volumes. 2011;**44**(1):14648-14653

[6] Shin K, McKay N. Selection of near-minimum time geometric paths for robotic manipulators. IEEE Transactions on Automatic Control. 1986;**31**(6):501-511

[7] Wigstrom O, Lennartson B, Vergnano A, Breitholtz C. High-level scheduling of energy optimal trajectories. IEEE Transactions on Automation Science and Engineering. 2013;**10**(1):57-64

[8] Bianco CGL, Romano M. Optimal velocity planning for autonomous vehicles considering curvature constraints. In: IEEE International Conference on Robotics and Automation. USA: IEEE; 2007. pp. 2706-2711

[9] Dinev T, Merkt W, Ivan V, Havoutis I, Vijayakumar S. Sparsity-inducing Optimal Control via Differential Dynamic Programming. USA: IEEE; 2020. arXiv preprint arXiv:2011.07325

[10] Kunz T, Stilman M. Time-optimal trajectory generation for path following with bounded acceleration and velocity. In: Proceedings of Robotics Science and Systems VIII. Cambridge, Massachusetts, United States: MIT Press; 2012. pp. 1-8

[11] Jond HB, Nabiyev VV, Akbarimajd A. Planning of mobile robots under limited velocity and acceleration. In: 22nd Signal Processing and Communications Applications Conference. USA: IEEE; 2014. pp. 1579-1582

[12] Pham Q. A general, fast, and robust implementation of the time-optimal path parameterization algorithm. IEEE Transactions on Robotics. 2014;**30**(6): 1533-1540

[13] Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: Applications on robotics. Journal of Intelligent & Robotic Systems. 2017; **86**(2):153-173

[14] Kober J, Bagnell JA, Peters J. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research. 2013;**32**(11): 1238-1274

[15] Dulac-Arnold G, Levine N, Mankowitz DJ, Li J, Paduraru C, Gowal S, et al. Challenges of real-world reinforcement learning: Definitions, benchmarks and analysis. Machine Learning. 2021;**110**(9):1-50

[16] Deisenroth M, Rasmussen CE. PILCO: A model-based and data-efficient approach to policy search. In: 28th International Conference on Machine Learning (ICML-11). Bellevue, WA, USA: ICML; 2011. pp. 465-472

[17] Martinez-Marin T. Learning optimal motion planning for car-like vehicles. In: IEEE International Conference on Computational Intelligence for Modelling, Control and Automation. USA: IEEE; 2005. pp. 601-612

[18] Saha O, Dasgupta P, Woosley B. Real-time robot path planning from simple to complex obstacle patterns via transfer learning of options. Autonomous Robots. 2019:1-23

[19] Hartman G, Shiller Z, Azaria A. Deep reinforcement learning for time optimal velocity control using prior knowledge. In: IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI). USA: IEEE; 2018. arXiv preprint arXiv:1811.11615

[20] Liberzon D. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press; 2011

[21] Ozatay E, Ozguner U, Filev D. Velocity profile optimization of on road vehicles: Pontryagin's maximum principle based approach. Control Engineering Practice. 2017;**61**:244-254

[22] Stryk O, Bulirsch R. Direct and indirect methods for trajectory optimization. Annals of Operation Research. 1992;**37**(1):357-373

[23] Hauser J, Saccon A. A barrier function method for the optimization of trajectory functionals with constraints. In: Proceedings of the 45th IEEE Conference on Decision and Control. USA: IEEE; 2006. pp. 864-869

[24] Qian X, Navarro I, de La Fortelle A, Moutarde F. Motion planning for urban autonomous driving using Bézier curves and MPC. In: IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). USA: IEEE; 2016. pp. 826-833

[25] Song C, Boularias A. Identifying Mechanical Models Through Differentiable Simulations. Ithaca, New York: Cornell University; 2020. arXiv preprint arXiv:2005.05410

[26] Geist AR, Trimpe S. Structured learning of rigid-body dynamics: A survey and unified view. GAMM‐ Mitteilungen. 2020;**44**(2):e202100009. arXiv preprint arXiv:2012.06250

[27] Moerland TM, Broekens J, Jonker CM. Model-based Reinforcement Learning: A Survey. Ithaca, New York: Cornell University; 2020. arXiv preprint arXiv:2006.16712

[28] Liu M, Chowdhary G, Da Silva BC, Liu SY, How JP. Gaussian processes for learning and control: A tutorial with examples. IEEE Control Systems Magazine. 2018;**38**(5):53-86

[29] Pineda L, Amos B, Zhang A, Lambert NO, Calandra R. MBRL-LIB: A Modular Library for Model-based Reinforcement Learning. Ithaca, New York: Cornell University; 2021. arXiv preprint arXiv:2104.10159. Available from: https://github.com/facebookresearch/mbrl-lib

[30] Brunzema P. Review on Data-Efficient Learning for Physical Systems using Gaussian Processes. Berlin, Germany: ResearchGate; 2021. Available from: researchgate.net

[31] Sprague CI, Izzo D, Ögren P. Learning a Family of Optimal State Feedback Controllers. Ithaca, New York: Cornell University; 2019. arXiv preprint arXiv:1902.10139

[32] Kabzan J, Hewing L, Liniger A, Zeilinger MN. Learning-based model predictive control for autonomous racing. IEEE Robotics and Automation Letters. 2019;**4**(4):3363-3370

#### **Chapter 6**

## Self-Supervised Contrastive Representation Learning in Computer Vision

*Yalin Bastanlar and Semih Orhan*

#### **Abstract**

Although its origins date back a few decades, contrastive learning has recently gained popularity due to its achievements in self-supervised learning, especially in computer vision. Supervised learning usually requires a decent amount of labeled data, which is not easy to obtain for many applications. With self-supervised learning, we can use inexpensive unlabeled data and train on a pretext task. Such training helps us learn powerful representations. In most cases, for a downstream task, self-supervised training is fine-tuned with the available amount of labeled data. In this study, we review common pretext and downstream tasks in computer vision, and we present the latest self-supervised contrastive learning techniques, which are implemented as Siamese neural networks. Lastly, we present a case study where self-supervised contrastive learning was applied to learn representations of semantic masks of images. Performance was evaluated on an image retrieval task, and the results reveal that, in accordance with findings in the literature, fine-tuning the self-supervised training showed the best performance.

**Keywords:** self-supervised learning, contrastive learning, representation learning, computer vision, deep learning, pattern recognition

#### **1. Introduction**

For effective training, supervised learning requires a decent amount of labeled data, which is expensive. Unlabeled and inexpensive data (e.g. text and images on the Internet) is considerably more plentiful than the limited-size datasets labeled by humans. We can use unlabeled data and train on a pretext task, which is a **self-supervised** approach since we do not use the labels of our real task. Although the task and the defined loss are not the ones in our actual objective, we can still learn representations that are valuable enough to be used for the final task. We basically learn a parametric mapping from the input data to a feature vector or tensor. In most cases, a smaller amount of labeled data is used to fine-tune the self-supervised training.

Although its origins date as far back as the 1990s [1, 2], contrastive learning has recently gained popularity due to its achievements in self-supervised learning, especially in computer vision. In **contrastive learning**, a representation is learned by comparing input samples. The comparison can be based on the similarity between positive pairs or the dissimilarity of negative pairs. The goal is to learn an embedding space in which similar samples stay close to each other while dissimilar ones are far apart. Contrastive learning can be applied to both supervised and unsupervised settings. Let us consider the image classification problem. In the supervised setting, positive pairs are different instances with the same label and negative samples are selected from other labels (**Figure 1**). On the other hand, in the unsupervised (or self-supervised) setting, positive pairs are parts (or augmented versions) of the same instance and negative samples are other instances with any label. Khosla *et al.* [3] provide a performance comparison between supervised and self-supervised training for the image classification problem. A more comprehensive review of contrastive learning can be found in [4].

Since a self-supervised model does not know the actual labels corresponding to the inputs, its success depends on the design of pretext tasks that generate pseudo-labels from part of the input data. With these pseudo-labels, training on the pretext task is performed with a 'supervised' loss function. The final performance on the pretext task is not important; rather, we hope that the learned intermediate representations capture useful information and benefit a variety of downstream tasks.

Especially in computer vision and natural language processing (NLP), deep learning has become the most popular machine learning approach [5]. In parallel, self-supervised learning studies in computer vision have employed CNNs. **Figure 2** shows the knowledge transfer from self-supervised training to a supervised one in a deep learning setting. We keep the convolutional layers, which are assumed to produce the learned representations. We change or add fully connected layers, place a classifier head, and train with the limited amount of labeled data for a downstream task like image classification or object detection.

The remainder of this chapter is structured as follows. Pretext tasks that are common in literature are reviewed in Section 2. Section 3 has detailed information about recent self-supervised learning models that use Siamese architectures. Section 4 provides our own experimental study where self-supervised contrastive learning is employed to learn representations of semantic segmentation masks, which is followed by the conclusions in Section 5.

#### **Figure 1.**

*Self-supervised (left) vs. supervised (right) contrastive learning. Training results in an embedding space such that similar sample pairs stay close to each other while dissimilar ones are far apart. Figure is reproduced based on [3].*

*Self-Supervised Contrastive Representation Learning in Computer Vision DOI: http://dx.doi.org/10.5772/intechopen.104785*

#### **Figure 2.**

*A model is first trained on a pretext task with unlabeled data, then fine-tuned on the downstream task with a limited amount of labeled data. Usually the convolution layers, which are mostly responsible for learning representations, are transferred. A few fully-connected layers towards the end are changed or retrained.*

#### **Figure 3.**

*Several random transformations applied to a patch from the unlabeled dataset to be used for self-supervised learning. Original sample is in top-left. The idea was first used by [6] and the figure is from the original paper with author's permission.*

#### **2. Pretext tasks for self-supervised learning**

**Image Distortion**. We expect that when an image goes through a small amount of distortion, its semantic meaning does not change. Dosovitskiy *et al.* [6] used this idea to create an exemplar-based classification task, where a surrogate class is formed from each dataset sample by applying a variety of transformations, namely translation, scaling, rotation, contrast and color (**Figure 3**). When this approach is applied to whole image instances, it is called 'instance discrimination' [7], where augmented versions of the same image (positive pair) should have similar representations and augmented versions of different images (negative pair) should have different representations.

This is not only one of the first pretext tasks but also a very popular one. We will see in Section 3 that these kinds of augmentations have succeeded in learning useful representations and have achieved state-of-the-art results in transfer learning for downstream computer vision tasks.

**Image Rotation**. Each input image is first rotated by a multiple of 90° at random. A model is trained to predict the amount of rotation applied [8]. In the basic setting, it is a 4-class classification problem, but different versions can be conceived. To estimate the amount of rotation, this pretext task forces the model to learn semantic parts of objects, such as arms, legs, and eyes. Thus, it serves well for a downstream task like object recognition (**Figure 4**).

**Figure 4.**

*Self-supervised representation learning by rotating input images, implemented in [8]. The model classifies the rotation [0°, 90°, 180°, 270°].*
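The rotation pretext task is easy to sketch with NumPy alone; in this illustration (ours, not from [8]), dummy random arrays stand in for a real unlabeled image dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

def rotation_pretext_batch(images):
    """Rotate each image by a random multiple of 90 degrees; the multiple
    (0..3) is the pseudo-label for the 4-class pretext task of [8]."""
    labels = rng.integers(0, 4, size=len(images))
    rotated = [np.rot90(img, k) for img, k in zip(images, labels)]
    return np.stack(rotated), labels

batch = rng.random((8, 32, 32, 3))        # 8 dummy RGB 'images'
x_rot, y = rotation_pretext_batch(batch)
print(x_rot.shape, y[:4])
```

A classifier trained on `(x_rot, y)` never sees human labels; the pseudo-labels come for free from the transformation itself.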

**Jig-saw Puzzle**. Noroozi and Favaro [9] designed a jigsaw puzzle game, where a CNN model is trained to place 9 shuffled patches back in their original locations. Each patch is processed independently with shared weights, and a probability vector is estimated per patch. These estimations are then merged to output a permutation.

**Image Colorization**. The task is to colorize gray-scale images into colorful images [10]. A CNN is trained to predict the colorized version of the input (**Figure 5**). Obtaining a training dataset is inexpensive since training pairs can be easily generated. Model's latent variables represent grayscale images and can be useful for a variety of downstream tasks.

**Image Inpainting.** The pretext task is filling in a missing piece in the image (e.g. Pathak *et al*. [11]). The model is trained with a combination of the reconstruction (L2) loss and the adversarial loss. It has an encoder-decoder architecture and encoder part can be considered as representation learning.

The last two pretext tasks (image colorization and inpainting), as well as some other GAN-based approaches (e.g. image super-resolution [12]), are generation-based methods, where missing content is generated from the available input, whereas distortion, rotation, and jigsaw are context-based self-supervision methods. For more detailed literature on pretext tasks, we refer the readers to the review in [13].

In our study, we concentrate on the context-based approach. Taking advantage of contrastive learning, this approach nowadays achieves state-of-the-art performance

**Figure 5.**

*A model is trained to predict the colorized version of grayscale images (obtaining the dataset is inexpensive).*


[14–17]. We will go into details, especially the models with Siamese architecture, in Section 3. The generation-based and context-based method distinction also exists for video representation learning. In [18], an encoder network is used to learn video representations. Then, a decoder uses the representations to predict future frames. Differently, Qian *et al*. [19] employ contrastive learning with distortions (augmentations) to learn representations and to classify video clips.

The rest of our chapter will consider works on image data. Before proceeding, let us give a few examples where contrastive learning is used for image-text pairs. Contrastive Language-Image Pre-training (CLIP, [20]) is a pretext task, where a text encoder and an image encoder are jointly trained to match captions with images. Training set consists of 400 million (image,text) pairs and an inter-modal contrastive loss is defined such that image and text embeddings of same objects will be closer to each other. Then, this pretraining is employed for a downstream task of zero-shot class prediction from images. Li *et al*. [21] performed a similar task for semantic segmentation. An image encoder is trained with a contrastive objective to match pixel semantic embeddings to the text embeddings. Another example presents contrastive learning of medical visual representations from paired images and text [22].

#### **3. Self-supervised contrastive learning models**

The goal of contrastive learning is to learn such an embedding space in which similar sample pairs stay close to each other while dissimilar ones are far apart. Implemented using Siamese networks, recent approaches create two different augmentations of samples and feed into the networks for contrastive learning. While SimCLR [14] and MoCo [15] use the negative samples directly along with the positive ones, BYOL [16] and SimSiam [17] achieved similar performance just with the positive samples. Differently, SwAV [23] forced consistency between cluster assignments of augmentations, instead of comparing features directly. Shortly after, vision transformers were included in self-supervised learning architectures [24, 25]. According to the results, not only image classification, but also object detection and semantic segmentation as downstream tasks benefit from self-supervised contrastive learning. Let us briefly explain some of these main approaches.

#### **3.1 SimCLR**

Let us describe SimCLR [14] first; then we will describe the other methods by comparing them to previous ones. SimCLR uses both positive and negative samples, but being positive or negative does not correspond to actual class labels. Augmented versions of the anchor are taken as positives, whereas samples belonging to different instances are taken as negatives (**Figure 6**).

• Let $T$ be the set of image transformation operations, where $t \sim T$ and $t' \sim T$ are two different transformation operators independently sampled from $T$. These transformations are random cropping and resizing, random Gaussian blur, and random color distortion. A pair $(\tilde{x}_i, \tilde{x}_j)$ of query and key views is positive when the two views are created by applying different transformations to the same image $x$: $\tilde{x}_i = t(x)$ and $\tilde{x}_j = t'(x)$.

#### **Figure 6.**

*SimCLR framework [14], where two separate data augmentations are sampled from a predefined family of augmentations (crop, blur, color jitter). An encoder network $f(\cdot)$ and a projection head $g(\cdot)$ are trained to maximize the agreement between the embeddings of these two samples. When the self-supervised learning is over, the projection head can be thrown away.*


SimCLR uses the contrastive loss given in Eq. (1). This is a categorical cross-entropy loss to identify the positive sample among a set of negative samples (inspired by InfoNCE [27]).

$$L^{self} = \sum\_{i \in I} L\_i^{self} = -\sum\_{i \in I} \log \frac{\exp\left(\mathbf{z}\_i \cdot \mathbf{z}\_j/\tau\right)}{\sum\_{a \in A(i)} \exp\left(\mathbf{z}\_i \cdot \mathbf{z}\_a/\tau\right)}\tag{1}$$

$N$ images are randomly taken from the dataset. Thus, the training batch consists of $2N$ images to which data augmentations are randomly applied. Let $i \in I \equiv \{1, \ldots, 2N\}$ be the index of an arbitrary augmented sample; then $j$ is the index of the other augmentation of the same original image. $\tau \in \mathbb{R}^+$ is a scalar temperature parameter, $\cdot$ represents the dot product, and $A(i) \equiv I \setminus \{i\}$. We call index $i$ the anchor, index $j$ the positive, and the other $2(N-1)$ indices the negatives. The denominator has a total of $2N - 1$ terms (one positive and $2N - 2$ negatives).
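The loss of Eq. (1) can be sketched in a few lines of NumPy. This is our illustration, not the authors' implementation: it assumes a batch layout in which rows $2k$ and $2k{+}1$ are the two augmented views of image $k$, and random vectors stand in for encoder outputs.

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """Contrastive loss of Eq. (1) for 2N embeddings in which rows
    (2k, 2k+1) are the two augmentations of the same image."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                     # exclude the anchor itself
    n2 = len(z)
    pos = np.arange(n2) ^ 1                            # index of each row's positive
    log_prob = sim[np.arange(n2), pos] - np.log(np.sum(np.exp(sim), axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                           # N = 4 images -> 2N = 8 views
print(nt_xent(z))
```

As a quick sanity check, a batch whose positive pairs are identical vectors yields a lower loss than a batch of unrelated random vectors, since the positive similarity term is then maximal.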

A common protocol to evaluate self-supervised model efficiency is to place a linear classifier on top of the (frozen) layers learned by self-supervised training and train it for the downstream task with the labeled data. If the performance gap between this self-supervised encoder + linear classifier and a fully-supervised model is small, the self-supervised training is considered efficient. An alternative evaluation protocol uses semi-supervised learning, i.e. the pretrained network is re-trained as a whole with a certain percentage of the available labels. Experiments reveal that re-training with only 10% of the labeled data achieves a performance (92.8%) very close to the fully-supervised training performance on the whole dataset (94.2%), as reported in [14] (performances are top-5 classification accuracy on the ImageNet dataset for ResNet-50).

#### **3.2 Momentum contrast (MoCo)**

Contrastive methods based on the InfoNCE loss tend to work better with a high number of negative examples, since negative examples may represent the underlying distribution more efficiently. SimCLR requires large batches (4096 samples) to ensure there are enough negatives, which demands high computation power (8 V100 GPUs in their study). To alleviate this need, MoCo [15] uses a dictionary of negative representations structured as a FIFO queue. This queue-based dictionary enables the reuse of representations from immediately preceding mini-batches of data. Thus, the main advantage of MoCo compared to SimCLR is that MoCo decouples the batch size from the number of negatives. SimCLR requires a large batch size and suffers performance drops when the batch size is reduced.

Given a query sample $x_q$, we get a query representation through the online encoder: $q = f_q(x_q)$. A list of key representations $\{k_0, k_1, k_2, \ldots\}$ coming from the dictionary is encoded by a different encoder: $k_i = f_k(x_k^i)$, as shown in **Figure 7**. Naming the two

#### **Figure 7.**

*MoCo framework [15]. The encoder that takes negative samples (from a FIFO queue) is not updated by backpropagation but with a momentum-weighted moving average of the other encoder's parameters; that is why it is called the momentum encoder.*

compared representations as query and key is new, but one can think of them as the two augmentations of the same sample in SimCLR (**Figure 6**).

Let us assume that there is a single positive key, $k_+$, in the dictionary that matches $q$. Then, the contrastive loss with one positive and $N - 1$ negative samples becomes:

$$L_{MoCo} = -\log \frac{\exp\left(q \cdot k_{+}/\tau\right)}{\sum_{i=1}^{N} \exp\left(q \cdot k_{i}/\tau\right)}\tag{2}$$
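Eq. (2) can be checked numerically. A small sketch (values are illustrative) with L2-normalized representations, where key 0 is forced to match the query exactly, so the loss should come out near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, N, d = 0.07, 8, 16  # temperature, dictionary size, feature dimension

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

q = normalize(rng.normal(size=d))
keys = normalize(rng.normal(size=(N, d)))
keys[0] = q  # make key 0 the positive (perfectly matching) key

logits = keys @ q / tau          # q . k_i / tau for each key
logits -= logits.max()           # numerical stability; loss is unchanged
loss = -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

With the positive key identical to the query, its similarity (1.0) dominates the random negatives after division by the small temperature, so the cross-entropy is close to zero.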

Of the two encoders defined above, $f_k$ cannot be updated by backpropagation, since its keys come from the queue. Copying the online encoder's ($f_q$) weights to $f_k$ could be a solution; however, MoCo instead proposes a momentum-based update:

$$
\theta\_k \gets m\theta\_k + (1-m)\theta\_q \tag{3}
$$

where $m \in [0, 1)$ is the momentum coefficient and $\theta_q$ and $\theta_k$ are the parameters of $f_q$ and $f_k$, respectively (**Figure 7**).
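Eq. (3) is a one-line exponential moving average. As a sketch with toy parameter vectors (hypothetical values): if the online parameters stayed fixed, repeated updates would pull the momentum encoder toward them at a rate governed by $m$.

```python
import numpy as np

m = 0.999  # momentum coefficient, as in Eq. (3)

# Toy parameter vectors standing in for the two encoders' weights.
theta_q = np.array([1.0, 2.0, 3.0])  # online encoder (updated by SGD)
theta_k = np.zeros(3)                # momentum encoder

def momentum_update(theta_k, theta_q, m):
    # theta_k <- m * theta_k + (1 - m) * theta_q  (no backprop into f_k)
    return m * theta_k + (1 - m) * theta_q

for _ in range(5000):
    theta_k = momentum_update(theta_k, theta_q, m)

# If theta_q stays fixed, theta_k = theta_q * (1 - m**n) after n updates,
# i.e. about 0.993 * theta_q here: a slow, smooth drift toward theta_q.
```

A large $m$ (MoCo uses 0.999) makes the key encoder evolve slowly, which keeps the representations stored in the queue consistent with each other.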

Later on, two design choices from SimCLR, namely the MLP projection head and stronger data augmentation, were integrated into the approach, resulting in MoCo-v2 [28].

#### **3.3 Bootstrap your own latent (BYOL)**

Different from the approaches above, BYOL [16] achieves similar representation performance without using negative samples. It relies on two different interacting neural networks (in contrast to SimCLR but similar to MoCo), referred to as the online and target networks. The online network has a predictor head; the target network has the same architecture as the online network except for the predictor head (**Figure 8**). Parameters of the target network are not updated with backpropagation, but with a moving average of the online network's weights, just as MoCo does for the momentum encoder.

It is curious how the model escapes collapse (i.e., a trivial solution of a fixed vector for every sample) when no negative samples are used. The authors of BYOL attributed this to the momentum update, but it was later discovered (with SimSiam [17]) that using a stop-gradient operation and a predictor head is enough.

#### **Figure 8.**

*Comparison of some Siamese architectures: SimCLR, BYOL and SimSiam. Dashed lines indicate backpropagation. Components colored in red are no longer needed in SimSiam. Figure is reproduced based on [17].*

#### **3.4 Simple Siamese (SimSiam)**

BYOL needs to maintain two copies of weights for the two separate networks, which can be resource demanding. SimSiam [17] solves this problem with parameter sharing between the networks (with and without the predictor head). The encoder $f(\cdot)$ shares weights while processing the two views. A prediction MLP head, denoted as $g(\cdot)$, transforms the output of only one view. Thus, two augmented views ($x_1$ and $x_2$) result in two outputs: $p_1 = g(f(x_1))$ and $z_2 = f(x_2)$. Their negative cosine similarity is denoted as $D(p_1, \text{stopgrad}(z_2))$, where the stopgrad operation is an important component. It means that $z_2$ is treated as a constant term and the encoder receives no gradients from $z_2$; the gradient only flows back to the encoder through the prediction head.

Finally, the total loss, based on negative cosine similarity, is computed in a symmetric fashion:

$$L_{SimSiam} = \frac{1}{2}D\left(p_1, \text{stopgrad}(z_2)\right) + \frac{1}{2}D\left(p_2, \text{stopgrad}(z_1)\right) \tag{4}$$
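Numerically, Eq. (4) is straightforward; the stopgrad operation only matters for backpropagation, not for the loss value. A small numpy sketch with hypothetical, nearly aligned outputs (as one would expect from a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def D(p, z):
    """Negative cosine similarity; minimized (-1) when p and z align."""
    return -float(p @ z / (np.linalg.norm(p) * np.linalg.norm(z)))

# Hypothetical outputs for two augmented views of the same image:
# z_i = f(x_i) from the encoder, p_i = g(f(x_i)) from the predictor head.
z1 = rng.normal(size=32)
z2 = z1 + 0.1 * rng.normal(size=32)  # views of the same image: similar
p1 = z2 + 0.1 * rng.normal(size=32)  # a good predictor maps p1 close to z2
p2 = z1 + 0.1 * rng.normal(size=32)

# Eq. (4): during training the z arguments are wrapped in stopgrad, i.e.
# treated as constants; the loss value itself is unaffected by that.
loss = 0.5 * D(p1, z2) + 0.5 * D(p2, z1)
```

For well-aligned views the loss approaches its minimum of -1; in training, only the `p` arguments carry gradients back to the encoder.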

**Figure 8** compares SimSiam with SimCLR and BYOL. Unlike SimCLR and MoCo, SimSiam [17] does not use negative samples. The success of SimSiam also shows that a momentum encoder (or any sort of moving-average update of weights) is not needed: the stop-gradient operation and the predictor head are enough to prevent the model from collapsing.

SimSiam also presents transfer-learning results for object detection and semantic segmentation downstream tasks. Results reveal that starting with self-supervised pre-training on ImageNet outperforms supervised image-classification pre-training on ImageNet.

#### **3.5 Self-supervised vision transformers**

Caron *et al*. [24] proposed another Siamese architecture where one network's parameters are updated with a moving average of the other's parameters. More interestingly, they replaced the encoder CNNs with vision transformers and reported increased success for various downstream tasks. Shortly after, Li *et al*. [25] proposed a more efficient vision transformer architecture together with a new pre-training task based on region matching.

#### **4. Case study: semantic mask representation learning**

As a case study, we employ self-supervised contrastive learning to learn representations of semantically segmented images, i.e., semantic masks. This learning task is especially useful when two scenes are compared according to their semantic content. A use case is image-retrieval-based localization, where the standard approach extracts features from RGB images and compares them to find the most similar image in the database [29, 30]. Recently, several studies showed that checking the semantic resemblance between query and database images, and using this similarity score while retrieving images, improves localization accuracy [31–33]. The reason for the improvement is the appearance difference between images taken at different times (query vs. database) due to illumination differences, viewpoint variations and seasonal changes.

#### **Figure 9.**

*The image on the top-left was taken in 2008 and the image on the top-right in 2019 (source: Google Street View); they respectively represent the query and database for image retrieval. Observe the illumination differences, viewpoint variations and changing objects. The bottom row shows their semantic segmentation results. Semantic similarity can help to verify/deny the localization result.*

Although RGB image features are directly affected by those changes, semantic labels are stable most of the time (**Figure 9**).

Given a semantic mask, obtaining the most similar result among the alternatives is not a trivial task: SIFT-like features do not exist to match. Moreover, two masks of the same scene are far from identical, not only because of changing content but also due to camera position and viewpoint variations. Thus, instead of employing a pixel-by-pixel label comparison score, a trainable semantic feature extractor is preferable.

Measuring semantic similarity to distinguish whether two images belong to the same scene is a task especially suitable for self-supervised learning, because preparing labeled datasets in which query and database images show the same scene but are different images (preferably with a long-term difference) is not easy. In contrast, a large amount of semantic masks can easily be obtained for self-supervised training. We do not need ground-truth masks, since a successful estimation is enough to compute semantic similarity.

#### **4.1 Dataset and self-supervised contrastive training**

Our unsupervised learning dataset is composed of 3484 images randomly taken from the UCF dataset [34]. These are perspective images obtained from Google Street View panoramas that were taken in Pittsburgh, PA before 2014. Our supervised training and test datasets have query-database image pairs. Query images were also taken from the UCF dataset (not coinciding with the 3484 images mentioned above). Database images were collected again from Google Street View panoramas at the same locations as the query images but in 2019. This time gap results in seasonal changes and illumination variances. Also, a wide camera baseline between the database and query images conforms better to the long-term localization scenario [35]. The top row in **Figure 9** shows an example query-database image pair with time difference.

#### *Self-Supervised Contrastive Representation Learning in Computer Vision DOI: http://dx.doi.org/10.5772/intechopen.104785*

Since our aim is to learn representations for semantic masks, we first automatically generated a semantic mask for each image in our dataset using a well-performing CNN model [36]. The CNN model we employed was trained on Cityscapes [37], an urban scene understanding dataset consisting of 30 visual classes. Examples are in **Figure 9** (bottom row). After this point, we only have semantic masks in our dataset.

We used SimCLR [14] as our contrastive learning model and trained a ResNet-18 as the encoder. The encoder network (**Figure 10**) produces features $h = Enc(x) \in \mathbb{R}^{512}$, whereas the projection network produces $z = Proj(h) \in \mathbb{R}^{512}$. We set the batch size to 85, resized the semantic masks to 64 × 80 resolution (due to a GPU memory limitation), and used two data augmentation methods during training: random resized crop and random rotation. We set the maximum rotation parameter to 3°, since severe rotations are not expected between query and database images. The crop parameter, however, is important for representing the variation in our dataset; results for varying crop parameter values will be discussed in Section 4.2. Augmentation of semantic masks is visualized in **Figure 10**. Other augmentations (such as color jitter, horizontal flip, brightness and contrast distortions), which are common for image classification and object detection downstream tasks, are not included since they are not expected distortions for semantic masks. We used the AMSGrad optimizer, a variant of Adam.
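The random-resized-crop augmentation for masks can be sketched as follows. This is an illustrative numpy implementation (function and parameter names are ours, not from a library), using nearest-neighbor resizing so that class labels are never blended during interpolation:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_resized_crop(mask, min_ratio=0.6):
    """Crop a random sub-region covering at least `min_ratio` of the mask's
    area, then resize it back to the original size with nearest-neighbor
    interpolation (so class labels are never interpolated)."""
    h, w = mask.shape
    area_ratio = rng.uniform(min_ratio, 1.0)
    scale = np.sqrt(area_ratio)  # side ratio giving the desired area ratio
    ch, cw = max(1, int(h * scale)), max(1, int(w * scale))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = mask[top:top + ch, left:left + cw]
    # Nearest-neighbor resize back to (h, w):
    rows = (np.arange(h) * ch / h).astype(int)
    cols = (np.arange(w) * cw / w).astype(int)
    return crop[np.ix_(rows, cols)]

mask = rng.integers(0, 30, size=(64, 80))         # toy mask with 30 classes
view1 = random_resized_crop(mask, min_ratio=0.6)  # two crops of the same
view2 = random_resized_crop(mask, min_ratio=0.6)  # mask form a positive pair
```

`min_ratio` here plays the role of the minimum crop ratio parameter studied in Section 4.2.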

The CNN model, trained as explained above, is now ready to produce a similarity score when two semantic masks (one query and one database) are given. After self-supervised training, the same network can be fine-tuned with a labeled dataset (query and database segmentation masks of the same scene). For this purpose, we prepared a dataset of 368 query images with their corresponding database images and extracted their semantic masks. **Figure 9** shows an example of this preparation. Not surprisingly, this paired dataset is much smaller than the self-supervised training dataset. Here, the common practice in the literature is that the projection head (**Figure 10**) is removed

#### **Figure 10.**

*Illustration of training a CNN model with a self-supervised contrastive loss on a dataset that consists of semantically segmented masks. A positive pair is created from two randomly augmented views of the same mask, while negative pairs are created from views of other masks. All masks are encoded by a shared encoder and projection heads before the representations are evaluated by the contrastive loss function.*

after pretraining, and a classifier head is added and trained with labeled data for the downstream task. However, our pretext and downstream tasks are the same: we learn semantic representations by treating each sample as its own class (exemplar-CNN [6], instance discrimination [7]). Thus, we do not place a classifier head, but retrain the network (partially or fully).

#### **4.2 Experimental results**

To measure the capability of representing semantic masks, we conduct experiments that compare the retrieval accuracies of three training schemes. The first is the CNN model trained with the supervised training set (368 query-database pairs); this is the baseline model that does not exploit self-supervised training at all. The second is the CNN model trained in a self-supervised fashion with 3484 individual semantic masks (no matching pairs). Lastly, the model with self-supervised training is retrained with the supervised training set, in two versions: *i*) only replacing the dense layers and training them, *ii*) retraining all layers.

Trained models are tested on a test set consisting of 120 query-database pairs (different from the 368 pairs used in training). Performances are compared with the Recall@N metric. According to this metric, for a query image, the retrieval is considered successful if any of the top-N retrieved database images is a correct match. In other words, Recall@1 is the recall when only the top-most retrieval is checked.
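The Recall@N computation can be sketched as follows (a minimal illustration assuming query *i*'s correct database match sits at index *i*):

```python
import numpy as np

def recall_at_n(sim, n):
    """sim[i, j] = similarity between query i and database image j; the
    correct match for query i is assumed to sit at database index i."""
    topn = np.argsort(-sim, axis=1)[:, :n]  # indices of top-n retrievals
    hits = [i in topn[i] for i in range(sim.shape[0])]
    return float(np.mean(hits))

# Toy similarity matrix: queries 0-2 rank their correct match first,
# while query 3 ranks database image 0 above its correct match 3.
sim = np.eye(4)
sim[3, 0], sim[3, 3] = 0.9, 0.5

r1 = recall_at_n(sim, 1)  # query 3 misses at the top-most retrieval
r2 = recall_at_n(sim, 2)  # its correct match appears within the top 2
```

Here `r1` is 0.75 and `r2` is 1.0, matching the definition above: a larger N can only keep or increase the recall.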

We observe in **Table 1** that supervised training alone is not very successful. In fact, for certain N values, self-supervised training managed to outperform supervised training alone. This shows the power of self-supervised learning when a large dataset is provided: our unlabeled dataset is much larger than the labeled dataset (3484 ≫ 368). Regarding the two fine-tuning schemes, replacing the dense layers and training them from scratch improved on self-supervised training, but not for all N values. On the other hand, fine-tuning all layers worked best by a considerable margin. Since our pretext and downstream tasks are the same (i.e., we do not train a classification head), it is not surprising that replacing the dense layers did not help much. **Figure 11** shows several examples where supervised training fails but the proposed self-supervised approach (after fine-tuning) succeeds.

**Table 2** presents the effect of the minimum crop ratio parameter used in the data augmentation module. Since it is an important parameter for representing the variation in our semantic masks, we compare the performance for minimum crop ratios from 0.9 to 0.1. Apart from individual Recall@N values, we also compute and plot the mean recall (mean over all N values) in the last column of **Table 2** and in **Figure 12**. We observe that it is


**Table 1.**

*Only supervised training is compared with self-supervised training and fine-tuned versions of it.*


#### **Figure 11.**

*Each row shows a retrieval result for a given query (left column). Examples show the cases where only supervised training (middle column) fails at Recall@1, but utilizing self-supervised training and then fine-tuning on the labeled dataset (query-database pairs) correctly retrieves (last column).*


#### **Table 2.**

*Effect of the minimum crop ratio parameter in data augmentation at the stage of retraining of the self-supervised model.*

**Figure 12.** *Mean Recall@N values for varying minimum crop ratio parameter. Observe the reverse U-shape with a peak at 0.6.*

highest around 0.6 and 0.7. Performance gradually drops as we increase or decrease the minimum crop ratio. A minimum random crop parameter of 0.6 means that the cropped mask covers at least 60% of the area of the original mask. Since query and database masks in our training and test datasets have a considerable overlap ratio, it is reasonable that overlaps of 0.6 or higher serve best. This result is also in accordance with the finding in [38] that there is a reverse U-shape relationship between performance and the mutual information within augmented views. When crops are close to each other (high mutual information, e.g., crop ratio = 0.9), the model does not benefit from them much. On the other hand, for low crop ratios (low mutual information), the model cannot learn well since the views look quite different from each other. Peak performance lies somewhere in between.

#### **5. Conclusions**

In this chapter, we presented the main concepts in self-supervised contrastive learning and reviewed the approaches that attracted attention due to their success in computer vision. Contrastive learning, which aims at an embedding space where similar samples stay close to each other, has been implemented successfully with Siamese neural networks. The necessity for huge computation power has also been alleviated by the most recent models. Currently, for common downstream tasks of computer vision such as object detection and semantic segmentation, self-supervised pre-training is a better alternative to using a model trained on ImageNet for image classification.

We also presented a case study where self-supervised contrastive learning is applied to learn representations of the semantic masks of images. Performance was evaluated on an image retrieval task where the most similar semantic mask is retrieved from the database for a given query. Consistent with the results on other vision tasks in the literature, fine-tuning the self-supervised model with the available labeled data gave better results than supervised training alone.

#### **Acknowledgements**

This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant No. 120E500 and also under 2214-A International Researcher Fellowship Programme.

#### **Abbreviations**


### **Author details**

Yalin Bastanlar\* and Semih Orhan Department of Computer Engineering, Izmir Institute of Technology, Izmir, Turkey

\*Address all correspondence to: yalinbastanlar@iyte.edu.tr

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **References**

[1] Bromley J, Guyon I, LeCun Y, Sackinger E, Shah R. Signature verification using a Siamese time delay neural network. Advances in Neural Information Processing Systems. 1993:737-744

[2] Becker S, Hinton GE. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature. 1992; **355**:161-163

[3] Khosla P, Teterwak P, Wang C, Sarna A, Tian Y, Isola P, et al. Supervised contrastive learning. Advances in Neural Information Processing Systems. 2020

[4] Le-Khac PH, Healy G, Smeaton AF. Contrastive representation learning: A framework and review. IEEE Access. 2020

[5] Tekir S, Bastanlar Y. Deep learning: Exemplar studies in natural language processing and computer vision. In: Data Mining - Methods, Applications and Systems. London: InTechOpen; 2020. DOI: 10.5772/ intechopen.91813

[6] Dosovitskiy A, Springenberg JT, Riedmiller M, Brox T. Discriminative unsupervised feature learning with convolutional neural networks. Advances in Neural Information Processing Systems. 2014

[7] Wu Z, Xiong Y, Yu SX, Lin D. Unsupervised feature learning via nonparametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018

[8] Gidaris S, Singh P, Komodakis N. Unsupervised representation learning by predicting image rotations. ICLR. 2018

[9] Noroozi M, Favaro P. Unsupervised learning of visual representations by solving jigsaw puzzles. European Conference on Computer Vision (ECCV). 2016

[10] Zhang R, Isola P, Efros A. Colorful Image Colorization. European Conference on Computer Vision (ECCV). 2016

[11] Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA. Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016

[12] Ledig C, Theis L, Huszar F, Caballero J, Cunningham A, Acosta A, et al. Photo-realistic single image superresolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017

[13] Jing L, Tian Y. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021;**43**(11): 4037-4058. DOI: 10.1109/ TPAMI.2020.2992393

[14] Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning (ICML). 2020

[15] He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020


[16] Grill JB, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems. 2020

[17] Chen X, He K. Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021

[18] Srivastava N, Mansimov E, Salakhutdinov R. Unsupervised learning of video representations using LSTMs. International Conference on Machine Learning. 2015

[19] Qian R, Meng T, Gong B, Yang MH, Wang H, Belongie S, et al. Spatiotemporal contrastive video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021

[20] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning. 2021

[21] Li B, Weinberger KQ, Belongie S, Koltun V, Ranftl R. Language-driven semantic segmentation. International Conference on Learning Representations (ICLR). 2022

[22] Zhang Y, Jiang H, Miura Y, Manning CD, Langlotz CP. Contrastive learning of medical visual representations from paired images and text. arXiv:2010.00747v1. 2020

[23] Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural

Information Processing Systems (NeurIPS). 2020

[24] Caron M, Touvron H, Misra I, Jegou H, Mairal J, Bojanowski P, et al. Emerging properties in self-supervised vision transformers. International Conference on Computer Vision (ICCV). 2021

[25] Li C, Yang J, Zhang P, Gao M, Xiao B, Dai X, et al. Efficient selfsupervised vision transformers for representation learning. International Conference on Learning Representations (ICLR). 2022

[26] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016

[27] van den Oord A, Li Y, Vinyals O. Representation learning with contrastive predictive coding. arXiv:1807.03748. 2018

[28] Chen X, Fan H, Girshick R, He K. Improved baselines with momentum contrastive learning. arXiv:2003.04297. 2020

[29] Arandjelovic R, Gronat P, Torii A, Pajdla T, Sivic J. NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016

[30] Ge Y, Wang H, Zhu F, Zhao R, Li H. Self-supervising fine-grained region similarities for large-scale image localization. European Conference on Computer Vision (ECCV). 2020

[31] Cinaroglu I, Bastanlar Y. Long-term image-based vehicle localization improved with learnt semantic descriptors. Engineering Science and

Technology, an International Journal. 2022;**35**:101098

[32] Cinaroglu I, Bastanlar Y. Training semantic descriptors for image-based localization. ECCV Workshop on Perception for Autonomous Driving. 2020

[33] Orhan S, Guerrero JJ, Bastanlar Y. Semantic pose verification for outdoor visual localization with self- supervised contrastive learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. 2022

[34] Zamir AR, Shah M. Image geolocalization based on multiple nearest neighbor feature matching using generalized graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2014;**36**(8):1546-1558

[35] Sattler T, Maddern W, Toft C, Torii A, Hammarstrand L, Stenborg E, et al. Benchmarking 6DOF outdoor visual localization in changing conditions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018

[36] Sun K, Zhao Y, Jiang B, Cheng T, Xiao B, Liu D, et al. High-resolution representations for labeling pixels and regions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2019

[37] Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, et al. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016

[38] Tian Y, Sun C, Poole B, Krishnan D, Schmid C, Isola P. What makes for good views for contrastive learning? Advances in Neural Information Processing Systems (NeurIPS). 2020

#### **Chapter 7**

## Analysis of Brain Computer Interface Using Deep and Machine Learning

*Nabil Ajali-Hernández and Carlos M. Travieso-Gonzalez*

#### **Abstract**

Pattern recognition is becoming an increasingly important topic in all sectors of society, from the optimization of industrial processes to the detection and diagnosis of diseases in medicine. Brain-computer interfaces are introduced in this chapter: systems capable of analyzing brain signal patterns and processing and interpreting them through machine and deep learning algorithms. In this chapter, a hybrid deep/machine learning ensemble system for brain pattern recognition is proposed. It is capable of recognizing patterns and translating the decisions to BCI systems. For this, a public database (Physionet) with data on motor tasks and mental tasks is used. The chapter consists of a brief summary of the state of the art, the presentation of the model together with some results, and some promising conclusions.

**Keywords:** brain-computer interfaces, deep learning, machine learning, pattern recognition, artificial intelligence, neural network

#### **1. Introduction**

The brain is the most important organ in the human body. It processes, integrates and coordinates the information it receives from the organs and the senses and makes decisions, sending them to the rest of the body, like a processor in a computer. The brain works through electrochemical impulses, transmitted across synapses, which allow the transmission of information between neurons [1–3].

These impulses can be classified by their frequency into different types of brain waves: delta (1–3 Hz), theta (3–8 Hz), alpha (8–13 Hz), beta (13–30 Hz), and gamma (30–100 Hz) waves [4]. These brain waves are the reflection of electrical activity (in microvolts) and therefore of thoughts and motor intentions. They can be captured by the electroencephalogram (EEG), and their study can lead to the detection of brain-related pathologies (Alzheimer's, Parkinson's, epilepsy) [5, 6].
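As a quick illustration of the bands above, a frequency can be mapped to its band name (a sketch of ours; since the listed ranges share endpoints, boundary frequencies are assigned to the higher band here by an arbitrary choice):

```python
def eeg_band(freq_hz):
    """Return the brain-wave band for a frequency in Hz, following the
    ranges listed above: delta (1-3), theta (3-8), alpha (8-13),
    beta (13-30), gamma (30-100)."""
    if freq_hz < 1 or freq_hz > 100:
        return "outside the listed bands"
    for upper, name in [(3, "delta"), (8, "theta"), (13, "alpha"), (30, "beta")]:
        if freq_hz < upper:
            return name
    return "gamma"
```

For example, `eeg_band(10)` falls in the alpha band and `eeg_band(40)` in the gamma band.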

**Figure 1** shows the EEG of a normal person, where 64 channels have been placed over the head (10–20 system) and brain waves are monitored over time. It can be seen that there is a multitude of patterns that provide important information about what is happening.

#### **Figure 1.** *Excerpt from a normal electroencephalogram. Channels on the left and brain waves monitored.*

As a result of technological advances in both software and hardware, concepts such as artificial intelligence, machine learning and deep learning have been developed in recent decades. This has allowed an evolution in many fields of society. In the field of brain signals, pattern recognition is of vital importance, both for diagnosis and for the development of applications that improve quality of life, or simply for the development of mind-controlled tools or games. Thus, brain-computer interfaces were born.

A brain-computer interface (BCI), or brain-machine interface, is a system based on the recording or acquisition of the brain signal, linked by direct communication (wired or wireless) to a machine or computer capable of interpreting and transforming thoughts or intentions into actual actions [7]. **Figure 2** shows the complete process of a BCI: first, the brain signals are obtained and processed; subsequently, the machine associated with the system receives and interprets these signals and produces a response.

In this way, using a public database called Physionet, which consists of 109 subjects performing motor-intention tests, we are working on a BCI system whose objective is to recognize the patterns of these imagined movements and transfer them to a robotic arm that moves accordingly. Our contribution to the current state of the art is a new method that takes into account the immediate previous mental state (IPMS) of each subject to improve pattern recognition.

*Analysis of Brain Computer Interface Using Deep and Machine Learning DOI: http://dx.doi.org/10.5772/intechopen.106964*

**Figure 2.** *Brain computer Interface workflow.*

#### **2. Pattern recognition with immediate previous mental state**

#### **2.1 State of art**

Over the last 10–15 years and until 2022, many studies have led to great advances in pattern recognition, human-computer interaction and applications that use task classification or event-related potentials (P300). Machine and deep learning algorithms and techniques have developed as computers and systems have improved over the years. They use different characteristics extracted from brain-wave data to classify or recognize certain patterns, achieving ever better success in pattern recognition and prediction [8–11]. For example, in 2018, Bird et al. predicted mental states of relaxation, neutrality and concentration [12], as well as emotional mental states such as negative, neutral and positive [13].

In the field of EEG, a large number of studies have been carried out in those years, addressing issues such as EEG channel selection, methods for optimal feature extraction, and types of classifiers to predict patterns. For example, Feng et al. [14] published in 2019 an optimized channel selection method based on multi-frequency CSP-Rank for BCI systems using motor imagery. Jiménez et al. [15] propose an upper-limb device to assist in the rehabilitation of people with cerebrovascular accident or people with a disability or amputation. There are even lines of research that have begun to focus on the creation of brain-machine systems capable of acting under the orders of thought.

In 2020, entrepreneur Elon Musk and his company Neuralink created an implantable BCI system to control devices with the mind, which was successfully implanted in a pig in 2021. In addition, he announced that a monkey had successfully been made to play video games using this device [16]. The downside is that these BCI devices are invasive and still in the early stages of development.

All these advances depend on the nature of the problem. In many cases, the EEG image is taken directly and the problem is solved by using Common Spatial Patterns (CSP). Other times, 3D or 2D matrices are created to evaluate the problem, and deep learning methods such as Convolutional Neural Networks (CNN) combined with Long Short-Term Memory (LSTM) networks are applied.

The advantage of these techniques is the automation and success of the learning process. But, on the other hand, the disadvantage is the cost in terms of computation and the large amount of data that is required.

Normally, all systems work using a common scheme, as shown in **Figure 2**, which has the following steps:

1. Signal acquisition (directly or from an EEG database).

2. Feature extraction.

3. Method selection for pattern recognition.

4. Train-validation-test, with feedback.

Subsequently, the range of success rates in brain-task classification across many works, together with the most important work to date, is presented.

In this chapter, a large number of state-of-the-art articles have been reviewed to extract the most important concepts and ideas in this field, in order to adequately explain and present the work carried out.

#### **2.2 Pattern recognition with immediate previous mental state**

As stated above, current works use machine learning and deep learning to recognize brain patterns and classify them. The success rate varies depending on the database used, the signal processing, the feature extraction and the types of classifiers used. Many of these works achieve success rates between 61 and 76% [17, 18].

The best work to date is that of Zhang et al. [19], who report classification success rates of at least 93% and up to 98% on the same database (Physionet). They achieve this through cascaded deep learning, combining a 3-layer CNN with recurrent neural networks (RNN) to capture the spatiotemporal characteristics of the experiments.

On the other hand, in previous work pending publication, we achieved up to 93% success using a mixture of machine learning classifiers. Taking this into account, we reviewed the state of the art and concluded, in line with several studies such as that of Roc et al. [9], that the success rate of mental-task classification in BCI systems is directly related to the subject in question and to his or her mental state at that moment (relaxed, altered, nervous).

#### *2.2.1 Our proposal*

For this reason, our proposal consists of following a generalized scheme in which the signal is acquired from the database, processed, and its features extracted, after which a classification is performed. However, we have decided to add as a variable the mental state of the subjects immediately prior to the moment of decision, see **Figure 3**; that is, the moment before making a decision. To do this, after signal processing and feature extraction, the mental tasks of various subjects are classified with several machine learning classifiers.

*Analysis of Brain Computer Interface Using Deep and Machine Learning DOI: http://dx.doi.org/10.5772/intechopen.106964*

**Figure 3.** *Proposal workflow scheme.*

This is then repeated taking the IPMS into account, and the results are compared. The aim is to demonstrate a significant improvement in pattern recognition in terms of the average classification success rate and its standard deviation: we expect the success rate to increase and the standard deviation to decrease. This study focuses not on the success rate per se but on the difference between taking the IPMS into account or not.
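The comparison just described amounts to a short computation over per-subject accuracies; the values below are made-up placeholders, not the chapter's results:

```python
import statistics

# Hypothetical per-subject classification accuracies (%), for illustration only.
acc_no_ipms = [68, 72, 65, 70, 74]   # without IPMS
acc_ipms    = [76, 78, 75, 77, 79]   # with IPMS

# Change in average success and the two standard deviations.
avg_change = statistics.mean(acc_ipms) - statistics.mean(acc_no_ipms)
std_no     = statistics.stdev(acc_no_ipms)
std_ipms   = statistics.stdev(acc_ipms)
```

Under the IPMS hypothesis one expects `avg_change` to be positive and `std_ipms` to be smaller than `std_no`.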

#### **2.3 Development of the BCI system**

#### *2.3.1 Signal acquisition*

To carry out this work, the signal has been acquired by downloading the public Physionet.org database, developed in collaboration with the developers of the BCI2000 system [20–22]. The database consists of 109 users who perform different types of tests, yielding more than 1500 EEG records acquired with a 10/20 system and a total of 64 electrodes. Each subject performs 14 tests of approximately 2 minutes each, alternating 4.1-s rest periods with different tasks (of between 4.1 and 4.2 s), such as opening and closing a fist or imagining these movements.

In order to prove the hypothesis, in this chapter, several simplifications have been done:

• First. A group of 10 subjects is chosen as representative group of the set to demonstrate if there is evidence of success in having taken the IPMS into account.


In this way, labeled data matrices are obtained for the experiments both with and without the IPMS. To obtain the IPMS, the difference between the interaction intervals (T1/T2) and the rest intervals prior to the motor imagery (T0) is computed.
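One minimal way to realize this difference, assuming a channels-by-samples array and treating the IPMS as the per-channel difference of mean amplitudes between the T1/T2 and T0 windows (an illustrative definition, not the chapter's exact formula):

```python
import numpy as np

def ipms_feature(eeg, rest_slice, task_slice):
    """Immediate Previous Mental State (IPMS) sketch: per-channel difference
    between the task interval (T1/T2) and the preceding rest interval (T0)."""
    rest = eeg[:, rest_slice].mean(axis=1)   # T0: rest before the task
    task = eeg[:, task_slice].mean(axis=1)   # T1/T2: motor-imagery interval
    return task - rest

rng = np.random.default_rng(1)
eeg = rng.standard_normal((64, 1312))        # 64 channels, ~8.2 s at 160 Hz
feat = ipms_feature(eeg, rest_slice=slice(0, 656), task_slice=slice(656, 1312))
```

The resulting vector (one value per electrode) can then be appended to the labeled matrices described above.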

#### *2.3.2 Feature extraction and pattern recognition*

After acquiring the signal, the next step when developing a BCI system is to perform an optimal feature extraction that allows good pattern recognition.

The first step is to remove the noise that masks the signal and thus obtain a better signal-to-noise ratio. When taking biological measurements, factors such as breathing, the heartbeat (low frequencies), or the electricity running through the circuit (high frequencies) add noise that can mask the signal. Considering the frequency range of brain signals (1–100 Hz), removing these unnecessary frequencies is a key step in EEG analysis. Therefore, based on multiple studies [24–26], a bandpass filter is applied between 0.5 Hz and 50 Hz in order to suppress the 50 Hz mains band and the low frequencies.
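Such a 0.5–50 Hz band-pass might be implemented as follows with SciPy; the filter order and the 160 Hz sampling rate of the Physionet records are assumptions of this sketch:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 160.0  # Physionet EEG motor-imagery records are sampled at 160 Hz

def bandpass(signal, low=0.5, high=50.0, order=4, fs=FS):
    """Zero-phase band-pass between 0.5 and 50 Hz, suppressing slow
    biological drift and the 50 Hz mains band, as described in the text."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

# Toy signal: a 10 Hz "brain" component plus 50 Hz mains interference.
t = np.arange(0, 2.0, 1.0 / FS)
raw = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)
clean = bandpass(raw)
```

The second-order-sections form (`output="sos"`) is used because it is numerically more robust than transfer-function coefficients for filters with a very low cutoff relative to the sampling rate.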

After noise removal, the discrete wavelet transform (DWT) is used. The DWT is a signal processing tool for multi-resolution analysis with variable time windows, capable of decomposing and recomposing signals in both time and frequency to facilitate analysis [27]. Brain signals have unpredictable frequency and intensity over time, so they are nonstationary. The DWT breaks a signal down with low-pass and high-pass filters at several levels, see **Figure 4**, yielding a high-frequency and a low-frequency component (with different information) at each level. These levels correspond to different types of brain waves (delta to beta) that carry different types of information and can facilitate pattern recognition.
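A multilevel decomposition of this kind can be sketched with the PyWavelets library; the choice of the db4 wavelet, the four levels, and the synthetic one-channel signal are assumptions of this sketch:

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(0)
signal = rng.standard_normal(640)  # 4 s of one EEG channel at 160 Hz

# Four-level DWT with a Daubechies wavelet (db4): each level splits the
# signal into a low-frequency approximation and a high-frequency detail,
# roughly matching the frequency bands (delta to beta) mentioned in the text.
coeffs = pywt.wavedec(signal, "db4", level=4)
cA4, cD4, cD3, cD2, cD1 = coeffs  # approximation + details, coarse to fine

# The coefficients allow (near-)perfect reconstruction of the signal:
reconstructed = pywt.waverec(coeffs, "db4")
```

Frequency-band features for classification can then be computed from the individual coefficient arrays rather than from the raw signal.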

Mathematically, the equation behind the DWT is given by Eq. (1), [28]:

$$f(t) = \sum_{k} \sum_{j} a_{j,k} \, \rho_{j,k}(t) \tag{1}$$

This equation is expressed in terms of two indices, the translation time *k* and the scaling index *j*. These two indices are integer values, and the wavelet functions form an orthogonal set of functions (a basis) [29].
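Given the orthogonality of the basis, the coefficients in Eq. (1) follow by projection (a standard wavelet identity, stated here in the chapter's notation, where $\rho$ denotes the wavelet basis function):

$$a_{j,k} = \int_{-\infty}^{\infty} f(t)\,\rho_{j,k}(t)\,dt, \qquad \rho_{j,k}(t) = 2^{j/2}\,\rho\!\left(2^{j}t - k\right)$$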


**Figure 4.** *Level 1 of decomposition using DWT. The original signal is separated into low and high frequencies.*

The Daubechies family of wavelets performs better at classification time, as shown by Alomari et al. [30] and others [20, 21].

At this point, feature extraction provides a better signal-to-noise ratio in the labeled matrices obtained in Section 2.3.1, together with a separation of the signal by frequency. This makes the subsequent pattern recognition with machine learning computationally less expensive: motor activities require an active state of mind, whose signature is the alpha and beta waves (8–30 Hz), which are already separated.

#### *2.3.3 Classification*

A series of machine learning classifiers are used to recognize motor imagery (MI) patterns and classify them. Each user has 3 real experiments and 3 MI experiments.

To test the hypothesis, the classifiers are applied to the MI experiments in order to observe whether performance improves when the IPMS is taken into account. The hold-out cross-validation method is used for training and testing, training each subject with 70% of the set and blindly testing with the remaining 30%. The classifiers used are:


We will not go into the mathematical details behind these classifiers, as they are widely referenced and used in pattern recognition and artificial intelligence; the references where this information can be found are attached.
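The hold-out procedure can be sketched with scikit-learn; since the list of classifiers is elided in the text above, SVM and k-NN stand in here as two representative choices, and the feature matrix is synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Synthetic feature matrix standing in for one subject's DWT features.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
y = rng.integers(0, 2, size=60)

# Hold-out validation: 70% train, 30% blind test, as in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for name, clf in [("SVM", SVC()), ("k-NN", KNeighborsClassifier())]:
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)  # fraction of correct test labels
```

Repeating this once per subject, with and without the IPMS feature, yields the per-classifier success rates that the following tables compare.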

#### **3. Results and discussion**

As explained in Section 2.2.1, the method to evaluate the effect of taking the IPMS into account is the average classification success and its standard deviation. **Table 1** shows the classification results without taking the IPMS into account:


#### **Table 1.**

*Classification without taking into account IPMS hypothesis.*


**Table 2** shows the classification results taking the IPMS into account:

It must be remembered that the aim is to observe an improvement in pattern recognition by applying the IPMS hypothesis; we are not focused on finding the best classifier.

#### **Table 2.**

*Classification taking into account IPMS hypothesis.*

**Table 3.**

*Comparison between no-IPMS and IPMS classification: average and standard deviation changes of both methods.*

After obtaining the results of the classifications with and without IPMS, the averages of both methods are compared in **Table 3**. In 100% of the cases the classification improves significantly, with improvements of up to 12%. This leads us to think that, for the recognition of brain patterns, taking into account the mental state prior to motor imagery is essential to obtain better performance.

#### **4. Conclusions**

In this chapter, a brief introduction to the field of brain signals has been given, explaining pattern recognition for the development of applications in fields such as medicine or industry and how to analyze brain signals for that purpose. Subsequently, BCIs have been introduced, explaining their operation and purpose, and the IPMS hypothesis has been proposed as an improvement to pattern recognition in BCI systems.

A public database (Physionet) containing EEG records of subjects performing a series of motor and imagery tasks is presented, and using this dataset the steps to develop a BCI system are followed: the signal is processed, features are extracted for pattern recognition, and finally a series of classifiers are used to test the IPMS hypothesis.

The results show evidence that taking into account the mental state prior to performing mental tasks directly affects the recognition of brain patterns and, consequently, the success in classifying them, improving it by up to 12%.

Note that this research has focused on testing this hypothesis and not on finding the best classifier. In future work, we will try to apply the IPMS hypothesis to the best state-of-the-art classifiers.

#### **Acknowledgements**

This work was funded by the "Agencia Canaria de Investigación, Innovación y Sociedad de la Información de la Consejería de Economía, Conocimiento y Empleo y por el Fondo Social Europeo (FSE) Programa Operativo Integrado de Canarias 2014-2020, Eje 3 Tema Prioritario 74 (85%)" of the "Gobierno de Canarias", Spain, under reference "TESIS2020010118".

#### **Appendices and nomenclature**


*Artificial Intelligence Annual Volume 2022*

### **Author details**

Nabil Ajali-Hernández\* and Carlos M. Travieso-Gonzalez University of Las Palmas de Gran Canaria (ULPGC), Las Palmas de Gran Canaria, Spain

\*Address all correspondence to: nabil.ajali101@alu.ulpgc.es

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


#### **References**

[1] CogniFit. CogniFit.com [Internet]. 2019. Available from: https://www.cognifit.com/es/cerebro

[2] Corbetta M, Shulman GL. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews. Neuroscience. 2002;**3**(3):201-215

[3] Hammond C. Cellular and Molecular Neurobiology. Deluxe ed. San Diego: Academic Press; 2001

[4] Buzsaki G. Rhythms of the Brain. Oxford: Oxford University Press; 2006

[5] Ramos-Argüelles F, Morales G, Egozcue S, Pabón RM, Alonso MT. Técnicas básicas de electroencefalografía: principios y aplicaciones clínicas. Spain: Canales del sistema sanitario de Navarra; 2009. pp. 69-82

[6] Barros MIM, Guardiola GT. Conceptos básicos de electroencefalografía. Duazary. 2006;**3**(1):18-23

[7] Krucoff MO, Rahimpour S, Slutzky MW, Edgerton VR, Turner DA. Enhancing nervous system recovery through neurobiologics, neural interface training, and neurorehabilitation. Frontiers in Neuroscience. 2016;**10**:584

[8] Gajic D, Djurovic Z, Di Gennaro S, Gustafsson F. Classification of EEG signals for detection of epileptic seizures based on wavelets and statistical pattern recognition. Biomedical Engineering Applications Basis and Communications. 2014;**26**(02):1450021

[9] Oh SL, Hagiwara Y, Raghavendra U, Yuvaraj R, Arunkumar N, Murugappan M, et al. A deep learning approach for Parkinson's disease diagnosis from EEG signals. Neural Computing and Applications. 2020;**32**(15):10927-10933

[10] Fan M, Yang AC, Fuh J-L, Chou C-A. Topological pattern recognition of severe Alzheimer's disease via regularized supervised learning of EEG complexity. Frontiers in Neuroscience. 2018;**685**:3-5

[11] Gurumoorthy S, Muppalaneni NB, Gao X-Z. Analysis of EEG to find Alzheimer's disease using intelligent techniques. In: Computational Intelligence Techniques in Diagnosis of Brain Diseases. Singapore: Springer; 2018. pp. 61-70

[12] Bird JJ, Manso LJ, Ribeiro EP, Ekart A, Faria DR. A study on mental state classification using eeg-based brainmachine interface. In: International Conference on Intelligent Systems (IS). Madeira: IEEE; 2018. pp. 795-800

[13] Bird JJ, Ekart A, Buckingham CD, Faria DR. Mental emotional sentiment classification with an eeg-based brainmachine interface. In: Proceedings of the International Conference on Digital Image and Signal Processing (DISP'19). Oxford: 2019

[14] Feng JK, Jin J, Daly I, Zhou J, Niu Y, Wang X, et al. An optimized channel selection method based on multifrequency CSP-rank for motor imagery-based BCI system. Computational Intelligence and Neuroscience. 2019:6-8

[15] Jiménez AR, Grisales CA, Sotelo JL. Diseño de un sistema cerebro-maquina de miembro superior para la asistencia a la rehabilitación de personas con accidente cerebro-vascular. Encuentro Int Educ en Ing. 2019

[16] Pérez E. Un mono jugando al Pong es la primera demostración de Neuralink, el proyecto de Elon Musk para conectar el cerebro con los ordenadores [Internet]. Xataka. 2021. Available from: https://www.xataka.com/investigacion/mono-jugando-al-pong-primera-demostracion-neuralink-proyecto-para-conectar-cerebro-ordenadores-elon-musk

[17] Major TC, Conrad JM. The effects of pre-filtering and individualizing components for electroencephalography neural network classification. In: SoutheastCon 2017. IEEE; 2017. pp. 1-6

[18] Wu SL, Liu YT, Hsieh TY, Lin YY, Chen CY, Chuang CH, et al. Fuzzy integral with particle swarm optimization for a motor-imagery-based brain–computer interface. IEEE Transactions on Fuzzy Systems. 2016; **25**(1):21-28

[19] Zhang D, Yao L, Chen K, Wang S, Chang X, Liu Y. Making sense of spatiotemporal preserving representations for EEG-based human intention recognition. IEEE Transactions on Cybernetics. 2019; **50**(7):3033-3044

[20] Physionet.org. PhysioNet [Internet]. 2019. Available from: https://www. physionet.org/physiobank/database/ eegmmidb

[21] Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation. 2000;**101**(23):215-220

[22] Schalk G, McFarland DJ, Hinterberger T, Birbaumer N, Wolpaw JR. BCI2000: A general-purpose brain-computer interface (BCI) system. IEEE Transactions on Biomedical Engineering. 2004;**51**(6):1034-1043

[23] Craik A, He Y, Contreras-Vidal JL. Deep learning for electroencephalogram (EEG) classification tasks: A review. Journal of Neural Engineering. 2019; **16**(3):31001

[24] Ji N, Ma L, Dong H, Zhang X. EEG signals feature extraction based on DWT and EMD combined with approximate entropy. Brain Sciences. 2019;**9**(8):201

[25] Medina B, Sierra JE, Ulloa AB. Técnicas de extracción de características de señales EEG en la imaginación de movimiento para sistemas BCI. Revista ESPACIOS. 2018:8-10

[26] Noguera MAP, Ortega CEM, Castro W, Ordoñez DH. Análisis De Señales EEG Para Detección De Intenciones Motoras Aplicadas A Sistemas BCI.

[27] Bhattacharya S, Haddad RJ, Ahad M. A multiuser EEG based imaginary motion classification using neural networks. In: SoutheastCon 2016. Norfolk: IEEE; 2016. pp. 1-5

[28] Burrus CS. Introduction to Wavelets and Wavelet Transforms: A Primer. Englewood Cliffs. New Jersey: Prentice Hall; 1997

[29] Wei D, Tian J, Wells RO, Burrus CS. A new class of biorthogonal wavelet systems for image transform coding. IEEE Transactions on Image Processing. 1998;**7**(7):1000-1013

[30] Alomari MH, AbuBaker A, Turani A, Baniyounes AM, Manasreh A. EEG mouse: A machine learning-based brain computer interface. International Journal of Advanced Computer Science and Applications. 2014;**5**(4):193-198

[31] Strobl C, Malley J, Tutz G. An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods. 2009;**14**(4):323

[32] McLachlan GJ. Discriminant Analysis and Statistical Pattern Recognition. New Jersey: John Wiley & Sons; 2005

[33] Cramer JS. The Origins of Logistic Regression. UK: Cambridge University Press; 2002

[34] Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995; **20**(3):273-297

[35] Altman NS. An introduction to kernel and nearest-neighbor nonparametric regression. American Statististics. 1992;**46**(3):175-185

[36] Sayad DS. An Introduction to Data Science [Internet]. 2021. Available from: https://injuryfacts.nsc.org/motorvehicle/overview/introduction/

[37] Natekin A, Knoll A. Gradient boosting machines, a tutorial. Frontiers in Neurorobotics. 2013;**7**:21

[38] Lv C, Xing Y, Zhang J, Na X, Li Y, Liu T, et al. Levenberg–Marquardt backpropagation training of multilayer neural networks for state estimation of a safety-critical cyber-physical system. IEEE Transactions on Industrial Informatics. 2017;**14**(8):3436-3446

#### **Chapter 8**

## Multi-Features Assisted Age Invariant Face Recognition and Retrieval Using CNN with Scale Invariant Heat Kernel Signature

*Kishore Kumar Kamarajugadda and Movva Pavani*

#### **Abstract**

Face recognition across aging has emerged as a significant research area due to applications such as law enforcement and security. However, matching human faces across different age gaps remains a bottleneck due to the appearance variations caused by the aging process. To mitigate this inconsistency, this chapter offers five sequential processes: Image Quality Evaluation (IQE), Preprocessing, Pose Normalization, Feature Extraction and Fusion, and Feature Recognition and Retrieval. Our method first performs IQE in order to evaluate image quality and thus increase the performance of Age Invariant Face Recognition (AIFR). Preprocessing comprises two steps, Illumination Normalization and Noise Removal, which result in high face recognition accuracy. Feature extraction adopts two descriptors, a Convolutional Neural Network (CNN) and the Scale Invariant Heat Kernel Signature (SIHKS): the CNN extracts texture features, and SIHKS extracts shape and demographic features. These features play a vital role in improving the accuracy of AIFR and retrieval. Feature fusion is established using the Canonical Correlation Analysis (CCA) algorithm, and a Support Vector Machine (SVM) is used to recognize and retrieve images. These processes are implemented on the FG-NET database using the MATLAB 2017b tool. Finally, we validate the performance of our work using seven metrics: Accuracy, Recall, Rank-1 Score, Precision, F-Score, Recognition Rate, and computation time.

**Keywords:** age-invariant face recognition, image quality evaluation, pose normalization, multiple feature extraction, recognition and retrieval

#### **1. Introduction**

As one of the most significant topics in computer vision and pattern recognition, face recognition has attracted much attention from both academia and industry over recent decades [1, 2]. With the evolution of neural networks, general face recognition technology emerged as a noteworthy research area [3–5]. However, identifying face images across a widespread range of ages remains a shortcoming due to human face appearance changes caused by the aging process [6, 7]. To achieve human face recognition across different ages, the Age-Invariant Face Recognition (AIFR) approach was developed [8]. AIFR recognizes faces using facial features extracted from human images, via three different models: generative, discriminative [9], and deep learning methods [10]. Generative approaches are based on age progression methods that convert the probe image to the same age as the gallery image [11]. However, generative schemes have several shortcomings [12]: optimizing recognition performance is not easy, and estimating accurate results is highly difficult, since they cannot handle the aging impact. Discriminative approaches [13] were introduced to resolve the discrepancies of the generative scheme [14]; they develop feature matching using local descriptors [15]. A multiple-descriptor AIFR was introduced to extract features from the periocular region [16], using two descriptors: the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF).

To achieve better results in AIFR, deep learning is integrated with the discriminative approach [17]. In deep learning, the Convolutional Neural Network (CNN) algorithm plays a vital role in recognizing faces across images of different ages [18]. Large age-gap verification is performed by injecting features into deep networks [19]: a deep CNN recognizes the face using texture features. An aging-model-based face recognition method for images of different ages was also introduced under deep learning [20], where a CNN descriptor is used to match images across ages.

From the aforesaid studies, we determine that many issues remain in recognizing faces across aging. The issues are discussed as follows:


These problems impose constraints on present AIFR systems and complicate the recognition and retrieval task, especially across images of different ages.

#### **1.1 Research contribution**

In order to tackle abovementioned issues, our work contributes the following processes:

• In order to reduce time wasted in preprocessing, we first execute a novel Image Quality Evaluation (IQE) method, which estimates an Image Quality Metric (IQM) for each image. Only if the IQM value is below the Image Quality Threshold (IQT) is preprocessing performed for that image; otherwise, the image goes directly to the pose normalization process.

*Multi-Features-Assisted Age Invariant Face Recognition and Retrieval Using CNN with Scale... DOI: http://dx.doi.org/10.5772/intechopen.104944*
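This quality gate can be sketched as a small routing function; the 0.5 threshold and the stage names are illustrative assumptions, not values from the chapter:

```python
def route_image(iqm, iqt=0.5):
    """Quality gate from the pipeline sketch: an image whose Image Quality
    Metric (IQM) falls below the Image Quality Threshold (IQT) is sent
    through preprocessing first; otherwise it skips straight to pose
    normalization. Threshold and stage names are hypothetical."""
    if iqm < iqt:
        return ["preprocessing", "pose_normalization"]
    return ["pose_normalization"]

low_quality = route_image(0.3)   # noisy image: gets preprocessed first
high_quality = route_image(0.8)  # clean image: skips preprocessing
```

The gate saves the cost of illumination normalization and noise removal on images that are already of acceptable quality.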


#### **1.2 Research outline**

The outline of this chapter is as follows: Section 2 deliberates state-of-the-art works on AIFR and their limitations. Section 3 exemplifies problems occurring in previous works related to AIFR. Section 4 explains our proposed work and algorithms in detail. Section 5 illustrates the numerical results obtained from our simulation environment and compares them with existing methods. Finally, Section 6 concludes our contribution and comments on future work.

#### **2. Related work**

This section discusses the state-of-the-art work related to AIFR along with their limitations. In this, we discussed works that comprise preprocessing, feature extraction, recognition, and retrieval processes.

Kishore et al. [21] suggested periocular-region-based AIFR using the Local Binary Pattern. Three sequential processes are executed to recognize faces: preprocessing, feature extraction, and classification. In preprocessing, enhancement and denoising are applied to each facial image. The Local Binary Pattern (LBP) descriptor [22] was used to extract features from the periocular region [23] of the given face image, which contains the eyes, eyelashes, and eyebrows. The chi-square distance was used as the classifier after feature extraction; however, it does not recognize faces accurately, since it is highly sensitive to sample size.

Nanni et al. [24] introduced an ensemble of texture descriptors and preprocessing techniques to recognize images effectually. Four face recognition processes are performed: preprocessing, feature extraction, feature transform, and classification. Preprocessing executes three techniques: adaptive single index retinex (AR) to enhance scene detail and color in darker areas, and anisotropic smoothing and difference of Gaussians (DoG) to normalize the illumination field. Features are extracted using two descriptors, Patterns of Oriented Edge Magnitudes (POEM) and Monogenic Binary Coding (MBC). Finally, different distance functions are used to recognize the face; accuracy was low due to the poor feature extraction mechanism. Chi et al. [25] offered a temporal non-volume-preserving approach to facial age progression and AIFR. In preprocessing, the face region was detected and aligned based on the fixed positions of the eyes and mouth corners, and a deep CNN was then used to map the texture features of the test image to the trained images in order to verify them. However, this preprocessing step omits effective operations such as normalization and noise removal, which tends to reduce system performance.

Bor et al. [26] introduced Cross-Age Reference Coding (CARC) for AIFR. It first executes a face detection algorithm to detect the face region, then extracts features from the detected region using a high-dimensional LBP algorithm, which yields 59 local features. Principal Component Analysis (PCA) is used to reduce the dimensionality of the extracted features, after which CARC recognizes the face via a local-feature transformation. More analysis of feature extraction is needed, since it plays a vital role in AIFR. Yali et al. [27] pointed out a distance-metric-optimization-driven CNN for AIFR that integrates two models, feature learning and distance metric learning, through a CNN whose parameters are optimized with a network propagation algorithm. The CNN learns features in the convolution layers and recognizes faces using the distance metric; recognized images are then retrieved. Here, the recognition rate was low due to ineffective feature extraction.

Pournami et al. [28] offered a deep learning and multiclass SVM algorithm to recognize faces. Preprocessing, consisting only of image resizing, was performed to increase face recognition accuracy. A CNN feature descriptor was used to extract features: the fully connected layer extracts features from the image, which are then given as input to the multiclass SVM classifier. Since only resizing was performed in preprocessing, more noise is carried into the extracted features. Garima et al. [29] suggested techniques for face verification across age progression with a large age gap. First, image normalization converts the RGB image into grayscale and rotates it so that the eyes are aligned horizontally. Face features are extracted with the Center-Symmetric Local Binary Pattern (CSLBP) algorithm, and a weighted k-Nearest Neighbor (k-NN) algorithm recognizes faces from the extracted features. k-NN does not perform well on large datasets, reducing face recognition accuracy. Saroj et al. [30] pointed out the pyramid binary pattern for age-invariant face verification, extracting texture features that are fed to PCA for dimensionality reduction and then classified with an SVM. Since only texture features are extracted for age-invariant classification, accuracy suffers on datasets containing images with large age gaps.


Mrudula et al. [31] offered face recognition across aging using GLBP features. Preprocessing performs three sequential processes: image resizing, RGB-to-gray conversion, and illumination normalization. A combined feature descriptor known as GLBP, built from the LBP and Gabor descriptors, extracts features from the given image. For classification, PCA reduces the feature dimensionality and a k-NN algorithm recognizes the face across aging; however, the GLBP descriptor introduces a high false-positive rate in AIFR. Zhen et al. [32] pointed out local polynomial contrast binary patterns for face recognition. Polynomial filters extract attributes from the image, and the LBP descriptor extracts texture. The Fisher Linear Discriminant (FLD) algorithm reduces the dimensionality of the extracted features, which are classified with a nearest-neighbor classifier against the training set. The nearest-neighbor classifier consumes more time, since all the work is performed at the testing stage.

Mohanraj et al. [33] suggested an ensemble of CNNs for face recognition to resolve aging, pose variation, and low-resolution problems. Preprocessing resizes the given image; features are then extracted using three different CNNs, concatenated, and given to a random forest classifier to predict the person. Noise removal was not performed in preprocessing, which reduces recognition accuracy. Rupali et al. [34] introduced component-based face recognition considering three face components: nose, lips, and ears. Preprocessing resizes the image; features are extracted from the nose and face regions using a CNN, reduced in dimension with the FLD algorithm, and given to a k-NN classifier to predict the image. In k-NN, choosing the initial K value is complex, which can lead to ineffective results. Venkata et al. [35] pointed out real-time face recognition using deep learning and LBP: the image is resized, LBP extracts features, and a CNN weights each feature in order to estimate the match with the training images. Since only texture features are extracted to recognize faces across aging, the recognition rate tends to be reduced.

Mohsen et al. [36] offered age-based human face image retrieval using Zernike moments, whose Zernike Basis Functions (ZBF) capture both local and global features from the face image; a Multi-Layer Perceptron (MLP) algorithm then recognizes the age of the training image. The MLP classifier did not obtain accurate results, reducing the recognition rate. Danbei et al. [37] offered a face-aging synthesis application based on feature fusion. Face detection is performed first and feature points are positioned using triangulation and affine transformations. Facial texture features are extracted to recognize faces across aging and are fused to match the training images effectually. More analysis of facial recognition is needed, since the work only describes the pipeline up to the feature fusion process.

#### **3. Problem statement**

Kishore et al. [38] offered a Hybrid Local Descriptor (HLD) and LDA-assisted K-Nearest Neighbor classification for AIFR. Here, a Gaussian filter was used to reduce noise, which results in information degradation, since it removes fine details and blurs the resulting image. WLD-based feature extraction loses further information due to its lack of pixel-level consideration, and K-NN-based classification requires more time because it has no training phase and finding a good similarity measure is also difficult. Muhammad et al. [39] introduced Demographic Features (DF)-assisted AIFR and retrieval. In this work, feature extraction takes more time, since each feature is extracted by three individual CNNs; moreover, the position and orientation of the object are ignored in the hidden layers of the CNN, which lowers the accuracy of feature extraction and recognition. Chenfei et al. [40] presented Coupled Auto Encoder (CAN)-based feature extraction for AIFR. Herein, feature extraction was not effective due to the lack of texture- and shape-oriented features; in CAN, data relationships are not considered, which affects the classification results, and weight computation is also very difficult. Huiling et al. [41] introduced Identity Inference Model (IIM)-based age subspace learning to recognize images in AIFR. Here, wLBP-based feature extraction results in less accuracy, since the extracted features contain considerable noise owing to the absence of a noise removal process. Fahad et al. [42] introduced Composite Temporal Spatio (CTS) modeling in order to recognize images in AIFR. Preprocessing is required to improve accuracy in age-invariant face recognition, since the image database contains illumination and pose variations; in addition, Naïve Bayes–based classification results are biased, since the method does not model class-conditional dependency.

#### **4. Proposed work**

This section describes our proposed method in detail, along with the algorithms it utilizes.

#### **4.1 System overview**

Our Multi-Feature-assisted AIFR (MF-AIFR) method tackles the problems present in previous AIFR works. For this purpose, MF-AIFR establishes five consecutive processes: IQE, preprocessing, pose normalization, feature extraction and fusion, and feature recognition and retrieval, as depicted in **Figure 1**. The novelty of our work lies in the IQE step, since previous AIFR methods do not evaluate image quality. In order to save time, MF-AIFR performs IQE so that only images failing the IQT are given to the preprocessing step; all others proceed directly to pose normalization. During preprocessing, MF-AIFR performs two processes: illumination normalization using the DGC-CLAHE algorithm and noise removal using the ASBF algorithm. Pose normalization, executed with the EA-AT algorithm, enhances feature-extraction performance. Multiple features are extracted from three regions of the face image, namely the periocular, mouth, and nose regions, in order to enhance accuracy. Two descriptors are employed: a CNN for texture features and SIHKS for demographic and shape features, where the demographic features comprise age, gender, and race. The extracted features are fused using CCA in order to simplify the recognition process. For recognition and retrieval, MF-AIFR uses the SVM algorithm, which has high scalability compared with other machine learning algorithms.

*Multi-Features-Assisted Age Invariant Face Recognition and Retrieval Using CNN with Scale... DOI: http://dx.doi.org/10.5772/intechopen.104944*

**Figure 1** illustrates the architecture of our proposed work. The processes depicted in the architecture are described in the upcoming sections.

#### *4.1.1 Image quality evaluation (IQE)*

Reducing computation time in AIFR and retrieval is important for efficient performance. For this purpose, MF-AIFR performs a novel IQE, which estimates an IQM for each image. The IQM comprises the following metrics: Brightness Evaluation, Region Contrast Evaluation, Edge Blur Evaluation, Color Quality Evaluation, and Noise Evaluation.

These metrics are designated as follows:

*Brightness Evaluation:* It defines the darkness degree of the image for easy viewing. It can be measured as follows:

$$I\_{BD} = \frac{\sum\_{k\_1=1}^{N\_1} \sum\_{k\_2=0}^{255} h\_{k\_1 k\_2} \times (k\_2)^s}{N} \tag{1}$$

Where *hk*1*k*2 represents the number of pixels with gray value *k*2 in the histogram of the *k*1th image block, *s* is a parameter set to *s* = 3, and *N* indicates the number of sample blocks.

*Region Contrast Evaluation:* This metric measures how effectively differences within the image can be distinguished. It can be measured through the below expression:

$$I\_{CD} = \frac{\sum\_{k=1}^{N\_1} \left[I\_k^{\max} - I\_k^{\min}\right] / \left[I\_k^{\max} + I\_k^{\min}\right]}{N} \tag{2}$$

Where *Ik*max and *Ik*min represent the maximum and minimum gray values of the *k*th image block.

*Edge Blur Evaluation:* It defines the clearness of the image for easy analysis. This metric can be measured as follows:

$$I\_{EBD} = \max\_{I \in \mathbb{G}} \frac{\arctan\left[I(i\_1, j\_1) - I(i\_2, j\_2)\right]}{Wid\_{12}} \tag{3}$$

Where 𝔾 denotes the set of image blocks, *I*(*i*1, *j*1) and *I*(*i*2, *j*2) represent the gray values at the first and second edge spread points, and *Wid*12 represents the width between the edge spread points (*i*1, *j*1) and (*i*2, *j*2).

*Color Quality Evaluation:* This metric defines the image quality in terms of the color. It can be measured using following expression:

$$I\_{CQD} = \frac{\sum\_{i=1}^{l} \sigma\_i^C}{l} \tag{4}$$

Where *σ<sup>C</sup><sub>i</sub>* represents the standard deviation of the *i*th component intensity in the HSV color space, and *l* indicates the number of channels, *l* = 3.

*Noise Evaluation:* This metric can be used to measure noise present in the image. It can be measured using following expression:

$$I\_{ND} = \frac{\sigma\_n}{I\_{BD}}\tag{5}$$

Where *σ<sup>n</sup>* represents the standard deviation of the image block. It can be estimated using below expression,

$$
\sigma\_n = a \times \lg\left(b \times \frac{255}{I\_{BD}}\right) \times \min\_{I \in \mathbb{G}} \sigma\_i \tag{6}
$$

Where *a* and *b* are constant values.


Using the above parameters, we estimate the IQM for each image as follows:

$$IQM = \frac{I\_{BD} + I\_{CQD} + I\_{CD}}{I\_{ND} + I\_{EBD}} \tag{7}$$

After computing the IQM, this value is compared with the IQT in order to decide whether the next process for the given image is preprocessing or pose normalization.

$$Ne\_{p,i} = \begin{cases} IQM\_i > IQT \rightarrow \text{Pose Normalization} \\ IQM\_i < IQT \rightarrow \text{Preprocessing} \end{cases} \tag{8}$$

Where *Ne<sub>p,i</sub>* represents the next process for image *i*. Using the above condition, we select the next process for each given image, avoiding wasted time on preprocessing every image and thus reducing the total computation time for recognition and retrieval.
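The routing logic of Eqs. (2), (7), and (8) can be sketched in a few lines. This is a minimal sketch: the helper names are hypothetical, and the threshold value used below is illustrative, not one fixed by the chapter.

```python
import numpy as np

def region_contrast(blocks):
    # Eq. (2): mean Michelson-style contrast over the image blocks
    vals = [(b.max() - b.min()) / (b.max() + b.min()) for b in blocks]
    return sum(vals) / len(vals)

def iqm(i_bd, i_cqd, i_cd, i_nd, i_ebd):
    # Eq. (7): quality gains over quality degradations
    return (i_bd + i_cqd + i_cd) / (i_nd + i_ebd)

def next_process(iqm_i, iqt):
    # Eq. (8): good-quality images skip preprocessing entirely
    return "pose_normalization" if iqm_i > iqt else "preprocessing"
```

For example, an image whose combined quality score exceeds the threshold is routed straight to pose normalization, while a low-scoring image goes through preprocessing first.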

#### *4.1.2 Preprocessing*

MF-AIFR performs preprocessing in order to enhance the recognition rate. Two processes are carried out in preprocessing: illumination normalization and noise removal.

#### *4.1.2.1 Illumination normalization*

Illumination normalization is performed in order to enhance image quality and to avoid negative illumination effects in the image. MF-AIFR adopts the DGC-CLAHE algorithm for illumination normalization. The proposed DGC-CLAHE performs better than the existing CLAHE method: it adaptively enhances both the luminance and the contrast of the image. Our DGC-CLAHE algorithm performs dual gamma correction, which enhances the dark areas of the image, and adaptively sets the clip points of each image depending on the dynamic range of each block. The first gamma correction boosts the overall luminance of the image block; the second adjusts the contrast in very dark regions in order to avoid over-enhancement of bright regions.

Initially, DGC-CLAHE sets clip point adaptively based on the dynamic range, which can be expressed as follows:

$$\beta = \frac{p}{d\_r} \left( 1 + \tau \frac{g\_{\max}}{R} + \frac{\alpha}{100} \left( \frac{\sigma}{A\_v + c} \right) \right) \tag{9}$$

Where *p* represents the number of pixels in each block and *dr* denotes the dynamic range of the block. *τ* and *α* are constant parameters used to control the weights of the dynamic range and the entropy. *σ* indicates the standard deviation of the block, *Av* its mean value, and *c* a small constant to avoid division by 0. *R* represents the entire dynamic range of the image, and *g*max the maximum pixel value of the image. After the clip points are set, the dual gamma corrections are performed.

DGC-CLAHE defines enhancement weight for the global gray levels of the blocks by first gamma correction (*γ*1), which can be expressed as follows:

$$W\_e = \left(\frac{Gr\_{\max}}{Gr\_{ref}}\right)^{1-\gamma\_1} \tag{10}$$

Where *Grmax* indicates the maximum gray value of the image and *Grref* the reference gray value. The first (*γ*1) and second (*γ*2) gamma corrections are given by:

$$\gamma\_1 = \frac{\ln\left(o + cdf\_w(Gr\_l)\right)}{8} \tag{11}$$

$$\gamma\_2 = \frac{1 + cdf\_w(Gr\_l)}{2} \tag{12}$$

Where *o* is a constant, *cdfw* is the cumulative distribution function weight, and *Grl* represents the gray level of the image. *γ*1 and *γ*2 increase with *Grl* in order to avoid under-enhancement in the darker regions of the image. Normalization based on these two gamma-correction settings gives better results on images with nonuniform illumination; thus it enhances the image effectively, which in turn increases the recognition rate.
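Eqs. (9), (11), and (12) can be prototyped directly. In this sketch, the constant values (`tau`, `alpha`, `c`, `o`) are illustrative assumptions, not values fixed by the chapter, and the CDF weight is passed in as a precomputed scalar.

```python
import numpy as np

def clip_point(block, tau=0.1, alpha=10.0, R=255.0, c=1e-6):
    # Eq. (9): adaptive clip point from the block's dynamic range and statistics
    # (tau, alpha, c are illustrative constants)
    p = block.size
    d_r = float(block.max() - block.min()) + c      # dynamic range of the block
    sigma, a_v = block.std(), block.mean()
    g_max = float(block.max())
    return (p / d_r) * (1.0 + tau * g_max / R + (alpha / 100.0) * (sigma / (a_v + c)))

def dual_gammas(cdf_w_val, o=1.0):
    # Eqs. (11)-(12): both corrections grow with the CDF weight at gray level Gr_l
    g1 = np.log(o + cdf_w_val) / 8.0
    g2 = (1.0 + cdf_w_val) / 2.0
    return g1, g2
```

Both gamma values increase monotonically with the CDF weight, matching the claim that *γ*1 and *γ*2 increase with the gray level to avoid under-enhancement of dark regions.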

#### *4.1.2.2 Noise removal*

Noise removal is a substantial process in face recognition with regard to enhancing recognition accuracy. For this purpose, MF-AIFR utilizes the ASBF algorithm to remove noise from the given image. The proposed ASBF algorithm preserves the fine details of the image while removing noise and also sharpens the image. ASBF removes universal noise types such as impulse and Gaussian noise.

In the ASBF algorithm, noisy pixels are detected using the Sorted Quadrant Median Vector (SQMV), which incorporates significant features such as edge and texture information. ASBF executes three sequential processes, as depicted in **Figure 2**. Initially, an Adaptive Median Filter (AMF) identifies the corrupted pixels in the image. Secondly, the edges of the image are preserved using an edge detector, which accurately predicts edge existence in the current window, and a noise detector classifies the noise as impulse or Gaussian. Finally, the Switching Bilateral Filter (SBF) contains a range filter that switches between impulse and Gaussian modes based on the noise detector result.

#### *4.1.2.2.1 AMF*

Existing noise-filtering algorithms use a constant window size, such as 3 × 3, which may fail to distinguish noisy and noise-free pixels accurately and thus produces a blurred output image. To avoid this drawback, our AMF adaptively changes the window size based on the number of noisy pixels present in the given image.
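The adaptive-window idea can be sketched as follows. This assumes a common AMF formulation in which the window grows until its median is free of impulse extremes; the chapter does not spell out the exact growth rule, so treat this as an illustration.

```python
import numpy as np

def adaptive_median(img, i, j, max_half=3):
    """Adaptive median filter for one pixel: grow the window until the
    median is not an impulse extreme (a sketch of the AMF idea)."""
    for w in range(1, max_half + 1):
        patch = img[max(0, i - w):i + w + 1, max(0, j - w):j + w + 1]
        med, mn, mx = np.median(patch), patch.min(), patch.max()
        if mn < med < mx:                        # median is impulse-free
            pix = img[i, j]
            return pix if mn < pix < mx else med # replace only corrupted pixels
        w += 0                                   # otherwise enlarge window and retry
    return med                                   # fall back to largest window's median
```

A clean pixel passes through unchanged, while a salt-type impulse is replaced by the local median.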


#### *4.1.2.2.2 Noise detector*

The noise detector predicts whether a pixel is filtered by the SBF Gaussian mode (*SBFg*) or the SBF impulse mode (*SBFi*). Let *S*1 and *S*2 be binary control signals, where *S*1 is generated by the AMF and *S*2 by the noise detector. The filtered image is then represented as follows:

$$I\_f = \begin{cases} SBF\_g & S\_1 = 1, \; S\_2 = 1 \\ SBF\_i & S\_1 = 1, \; S\_2 = 0 \\ \rho\_{i,j} & S\_1 = 0, \; S\_2 = 0 \end{cases} \tag{13}$$

Finally, pixels with Gaussian and impulse noise are classified based on the above conditions, and these outputs are given as input to the SBF with SQMV.

*SBF with SQMV:* The SBF switches its mode based on the classification results from the noise detector. The SQMV scheme predicts the optimum median effectively even for a large window: it detects a noisy pixel by estimating the difference between the current pixel and a reference median pixel, and if the difference is large, the current pixel is considered noisy. Let *ρi*,*j* be the current pixel and *ρi*+*m*,*j*+*t* the pixels in a (2*N* + 1) × (2*N* + 1) window surrounding *ρi*,*j*.

The output from the SBF filter is expressed as follows:

$$O\_{i,j} = \frac{\sum\_{m=-N}^{N} \sum\_{t=-N}^{N} W\_{g}(m,t) W\_{sr}(m,t) \rho\_{i+m,j+t}}{\sum\_{m=-N}^{N} \sum\_{t=-N}^{N} W\_{g}(m,t) W\_{sr}(m,t)} \tag{14}$$

Where,

$$W\_{g}(m,t) = e^{-\frac{m^2 + t^2}{2\sigma\_r^2}} \tag{15}$$

$$W\_{sr}(m,t) = e^{-\frac{\left(I - \rho\_{i+m,j+t}\right)^2}{2\sigma\_R^2}} \tag{16}$$

Where *I* represents the reference median for impulse noise (*S*1 = 1 and *S*2 = 1), and *I* = *ρi*,*j* for Gaussian noise (*S*1 = 1 and *S*2 = 0).

From the above discussion, we conclude that the proposed ASBF removes both Gaussian and impulse noise while preserving the fine details of the image. This preprocessing increases the accuracy of AIFR.
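Eq. (14) with the weights of Eqs. (15) and (16) can be sketched for a single output pixel. The mode switch is modeled simply by passing a different reference intensity `ref` (the SQMV median in impulse mode, the center pixel in Gaussian mode); parameter values are illustrative.

```python
import numpy as np

def sbf_pixel(img, i, j, ref, sigma_s=1.0, sigma_r=20.0, n=1):
    """One output pixel of Eq. (14); `ref` is the reference intensity I."""
    num = den = 0.0
    for m in range(-n, n + 1):
        for t in range(-n, n + 1):
            w_g = np.exp(-(m * m + t * t) / (2.0 * sigma_s ** 2))               # spatial weight, Eq. (15)
            w_sr = np.exp(-(ref - img[i + m, j + t]) ** 2 / (2.0 * sigma_r ** 2))  # range weight, Eq. (16)
            num += w_g * w_sr * img[i + m, j + t]
            den += w_g * w_sr
    return num / den
```

On a constant region the filter is an identity, while an impulse pixel filtered with a median reference is pulled toward its neighbors, because its range weight collapses to nearly zero.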

#### *4.1.2.3 Pose normalization*

Pose normalization is a substantial process for increasing accuracy in face recognition. Since our database, FG-NET, contains images in different poses, pose normalization is required before feature extraction and retrieval. MF-AIFR uses the EA-AT algorithm to rotate the different poses into the frontal view, thereby increasing feature-extraction efficiency. EA-AT first estimates the pose angles of the given image using Euler angles; the estimated angles are then provided to an affine transformation to obtain the frontal view of the image. Euler angles are three angles describing the orientation of the face with respect to a fixed coordinate frame.

**Figure 3** illustrates the Euler angles with their coordinates in the Z vector. The three angles are yaw, pitch, and roll. The yaw angle (*α*) is estimated using the below expression:

$$\alpha = \arccos\left(Z\_3\right) \tag{17}$$

Where *Z*<sup>2</sup> and *Z*<sup>3</sup> represent the Z vectors of the given image.

**Figure 3.** *Euler angles representation.*

*Multi-Features-Assisted Age Invariant Face Recognition and Retrieval Using CNN with Scale... DOI: http://dx.doi.org/10.5772/intechopen.104944*

Roll angle (*φ*) can be estimated using below expression:

$$\varphi = \arccos\left(-\frac{Z\_2}{\sqrt{1 - Z\_3^2}}\right) \tag{18}$$

Pitch angle (*τ*) can be estimated using below expression:

$$\tau = \arccos\left(\frac{Y\_3}{\sqrt{1 - Z\_3^2}}\right) \tag{19}$$

These three angles are given as input to the affine transformation algorithm in order to rotate the face into the correct view. There are four basic affine transformations, illustrated as follows:


In mathematical form, an affine transformation of ℝ<sup>n</sup> is a map *F*: ℝ<sup>n</sup> → ℝ<sup>n</sup>:

$$F(s) = L\_t(s) + \mathbb{Q}, \quad s \in \mathbb{R}^n \tag{20}$$

Where *Lt* indicates a linear transformation of ℝ<sup>n</sup>, and *ℚ* is a translation vector in ℝ<sup>n</sup>. The rotations performed in the affine transformation are illustrated as follows:

$$\begin{aligned} \text{Rotation about } x \text{ axis:} \quad &\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta\_x & -\sin\theta\_x & 0 \\ 0 & \sin\theta\_x & \cos\theta\_x & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \\ \text{Rotation about } y \text{ axis:} \quad &\begin{bmatrix} \cos\theta\_y & 0 & \sin\theta\_y & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta\_y & 0 & \cos\theta\_y & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \\ \text{Rotation about } z \text{ axis:} \quad &\begin{bmatrix} \cos\theta\_z & -\sin\theta\_z & 0 & 0 \\ \sin\theta\_z & \cos\theta\_z & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \end{aligned} \tag{22}$$


#### **Table 1.**

*Features description.*

Where *θx*, *θy*, and *θz* represent the rotations about the three axes, known as Euler angles. This rotation via the affine transformation yields the frontal view of the given image. After pose normalization, we crop the images so that all images have the same size.
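The rotation matrices of Eq. (22) can be composed to undo an estimated pose. This is a sketch: the z-y-x composition order and the `frontalize` helper are assumptions, since the chapter does not fix the order in which the three rotations are applied.

```python
import numpy as np

def rot_x(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]])

def rot_y(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]])

def rot_z(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])

def frontalize(pts_h, yaw, pitch, roll):
    # Undo the estimated pose by applying the inverse rotations in reverse order
    # (pts_h: homogeneous row vectors, one landmark per row).
    R = rot_x(-pitch) @ rot_y(-yaw) @ rot_z(-roll)
    return (R @ pts_h.T).T
```

Applying a pose and then `frontalize` with the same angles recovers the original landmark positions, which is the property the EA-AT step relies on.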

#### *4.1.2.4 Feature extraction and fusion*

Feature extraction and fusion form a major part of this work in order to produce optimum results in AIFR. Our MF-AIFR extracts multiple features from three regions of the face image: periocular, nose, and mouth. These three regions are significant for recognizing the image across aging. From these regions, we extract three types of features, namely texture, shape, and demographic features, which are summarized in **Table 1**. The texture features are extracted using the CNN descriptor, and the SIHKS descriptor is used to extract the shape- and demographic-related features.

#### *4.1.2.4.1 Texture feature extraction*

Our MF-AIFR utilizes a CNN descriptor for texture feature extraction, since it provides robust performance by learning features layer by layer. The CNN applies multiple filters to the raw input image in order to extract high-level features. We extract six texture features from the given image: contrast, dissimilarity, entropy, homogeneity, correlation, and angular second moment. The CNN consists of three types of layers: the convolutional layer, the pooling layer, and the fully connected layer.

#### *4.1.2.4.2 Convolutional layer*

The convolutional layer receives the image from the input layer and is made up of a set of learnable filters. In our work, the convolutional layer comprises six filters, generating six feature maps. Each feature map is the result of one filter convolved over the whole image. The convolution operation can be described as follows:

$$x\_j^l = a\_f\left(\sum\_{i \in M\_l} x\_i^{l-1} \ast f\_{ij}^{\,l} + b\_j^l\right) \tag{24}$$


Where *af* denotes the activation function, *j* indexes the convolution feature map, *l* the layer of the CNN, *fij* the filter, *bj* the feature-map bias, and *Ml* a selection of feature maps.

#### *4.1.2.4.3 Pooling layer*

The pooling layer performs a downsampling operation in order to reduce the spatial size of the convolutional layers. The pooling operation is applied to the pixel values captured by the pooling mask and is described as follows:

$$p\_j^l = a\_f\left(C\_j^l \, pool\left(p\_j^{l-1}\right) + b\_j^l\right) \tag{25}$$

Where *p<sub>j</sub><sup>l</sup>* represents the result of pooling applied to the *j*th region of the input image, *p<sub>j</sub><sup>l−1</sup>* the *j*th region of interest captured by the pooling mask in the previous layer, and *C<sub>j</sub><sup>l</sup>* the trainable coefficient.

#### *4.1.2.4.4 Fully connected layer*

The fully connected layer combines the features obtained in the preceding layers. The results of the last convolutional and pooling layers are given as input to the fully connected layer in order to extract the final features.
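Eqs. (24) and (25) can be sketched for a single filter in plain NumPy. Taking the activation *af* to be ReLU and the pooling mask to be a 2 × 2 max is an assumption; the chapter names neither.

```python
import numpy as np

def conv2d(x, f, b=0.0):
    # Eq. (24) for one filter: valid cross-correlation plus bias, ReLU activation
    kh, kw = f.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * f) + b
    return np.maximum(out, 0.0)          # a_f = ReLU (assumed)

def max_pool(x, k=2):
    # Eq. (25) with a max pooling mask: downsample by taking block maxima
    H, W = x.shape
    return x[:H // k * k, :W // k * k].reshape(H // k, k, W // k, k).max(axis=(1, 3))
```

Running six such filters over the input would produce the six feature maps described above, which the pooling layer then shrinks spatially.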

#### *4.1.2.4.5 Shape and demographic feature extraction*

Shape and demographic features are extracted using the SIHKS algorithm. The shape features are the boundaries of the eyes, nose, and mouth, together with convexity and solidity. The demographic features comprise age, race, and gender information, where the race feature represents the skin tone of the face image. These features play a key role in recognizing faces across aging.

The proposed SIHKS descriptor performs better than the HKS algorithm, since the conventional method is sensitive to scale, especially global scale. SIHKS offers better scale invariance and can operate at any point, even when scale selection is impossible. In addition, it performs well in extracting shape- and demographic-oriented features compared with other shape descriptors. SIHKS extracts features in three steps, listed as follows:

• Logarithmic sampling in time *t*, which can be expressed by the below equation:

$$h\_t' = h(x, \alpha^t) \tag{26}$$

Where *ht*′ represents the logarithmically sampled heat kernel signature.

• Taking the logarithm of the heat signature and differentiating with respect to time, which can be described by the below equation:

$$h\_t' = h\_{t+s}, \quad \text{with} \quad \dot{h}\_t = \log h\_{t+1} - \log h\_t \tag{27}$$

Where *ht*+*s* represents the shifted heat kernel signature.

**Figure 4.** *Feature extraction in CNN.*

• Taking the discrete-time Fourier transform of the heat signature, which can be expressed by the below equation:

$$F\left[h\_t'\right](w) = H'(w) = H(w)e^{-2\pi i w s} = F\left[h\_{t+s}\right](w) \tag{28}$$

With the above steps, SIHKS estimates the scale-invariant quantity |*H*(*w*)| at each point *x* without performing scale selection. Using this quantity, the SIHKS algorithm estimates the shape- and demographic-oriented features effectively.
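The three steps can be sketched in a few lines, assuming the input signature has already been sampled at the logarithmic times α^t of Eq. (26):

```python
import numpy as np

def sihks(hks):
    """Steps (26)-(28): take logs, differentiate, keep the Fourier modulus,
    which discards the shift induced by scaling (a sketch)."""
    logh = np.log(hks)
    d = np.diff(logh)                # Eq. (27): log h_{t+1} - log h_t
    return np.abs(np.fft.fft(d))     # Eq. (28): |H(w)| is the invariant
```

One easy-to-check consequence: a global multiplicative rescaling of the signature leaves the descriptor unchanged, because the logarithm turns it into an additive constant that the difference removes.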

**Figure 4** illustrates the texture feature extraction in CNN with their significant layers such as convolutional layer, pool layer, and fully connected layer.

#### *4.1.2.4.6 Feature fusion*

Feature fusion is performed to reduce the dimension of the extracted features, namely the shape, texture, and demographic features. This dimensionality reduction yields better face-recognition performance and makes recognition and retrieval easier. For this purpose, our MF-AIFR utilizes the CCA algorithm, which performs feature fusion effectively.


Feature fusion is defined as the combination of multiple feature vectors into a single feature vector. The proposed CCA is a statistical tool for recognizing linear relationships among sets of feature vectors in order to determine the inter-subject covariances. The canonical covariates of the given feature vectors are obtained using the below expression:

$$A\_1^T = u X\_1^T, \quad A\_2^T = v X\_2^T, \quad A\_3^T = d X\_3^T \tag{29}$$

Where *A*1, *A*2, *A*3 represent the canonical covariates of the feature vectors *X*1, *X*2, *X*3, which correspond to the texture, shape, and demographic features, and *u*, *v*, *d* are the corresponding eigenvectors.
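A minimal two-view CCA sketch of Eq. (29) follows (the third view is fused pairwise in the same way). The small regularizer `reg` is an implementation assumption added for numerical stability, not part of the chapter's formulation.

```python
import numpy as np

def cca_covariates(X1, X2, dim=1, reg=1e-6):
    """Two-view CCA: returns the canonical covariates A1 = X1 u, A2 = X2 v."""
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])   # within-set covariances
    S22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / n                               # between-set covariance

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # whitened cross-covariance; its singular vectors give the directions u, v
    K = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    U, s, Vt = np.linalg.svd(K)
    u = inv_sqrt(S11) @ U[:, :dim]
    v = inv_sqrt(S22) @ Vt.T[:, :dim]
    return X1 @ u, X2 @ v
```

When the two feature sets are nearly linearly related, the leading canonical covariates are almost perfectly correlated, which is what makes concatenating them a compact fused representation.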

#### *4.1.3 Recognition and retrieval*

Recognition and retrieval constitute the final process of our MF-AIFR and are performed using the SVM algorithm. We select SVM in order to correctly recognize the face across aging and to retrieve the recognized images for a given input image. **Figure 5** illustrates the input and output space models of the SVM algorithm.

**Figure 5.** *SVM input and feature space representation.*

The proposed SVM algorithm performs well even on unstructured and semi-structured data. In addition, SVM scales relatively well to high-dimensional databases. SVM takes as input the fused features obtained from the CCA algorithm. SVM is a binary classification method that discovers the optimal linear decision surface based on the concept of structural risk minimization. The decision surface is a weighted combination of elements of the training set; these elements are the support vectors and characterize the boundary between the two classes. The output of the SVM algorithm is a set of support vectors *Si*, coefficient weights *we*, class labels *yi* of the support vectors, and a constant term *b*.

The linear decision surface is represented as follows:

$$k \cdot z + b = 0 \tag{30}$$

Where *k* represents the weight vector, *b* the bias term, and *z* the training or testing data. These two parameters determine the position and orientation of the separating hyperplane. The weight vector *k* is calculated using the below expression:

$$k = \sum\_{i=1}^{N\_l} w\_{\varepsilon i} y\_i \mathbf{S}\_i \tag{31}$$

The kernel function plays a vital role in SVM, as it determines how effectively features are classified. In MF-AIFR, we use the Radial Basis Function (RBF) kernel, which performs well compared with other kernel functions and does not require any prior knowledge about the data. It can be expressed as follows:

$$r(v - v\_i) = e^{-\delta \left\lVert v - v\_i \right\rVert^2} \tag{32}$$

Here, *δ* represents the regularization parameter, and *v* − *vi* represents the difference between feature vectors. By utilizing the RBF kernel function, our MF-AIFR method recognizes and retrieves the images of the same person as the given test image. For example, given an image of person "A" at age 33, it retrieves images of person "A" from ages 2 to 60, since our FG-NET database contains subjects aged 0 to 69.
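The decision rule of Eqs. (30)-(32) in kernelized form can be sketched as follows. The support vectors, weights, and labels in the test below are hand-picked toy values for illustration only.

```python
import numpy as np

def rbf(v, vi, delta=0.5):
    # Eq. (32): Gaussian radial basis kernel
    return np.exp(-delta * np.sum((np.asarray(v) - np.asarray(vi)) ** 2))

def svm_decision(z, support_vectors, weights, labels, b=0.0, delta=0.5):
    # Kernelized form of Eqs. (30)-(31): f(z) = sum_i w_i * y_i * K(S_i, z) + b
    return sum(w * y * rbf(s, z, delta)
               for s, w, y in zip(support_vectors, weights, labels)) + b
```

The sign of `svm_decision` assigns the class: points near a positive support vector score positive, and points near a negative one score negative.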

#### **5. Experimental study**

To characterize the performance of the proposed MF-AIFR, this section is divided into four parts: dataset description, experimental setup, performance metrics, and comparative analysis.

#### **5.1 Dataset description**

This section describes the dataset used in this chapter. We utilize the FG-NET database to perform face recognition and retrieval. The Face and Gesture recognition NETwork (FG-NET) aging database was released in 2004 in an attempt to support research activities regarding the changes in facial appearance


#### **Table 2.**

*Dataset description.*


#### **Table 3.**

*Different age bands of FG-NET dataset.*

caused by aging. The FG-NET database comprises 1002 images of 82 different subjects. Each subject has 6–18 images, with ages ranging from newborn to 69 years old. The FG-NET database contains considerable variations, such as pose and illumination.

**Table 2** illustrates the details of the FG-NET dataset briefly. Dataset contains 34 male subjects and 48 female subjects' images. Each subject has 1–12 images across their age progression.

Different age bands present in the FG-NET dataset are represented in **Table 3**. FG-NET dataset comprises subjects from the age of 0 to 69 years old.

#### **5.2 Experimental setup**

Our proposed MF-AIFR is implemented in MATLAB R2017b, interfaced with the C programming language, running on the Windows operating system. MATLAB is a multi-paradigm numerical computing environment developed by MathWorks. It permits matrix manipulations, implementation of algorithms, plotting of functions and data, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, C#, Java, and Python.

#### **5.3 Performance metrics**

To evaluate the performance of MF-AIFR, we consider the following metrics:

• *Accuracy:* It is defined as the ratio of correctly classified images to the total number of images. Accuracy is measured based on the succeeding expression:

$$Accuracy = \frac{\text{Correct classifications}}{\text{Total images}} \tag{33}$$

• *Recall:* It is defined as the proportion of cases of a class that are correctly classified, and is also known as the true positive rate. It can be illustrated as follows:

$$Recall = \frac{T\_P}{T\_P + F\_N} \tag{34}$$

Where *TP* is the number of positive cases correctly labeled as positive, and *FN* is the number of positive cases incorrectly labeled as negative.

• *Precision:* It is defined as the ratio of the number of correctly classified samples to all classified samples, and is also called the positive predictive value. It can be designated as follows:

$$Precision = \frac{T\_P}{T\_P + F\_p} \tag{35}$$

Where *Fp* represents the number of negative cases incorrectly detected as positive.

• *F-Score:* It is the combination of precision and recall. It can be calculated as follows:

$$F-Score = \frac{2 \ast (Recall \ast Precision)}{Recall + Precision} \tag{36}$$
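The four metrics above can be computed together from confusion-matrix counts; here is a minimal sketch with a hypothetical function name:

```python
def metrics(tp, fp, fn, tn):
    # Eqs. (33)-(36) from confusion-matrix counts
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)          # Eq. (34)
    precision = tp / (tp + fp)       # Eq. (35)
    f_score = 2 * recall * precision / (recall + precision)  # Eq. (36)
    return accuracy, recall, precision, f_score
```

When precision and recall are equal, the F-Score coincides with both, as the balanced example below shows.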


#### **5.4 Comparative analysis**

This subsection compares the simulation results of MF-AIFR with existing methods, namely HLD, DF, and CAN. We compare results using six performance metrics: Accuracy, Recall, Precision, Recognition Rate, Rank-1 Score, and F-Score. **Table 4** compares the previous methods in terms of their strengths, weaknesses, and research statements.

#### *5.4.1 Impact on accuracy*

Accuracy is one of the most significant metrics for evaluating the proposed work: it defines how accurately MF-AIFR classifies images. The performance on this metric is evaluated by varying the number of images.



**Table 4.**

*Comparisons on previous methods in AIFR.*

**Figure 6.** *Comparisons on accuracy.*

**Figure 6** compares the accuracy of MF-AIFR with that of the existing methods CAN, DF, and HLD. The comparison shows that MF-AIFR achieves better performance than the existing methods, since our method utilizes strong feature descriptors, namely CNN and SIHKS. Both algorithms extract features effectively from the three regions (periocular, nose, and mouth) that play a key role in recognizing faces across aging, and they provide robust performance even on a high-dimensional dataset. As a result, our method achieves an accuracy as high as 95%. By contrast, the CAN and DF methods attain lower accuracy due to their poorer feature extraction procedures, since they do not concentrate on these vital regions. Meanwhile, HLD obtains higher accuracy than both CAN and DF thanks to its feature extraction from the periocular region, which plays a significant role in face recognition across aging; however, it still achieves lower accuracy than our method because its descriptor loses a large amount of information during feature extraction.

**Table 5** compares the average simulation results for accuracy between the existing and proposed methods.

From the above comparison, it can be seen that our method achieves a better average accuracy of 90.2% compared with the existing methods.

#### *5.4.2 Impact on recall*

Recall evaluates the performance of MF-AIFR in terms of correct recognition of face images; it is measured by varying the number of images.



#### **Figure 7.**

*Comparisons on recall.*


#### **Table 6.**

*Recall comparisons [average].*

**Figure 7** shows that MF-AIFR yields a lower recall percentage than the other methods.

MF-AIFR correctly recognizes the face corresponding to a given test image and thus reduces false detections of face images. The reason is that our method performs pose normalization before feature extraction, which improves feature-extraction efficiency and leads to correct identification and retrieval of the test image. As a result, MF-AIFR records lower recall percentages than the existing methods. DF and CAN show high recall percentages owing to their lack of pose normalization and their complex feature-extraction procedures, while HLD lowers recall relative to DF and CAN because it avoids such complex procedures. Still, HLD's recall remains higher than MF-AIFR's because it lacks pose normalization and loses information during noise removal. **Table 6** reports the average simulated recall of the proposed and existing methods.

From this comparison, MF-AIFR records a lower average recall of 70% than the existing methods.

#### *5.4.3 Impact on precision*

Precision measures the fraction of retrieved instances that are relevant out of the total images considered; it is measured by varying the number of images.

**Figure 8** shows that MF-AIFR achieves higher precision percentages than the existing methods. MF-AIFR performs preprocessing before feature extraction and recognition: illumination normalization and noise filtering, since images in the FG-NET dataset suffer from illumination variation and noise. These two steps enhance image quality and thereby ease feature extraction and recognition. CAN and DF achieve lower precision because they lack such preprocessing, and HLD also obtains lower precision because its Gaussian-based noise filtering removes fine detail: the Gaussian filter does not preserve fine image structure and produces a blurred image.

#### **Figure 8.**

*Comparisons on precision.*

#### **Table 7.**

*Precision comparisons [average].*

**Table 7** reports the average simulated precision of the proposed and existing methods. From this comparison, we conclude that MF-AIFR achieves a better average precision of 90.6% than the existing methods.

#### *5.4.4 Impact on F-score*

The F-score takes both false positives and false negatives into account when estimating the performance of this work. It is simulated by varying the number of images.
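For reference, the precision, recall, and F-score metrics of Sections 5.4.2-5.4.4 are related as follows; the true-positive/false-positive/false-negative counts below are illustrative, not the chapter's results:

```python
# Precision, recall, and F-score from true-positive (tp), false-positive (fp),
# and false-negative (fn) counts; the counts are illustrative only.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

p, r, f = prf(tp=90, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.9 0.818 0.857
```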

**Figure 9** compares the F-score of MF-AIFR with that of the existing methods DF, CAN, and HLD. Our method achieves a higher F-score because it uses two descriptors, CNN and SIHKS, to extract texture, shape, and demographic features. The SIHKS descriptor performs very well under scale variation and provides good extraction results even when scale selection is impossible; it extracts shape and demographic features effectively, which plays a substantial role in face recognition across aging. CAN and DF attain lower F-scores owing to the absence of significant feature extraction, such as texture and shape features, while HLD also attains a lower F-score since it does not concentrate on shape-feature extraction, which reduces its recognition and retrieval efficiency.

#### **Figure 9.**

*Comparisons on F-score.*

#### **Table 8.**

*F-score comparisons [average].*

**Table 8** reports the average simulated F-score of the proposed and existing methods. From this comparison, MF-AIFR achieves a higher average F-score of 87.2% than the existing methods.

#### *5.4.5 Impact on recognition rate*

Recognition rate measures the ability of MF-AIFR to recognize faces; it is measured by varying the number of features.

**Figure 10** compares the recognition rate of MF-AIFR with that of the existing methods CAN, DF, and HLD. MF-AIFR attains a higher recognition rate than the existing methods. We adopt the SVM algorithm for recognition and retrieval, which performs well even on high-dimensional datasets. In addition, we perform feature fusion before the recognition and retrieval process.

Feature fusion reduces the dimension of the feature vectors and thus improves the performance of the SVM algorithm. Therefore, our method achieves a better recognition rate than the existing methods. DF has the lowest recognition rate because it lacks effective recognition and retrieval processes, simply ranking the images. CAN also attains a lower recognition rate than our method since it cannot establish relationships between different features. HLD's recognition rate suffers from its use of KNN for recognition: KNN takes more time, and finding a suitable similarity measure is tedious.

#### **Figure 10.**

*Comparisons on recognition rate.*

#### **Table 9.**

*Recognition rate comparisons [average].*

**Table 9** reports the average simulated recognition rate of the proposed and existing methods; the recognition rate of MF-AIFR is higher than that of the other methods.
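The fusion-plus-SVM recognition step described above can be sketched as follows. The random vectors stand in for the chapter's CNN texture and SIHKS shape features, and `scikit-learn` supplies the SVM; this is an illustration of the design, not the authors' implementation:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-ins for CNN texture features and SIHKS shape features.
texture = rng.normal(size=(60, 32))
shape = rng.normal(size=(60, 16))
fused = np.hstack([texture, shape])      # concatenation as a simple fusion
labels = np.repeat(np.arange(6), 10)     # 6 identities, 10 images each

clf = SVC(kernel="rbf").fit(fused, labels)
pred = clf.predict(fused[:5])
print(pred.shape)  # one predicted identity per probe image
```

Concatenation is only one fusion choice; the key point is that the combined vector feeds a single classifier.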

#### *5.4.6 Impact on rank-1-score*

The rank-1 score measures the cumulative match performance for the given images in the proposed work. It represents the efficacy of our work in terms of recognition and retrieval.
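The rank-1 score can be computed from a probe-to-gallery similarity matrix; this is a generic sketch with made-up similarity values, not the chapter's pipeline:

```python
import numpy as np

def rank1_rate(sim, gallery_ids, probe_ids):
    # sim[i, j] is the similarity of probe i to gallery entry j; a probe
    # counts as a rank-1 hit when its most similar gallery entry shares its id.
    best = sim.argmax(axis=1)
    return float(np.mean(gallery_ids[best] == probe_ids))

sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.1, 0.4]])
gallery_ids = np.array([0, 1, 2])
probe_ids = np.array([0, 1, 1])  # the third probe's true identity is 1
print(rank1_rate(sim, gallery_ids, probe_ids))  # 2 of 3 probes hit at rank 1
```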

**Figure 11** compares rank-1 score results with those of the existing methods. MF-AIFR attains a higher rank-1 score because our DGC-CLAHE-based illumination normalization outperforms the existing CLAHE and enhances the fine details of the image, while ASBF-based noise filtering removes noise effectively and sharpens the image. This preprocessing yields stronger matching results in face recognition. DF and CAN attain lower rank-1 scores since they do not use effective preprocessing algorithms and thus degrade the quality of the given image drastically. HLD also attains a lower rank-1 score than our method because it performs no illumination normalization and its noise filtering is ineffective. From this analysis, we conclude that MF-AIFR attains better rank-1 scores than the other methods.
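DGC-CLAHE itself is the authors' algorithm; as a rough stand-in, plain histogram equalization in NumPy illustrates the illumination-normalization idea (CLAHE additionally clips the histogram and works per tile, e.g. via OpenCV's `createCLAHE`):

```python
import numpy as np

def equalize(img):
    # Map gray levels through the normalized cumulative histogram so the
    # output spreads over the full 0-255 range.
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) * 255 / (cdf.max() - cdf.min())
    return cdf.astype(np.uint8)[img]

img = np.tile(np.arange(64, 192, dtype=np.uint8), (16, 1))  # low-contrast strip
out = equalize(img)
print(img.min(), img.max(), out.min(), out.max())  # contrast is stretched
```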


#### **Figure 11.**

*Comparisons on rank-1 score.*


#### **Table 10.**

*Rank-1 score comparisons [average].*

**Table 10** reports the average simulated rank-1 score of the proposed and existing methods. From this comparison, MF-AIFR achieves a higher average rank-1 score of 89.8% than the existing methods.

#### *5.4.7 Impact on computation time*

Computation time is evaluated by varying the number of images; this metric must be low to achieve good image-retrieval performance across aging.

**Figure 12** compares computation time with that of the existing methods CAN, DF, and HLD. MF-AIFR requires less computation time because it performs the IQE process before the preprocessing step: only images that fail the IQT undergo preprocessing, while the rest pass directly to pose normalization, avoiding the time wasted in preprocessing every input image. Our work also saves time in feature extraction and classification by using efficient algorithms (CNN, SIHKS, and SVM) that process the given inputs quickly. As a result, MF-AIFR achieves the lowest computation time. CAN and DF incur higher computation times because they preprocess all images and do not use efficient algorithms to process the given input, and HLD likewise incurs a higher computation time than MF-AIFR since it preprocesses all images regardless of their quality.
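The IQE gating logic described above can be sketched as follows; the variance-based quality score, the threshold value, and the placeholder steps are illustrative assumptions, not the chapter's actual IQE/IQT definitions:

```python
import numpy as np

IQT = 100.0  # hypothetical image-quality threshold

def quality(img):
    return float(img.var())  # stand-in quality score; the real IQE differs

def preprocess(img):
    return img  # placeholder for DGC-CLAHE + ASBF

def pose_normalize(img):
    return img  # placeholder for EA-AT

def pipeline(img):
    # Only low-quality images pay the preprocessing cost; every image
    # still goes through pose normalization.
    if quality(img) < IQT:
        img = preprocess(img)
    return pose_normalize(img)

low = np.full((8, 8), 100.0)           # flat image: variance 0, fails IQT
high = np.arange(64.0).reshape(8, 8)   # high-variance image, passes IQT
print(quality(low) < IQT, quality(high) < IQT)  # True False
```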

#### **Figure 12.**

*Comparisons on computation time.*


#### **Table 11.**

*Computation time comparisons [average].*

**Table 11** compares computation times and shows that our method attains a lower computation time of 12.4 ms than the other methods, including HLD, DF, and CAN.

#### **5.5 Research highlights**

This section highlights this research on face recognition across aging. To achieve better performance in AIFR, our work establishes five consecutive processes. **Table 12** describes the benefits of the proposed algorithms along with their functionalities, illustrating each algorithm's contribution to the performance metrics: precision, recall, accuracy, recognition rate, and rank-1 score.

#### **Table 12.**

*Benefits of proposed algorithms.*

### **6. Conclusion and future work**

Face recognition across aging is challenging because human faces change with age progression. To address this bottleneck, this chapter proposes the MF-AIFR method, in which five successive processes are performed. First, IQE reduces the time spent in preprocessing and thus improves the system's overall performance: only images that fail the IQT enter the preprocessing step. Second, preprocessing performs illumination normalization using DGC-CLAHE and noise removal using the ASBF algorithm, which improve the accuracy of face recognition and retrieval. Third, pose normalization adopts the EA-AT algorithm to enhance feature-extraction efficacy. Fourth, two descriptors, CNN and SIHKS, extract multiple features (texture, shape, and demographic) from three regions: periocular, nose, and mouth; CNN extracts texture features, while SIHKS extracts shape and demographic features, which increases our recognition rate. Finally, recognition and retrieval use the SVM algorithm, which follows a simple procedure and provides good results. We evaluate the performance of the MF-AIFR system with seven metrics (accuracy, recall, precision, rank-1 score, F-score, recognition rate, and computation time) and show that our work performs better than the existing methods HLD, DF, and CAN.

#### **Author details**

Kishore Kumar Kamarajugadda\* and Movva Pavani

Faculty of Science and Technology, Department of Electronics and Communication Engineering, ICFAI Foundation for Higher Education, Hyderabad, India

\*Address all correspondence to: kkishore@ifheindia.org

© 2022 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


### *Edited by Marco Antonio Aceves Fernandez and Carlos M. Travieso-Gonzalez*

Artificial Intelligence (AI) has attracted the attention of many researchers and users alike, and it has become increasingly crucial in our modern society. From cars, smartphones, airplanes, medical equipment, consumer applications, and industrial machines, among others, the impact of AI is notoriously changing the world we live in and making it better in many areas. However, it is equally important to remember that every progress comes with certain challenges that we as a society must address.

### *Andries Engelbrecht, Artificial Intelligence Series Editor*

Published in London, UK © 2022 IntechOpen

Artificial Intelligence Annual Volume 2022

IntechOpen Series

Artificial Intelligence, Volume 12
