**2. Theoretical background of HOG and MLP models**

*Multi-Object Recognition Using a Feature Descriptor and Neural Classifier. DOI: http://dx.doi.org/10.5772/intechopen.106754*

As mentioned earlier, we will focus on a type of descriptor called the feature descriptor. In general terms, this approach uses four steps to compute a descriptor from an image:

1. An edge detector is applied to the image.
2. A basis point is chosen, which is a coordinate in the edge map; then a template, usually circular, is defined centered at that point and divided into sections of the same size. These sections divide the image into regions, each of which corresponds to one dimension of the feature vector.
3. The value of a dimension is calculated as the number of edge pixels that fall into the corresponding region, yielding a histogram that summarizes the spatial distribution of edges in the image relative to the chosen basis point. It is common to use the term "bin" to refer both to the region in the image and to the dimension in the feature vector.
4. The feature vector is normalized.
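The four-step scheme can be sketched as follows. This is a minimal NumPy illustration, not the chapter's exact method: the log-polar bin layout and all function and parameter names (`shape_descriptor`, `n_r`, `n_theta`) are assumptions made for the example.

```python
import numpy as np

def shape_descriptor(edge_map, center, n_r=3, n_theta=8, r_max=None):
    """Sketch of the generic scheme: count edge pixels per log-polar bin
    around a chosen basis point, then normalize the histogram.
    Assumes edge_map contains at least one nonzero (edge) pixel."""
    ys, xs = np.nonzero(edge_map)                  # step 1's output: edge pixels
    dy, dx = ys - center[0], xs - center[1]
    r = np.hypot(dx, dy)                           # distance to the basis point
    theta = np.arctan2(dy, dx) % (2 * np.pi)
    if r_max is None:
        r_max = r.max() + 1e-9
    keep = (r > 0) & (r <= r_max)
    # step 2: circular template split into log-spaced radial x uniform angular bins
    r_bin = np.minimum((np.log1p(r[keep]) / np.log1p(r_max) * n_r).astype(int), n_r - 1)
    t_bin = np.minimum((theta[keep] / (2 * np.pi) * n_theta).astype(int), n_theta - 1)
    hist = np.zeros(n_r * n_theta)
    np.add.at(hist, r_bin * n_theta + t_bin, 1.0)  # step 3: count edges per bin
    return hist / max(hist.sum(), 1.0)             # step 4: normalization
```

With small enough bins, each containing a single pixel, this histogram would describe the shape exactly, as noted below.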

With this approach, if the bins were small enough to each contain one pixel, then the histogram would be an exact description of the shape in the support region.

The HOG descriptor is a member of this family. It was proposed by Dalal and Triggs in [23] and extensively documented by Dalal in his PhD thesis, which was supervised by Triggs [41]. HOG offers a successful and popular object representation, particularly for representing humans [22, 42]. HOG is inspired by SIFT and can thus be regarded as a dense version of SIFT. Both algorithms are based on histograms of gradient orientations weighted by gradient magnitudes. However, there are important differences between them. First, the two algorithms differ slightly in the type of spatial bins they use, as HOG has a more sophisticated binning scheme. Second, HOG computes the descriptors by means of blocks in dense grids at a single scale without orientation alignment, while in SIFT, descriptors are computed at sparse, scale-invariant key image points and rotated to align orientation. Third, SIFT only computes the gradient histogram for patches around specific interest points obtained by taking the difference of Gaussians in scale space (it is a local descriptor, and SIFT features are usually compared by computing the Euclidean distance between them). HOG, on the other hand, is computed for an entire image by dividing it into smaller cells and summing up the gradients over every pixel within each cell (HOG is used to classify patches using classifiers such as SVM). Finally, SIFT is used for the identification of specific objects, since the Gaussian weighting involved enables it to describe the importance of a particular point, while HOG has no such bias. HOG, therefore, is better suited to the classification task than SIFT.

The main idea behind the HOG descriptors is that the appearance and shape of a local object within an image can be described by the distribution of intensity gradients or edge directions. That is to say, the HOG features concentrate on the contrast of silhouette contours against the background. HOG is a window-based descriptor, whereby the window is typically computed by dense sampling over all image points.

In general, the HOG algorithm can be divided into four phases: gradient computation, orientation binning, descriptor blocks, and block normalization. In the first phase, the gradients are computed using Gaussian smoothing followed by discrete derivative masks. The experiments by Dalal and Triggs demonstrated that a simple 1-D $[-1, 0, 1]$ mask with no scale smoothing ($\sigma = 0$) works best. The second phase divides the gradient image into small connected regions called cells (typical cell sizes are $6 \times 6$ or $8 \times 8$ pixels), and within each cell a frequency histogram is computed representing the distribution of edge orientations within the cell. For this purpose, each pixel contributes a weighted function of the gradient magnitude (called a vote; typically the magnitude itself is used) based on the orientation of the gradient element centered on it. The edge orientations are then quantized into $q$ bins uniformly spaced over 0–180° when an unsigned gradient is used, or over 0–360° when a signed gradient is used. In the third phase, groups of adjacent cells are gathered into spatial regions called blocks (typical block sizes are $2 \times 2$ or $3 \times 3$ cells); using an overlap between blocks significantly improves the algorithm's performance. The grouping of cells into a block is the basis for the grouping and normalization of histograms. Two commonly used classes of block geometry are square or rectangular blocks (R-HOG), with spatial cells arranged in squares or rectangles, and circular blocks (C-HOG), partitioned into cells in log-polar fashion. Finally, the blocks defined in the previous phase are normalized. For this purpose, let $v$ be the unnormalized block descriptor, $\|v\|_k$ be its $k$-norm for $k = 1, 2$, and $\epsilon$ be a small normalization constant to avoid division by zero [41]; then the following schemes can be used:

1. $L2$-norm: $v \to v / \sqrt{\|v\|_2^2 + \epsilon^2}$;

2. $L2$-Hys: $L2$-norm followed by clipping (limiting the maximum values of $v$ to 0.2) and renormalizing;

3. $L1$-norm: $v \to v / \left( \|v\|_1 + \epsilon \right)$;

4. $L1$-sqrt: $v \to \sqrt{v / \left( \|v\|_1 + \epsilon \right)}$.
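The four normalization schemes can be written out directly. A minimal NumPy sketch, where the function names and the value of `eps` are illustrative (HOG histograms are non-negative, which the `l1_sqrt` scheme relies on):

```python
import numpy as np

eps = 1e-5  # small constant to avoid division by zero (illustrative value)

def l2_norm(v):
    # v -> v / sqrt(||v||_2^2 + eps^2)
    return v / np.sqrt((v ** 2).sum() + eps ** 2)

def l2_hys(v):
    # L2-norm, clip at 0.2, then renormalize
    w = np.minimum(l2_norm(v), 0.2)
    return l2_norm(w)

def l1_norm(v):
    # v -> v / (||v||_1 + eps)
    return v / (np.abs(v).sum() + eps)

def l1_sqrt(v):
    # v -> sqrt(v / (||v||_1 + eps)); assumes v >= 0 elementwise
    return np.sqrt(l1_norm(v))
```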

The final descriptor is then the vector of all components of the normalized cell responses from all of the blocks in the detection window.
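The full pipeline — gradient computation with $[-1, 0, 1]$ masks, magnitude-weighted orientation voting per cell, overlapping blocks, and $L2$-Hys normalization — can be sketched as follows. This is a simplified NumPy illustration under the typical parameter values mentioned above, not the authors' exact implementation; the function name `hog_descriptor` and its defaults are assumptions.

```python
import numpy as np

def hog_descriptor(img, cell=8, block=2, q=9, eps=1e-5):
    """Minimal HOG sketch: gradients -> cell histograms -> L2-Hys blocks.
    Assumes a 2-D grayscale image whose sides are multiples of `cell`."""
    img = img.astype(float)
    # phase 1: 1-D [-1, 0, 1] derivative masks, no smoothing (sigma = 0)
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0        # unsigned: 0-180 deg
    # phase 2: per-cell histograms of q orientation bins, magnitude-weighted votes
    ny, nx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((ny, nx, q))
    for i in range(ny):
        for j in range(nx):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            a = ang[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            bins = np.minimum((a / (180.0 / q)).astype(int), q - 1)
            for b in range(q):
                hist[i, j, b] = m[bins == b].sum()
    # phases 3-4: overlapping block grouping with L2-Hys normalization
    feats = []
    for i in range(ny - block + 1):
        for j in range(nx - block + 1):
            v = hist[i:i + block, j:j + block].ravel()
            v = v / np.sqrt((v ** 2).sum() + eps ** 2)  # L2-norm
            v = np.minimum(v, 0.2)                      # clip at 0.2
            v = v / np.sqrt((v ** 2).sum() + eps ** 2)  # renormalize
            feats.append(v)
    return np.concatenate(feats)
```

For a 64 × 64 window with 8 × 8-pixel cells, 2 × 2-cell blocks, and 9 bins, this gives 7 × 7 overlapping blocks of 36 components each, i.e., a 1764-dimensional descriptor.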

On the other hand, it is common to find a support vector machine (SVM) classifier complementing a HOG feature extractor within an object recognition scheme. One purpose of this paper is to determine the performance of an object recognition system that uses HOG together with a neural network-based classifier, particularly a multi-layer perceptron (MLP).

The MLP was derived from the neuronal model known as the perceptron, presented by Rosenblatt in 1958 [43]. The perceptron is based on the model of McCulloch and Pitts [44] and uses a learning rule based on error correction. The MLP is an ANN composed of an input layer, *n* hidden layers, and an output layer, all consisting of perceptron-type neurons. The MLP is the most widely studied neural network model and can approximate any continuous nonlinear function arbitrarily well on a compact interval. Due to this property, the MLP became popular for parametrizing nonlinear models and for classification. Furthermore, the MLP is known to cope well with the problem that arises when the data available for training are insufficient to cover the variability of the object's appearance. This problem is present in most recognition systems, including our own.

The MLP belongs to the category of supervised classifiers. Accordingly, the training set, composed of *p* pairs of input–output vectors that define the behavior the ANN must adapt to, is defined as

$$\left\{ \left( \mathbf{x}^1, \mathbf{t}^1 \right), \left( \mathbf{x}^2, \mathbf{t}^2 \right), \dots, \left( \mathbf{x}^p, \mathbf{t}^p \right) \right\} = \left\{ \left( \mathbf{x}^\mu, \mathbf{t}^\mu \right) \mid \mu = 1, 2, \dots, p \right\} \tag{1}$$

where $\mathbf{x}^\mu = \left[ x_i^\mu \right]_n$ and $\mathbf{t}^\mu = \left[ t_k^\mu \right]_q$ are the input and target vectors, respectively.

The MLP structure is defined as follows: it has an input layer of $n$ units, one or more successive hidden layers of intermediate units, and a layer of $q$ output units. Consider an MLP with $r$ hidden layers, where $\mathbf{x}^\mu = \left[ x_i^\mu \right]_n$ is the $\mu$-th input vector belonging to the training set (defined by **Eq. (1)**). Then, $\left( \mathbf{y}^l \right)^\mu = \left[ \left( y_{j^l}^l \right)^\mu \right]_{m_l}$, $\mathbf{z}^\mu = \left[ z_k^\mu \right]_q$, and $\mathbf{t}^\mu = \left[ t_k^\mu \right]_q$ represent the output of the $l$-th hidden layer, the output generated by the MLP, and the target output that the MLP must generate, respectively, when $\mathbf{x}^\mu$ is presented to the network; here $l = (1, 2, \dots, r)$, and $m_l$ and $q$ indicate the number of units comprising the $l$-th hidden layer and the output layer, respectively.

The output of each unit in layer $l$ is connected to the input of each unit in layer $l + 1$, and a synaptic weight is associated with each of these connections. Thus, the synaptic weights and thresholds of the first hidden layer are represented by $\mathbf{W}^1 = \left[ w_{j^1 i}^1 \right]_{m_1 \times n}$ and $\boldsymbol{\theta}^1 = \left[ \theta_{j^1}^1 \right]_{m_1}$, respectively. For the remaining hidden layers, $l = (2, \dots, r)$, the synaptic weights and thresholds are defined as $\mathbf{W}^l = \left[ w_{j^l j^{l-1}}^l \right]_{m_l \times m_{l-1}}$ and $\boldsymbol{\theta}^l = \left[ \theta_{j^l}^l \right]_{m_l}$. Finally, the synaptic weights and thresholds of the output layer are defined as $\mathbf{W}^{r+1} = \left[ w_{k j^r}^{r+1} \right]_{q \times m_r}$ and $\boldsymbol{\theta}^{r+1} = \left[ \theta_k^{r+1} \right]_q$.

The MLP is a feed-forward neural network, and its operation is defined as follows. The output of the first hidden layer is expressed mathematically as:

$$y_{j^1}^1 = f\left( \sum_i w_{j^1 i}^1 \bullet x_i - \theta_{j^1}^1 \right) \tag{2}$$

whereas the output of the $l$-th hidden layer, with $l = (2, \dots, r)$, is computed by

$$y_{j^l}^l = f\left( \sum_{j^{l-1}} w_{j^l j^{l-1}}^l \bullet y_{j^{l-1}}^{l-1} - \theta_{j^l}^l \right) \tag{3}$$

Finally, the MLP operation is defined as:

$$z_k = g\left( \sum_{j^r} w_{k j^r}^{r+1} \bullet y_{j^r}^r - \theta_k^{r+1} \right) \tag{4}$$

Typically, the activation functions of the hidden-layer units, $f(\cdot)$, are nonlinear; e.g., the unipolar sigmoid function $1 / \left( 1 + e^{-x} \right)$ and the bipolar sigmoid function $\left( 1 - e^{-x} \right) / \left( 1 + e^{-x} \right)$. Activation functions of this type introduce the nonlinearity into the network and enable the MLP to approximate any nonlinear function with arbitrary accuracy. The activation functions of the output-layer units, $g(\cdot)$, may be linear or nonlinear, depending on the application.
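Eqs. (2)–(4) amount to a simple forward pass. A minimal NumPy sketch, assuming a unipolar sigmoid for $f$ and a linear $g$; the function names and list-based weight layout are illustrative:

```python
import numpy as np

def sigmoid(x):
    # unipolar sigmoid, 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, weights, thetas, f=sigmoid, g=lambda u: u):
    """Forward pass per Eqs. (2)-(4). `weights` holds W^1 ... W^{r+1}
    (shapes m_1 x n, ..., q x m_r) and `thetas` the matching thresholds."""
    y = x
    for W, th in zip(weights[:-1], thetas[:-1]):    # hidden layers, Eqs. (2)-(3)
        y = f(W @ y - th)
    return g(weights[-1] @ y - thetas[-1])          # output layer, Eq. (4)
```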

The success of the MLP rests on the design of training algorithms that minimize the error committed by the network by adequately and automatically modifying the values of the synaptic weights. In this sense, the training algorithm called backpropagation is the most popular one used to adapt the MLP to a specific application because it is conceptually simple and computationally efficient. In 1974, Paul Werbos developed the basic principles of backpropagation while working on his PhD thesis, implementing a system that estimated a dynamic model for predicting social communications and nationalism [45]. In 1986, Rumelhart et al. formalized the backpropagation algorithm as a method that allows an MLP-type ANN to learn the association between a set of input patterns and the corresponding classes [46]. Over time, backpropagation has become one of the most widely used neural learning methods, proving to be an efficient tool in applications such as pattern recognition, dynamic modeling, sensitivity analysis, and the control of systems over time.

The backpropagation algorithm searches for the minimum of the error function in weight space using the method of gradient descent. This method is applied to train the units of the hidden layers of an MLP; that is to say, the basic idea of the algorithm is that updating the synaptic weights of the units in a layer depends on the error generated by the layer itself and on the errors generated by the following layers. This is established by the mathematical structure of the backpropagation algorithm, which can be expressed as follows:

$$\Delta w_{k j^r}^{r+1} = \varepsilon \bullet \delta z_k \bullet y_{j^r}^r, \tag{5a}$$

where $\delta z_k = \left( t_k - z_k \right) g'\!\left( u_k^{r+1} \right)$ and $u_k^{r+1} = \sum_{j^r} w_{k j^r}^{r+1} \bullet y_{j^r}^r - \theta_k^{r+1}$;

$$\Delta w_{j^l j^{l-1}}^l = \varepsilon \bullet \delta y_{j^l}^l \bullet y_{j^{l-1}}^{l-1}, \tag{5b}$$

where $\delta y_{j^l}^l = f'\!\left( u_{j^l}^l \right) \sum_{j^{l+1}} \delta y_{j^{l+1}}^{l+1} \bullet w_{j^{l+1} j^l}^{l+1}$ and $u_{j^l}^l = \sum_{j^{l-1}} w_{j^l j^{l-1}}^l \bullet y_{j^{l-1}}^{l-1} - \theta_{j^l}^l$, for $l = (2, \dots, r)$; in the topmost hidden layer ($l = r$), the sum runs over the output units, with $\delta y_k^{r+1} = \delta z_k$ and weights $w_{k j^r}^{r+1}$;

$$\Delta w_{j^1 i}^1 = \varepsilon \bullet \delta y_{j^1}^1 \bullet x_i, \tag{5c}$$

where $\delta y_{j^1}^1 = f'\!\left( u_{j^1}^1 \right) \sum_{j^2} \delta y_{j^2}^2 \bullet w_{j^2 j^1}^2$ and $u_{j^1}^1 = \sum_i w_{j^1 i}^1 \bullet x_i - \theta_{j^1}^1$; $\varepsilon$ is a small-valued constant that defines the learning rate of the network.
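For the common case of a single hidden layer ($r = 1$), the updates of Eqs. (5a)–(5c) can be sketched as follows. This is a NumPy illustration assuming sigmoid hidden units and linear outputs (so $g' = 1$); the function name `backprop_step` and the default learning rate are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    # unipolar sigmoid, 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W1, th1, W2, th2, eps=0.1):
    """One in-place weight/threshold update per Eqs. (5a)-(5c) for an MLP
    with r = 1 hidden layer, sigmoid f, and linear g. Returns the output z."""
    # forward pass (Eqs. (2) and (4))
    u1 = W1 @ x - th1
    y1 = sigmoid(u1)
    z = W2 @ y1 - th2                    # linear output: g(u) = u
    # output deltas: dz_k = (t_k - z_k) * g'(u_k), with g' = 1
    dz = t - z
    # hidden deltas: f'(u) = f(u) * (1 - f(u)) for the sigmoid
    dy1 = y1 * (1.0 - y1) * (W2.T @ dz)
    # weight updates (Eqs. (5a) and (5c)); thresholds enter u with a minus
    # sign, so their gradient-descent step has the opposite sign
    W2 += eps * np.outer(dz, y1)
    th2 -= eps * dz
    W1 += eps * np.outer(dy1, x)
    th1 -= eps * dy1
    return z
```

Repeated over the training set of Eq. (1), these updates drive the squared error $(t_k - z_k)^2$ downward, which is the gradient-descent behavior described above.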
