**1. Introduction**

Computer vision is a discipline that enables a machine to recognize the world around it through visual perception, deducing the structure and properties of the three-dimensional world from one or more two-dimensional images. In this respect, Forsyth and Ponce and Ballard and Brown argue that computer vision refers to the construction of explicit and meaningful descriptions of physical objects from images [1, 2]. That is, computer vision enables a machine to extract and analyze the spectral, spatial, and temporal information of the different objects contained within an image. Spectral information includes frequency (color) and intensity (grayscale); spatial information refers to aspects such as shape and position (in one, two, or three dimensions); and temporal information comprises stationary aspects (presence and/or absence) and time-dependent aspects (events, movements, and processes).

Thanks to this ability, a computer vision system can perform tasks such as failure detection [3, 4], verification [5, 6], identification [7, 8], tracking analysis [9, 10], and recognition [11, 12].

The object recognition task is of particular interest to this research, as it plays a significant role in a computer vision system and is even a prerequisite for some of the tasks listed above. A growing number of areas and applications require object recognition, e.g., fruit sorting [13], face detection [14], people detection [15], face recognition [16], object tracking [17], automatic traffic sign recognition [18], and vehicle license plate recognition [7], among many others. Gonzalez and Woods define object recognition as the task of organizing input data into previously defined classes, using significant features extracted from objects that are immersed in an image containing irrelevant details [19]. Given this definition, it is evident that both feature extraction and classification are essential for an object recognition task to achieve its aim. The feature extraction process applies operations to an image to obtain information describing the objects it contains; moreover, this information should be able to discriminate between different object classes. The goal of this process is to improve the effectiveness and efficiency of the classification process [20]. The classification process uses the information generated by the feature extractor to carry out its two phases: learning, which creates a bank of models, and recognition, which determines which object from the bank of models is present in the analyzed image [21].

The development of the work presented in this paper is motivated by the increasing demand for techniques and algorithms that efficiently perform object recognition tasks, as well as by their implementation in real applications, both industrial and scientific, where a visual perception system is required. Accordingly, this work proposes the design and implementation of an object recognition system based on a feature descriptor and a neural classifier. We focus on descriptors that, in their original form, characterize shape in 2D images as histograms of edge pixels; the HOG (Histograms of Oriented Gradients) algorithm belongs to this family of descriptors [22]. Two practical objectives have been defined to meet this proposal: first, to study and develop an improved HOG algorithm that can accurately represent different objects; second, to demonstrate that the performance of a neural approach in classifying and labeling the data generated by HOG is competitive with currently used techniques.

The HOG method for object representation, proposed by Dalal and Triggs [23], describes the appearance of local regions (objects) within an image by means of the distribution of intensity gradients or edge directions. To this end, HOG applies a principle similar to that used by other methods, such as edge orientation histograms [24], shape contexts [25], and the scale-invariant feature transform (SIFT) [26]: it counts the occurrences of gradient orientations in specific portions of an image. It differs, however, in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

*Multi-Object Recognition Using a Feature Descriptor and Neural Classifier DOI: http://dx.doi.org/10.5772/intechopen.106754*

HOG has demonstrated that it can generate normalized object representations that provide discriminative information about the objects in an image. Since it operates on local cells, it is invariant to geometric and photometric transformations. For improved accuracy, the local histograms can be contrast-normalized by calculating a measure of the intensity across a larger region of the image, called a block, and then using this value to normalize all cells within the block. This normalization results in better invariance to changes in illumination and shadowing. Furthermore, HOG is invariant to changes in the data background as well as in object position.
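The gradient, cell-histogram, and block-normalization steps just described can be sketched in a few lines of NumPy. This is a minimal illustration under our own parameter choices (cell size, bin count, block size, L2 normalization), not the implementation used in this chapter:

```python
import numpy as np

def hog_descriptor(image, cell_size=8, bins=9, block_size=2, eps=1e-6):
    """Minimal HOG sketch: per-cell orientation histograms, normalized per block.

    `image` is a 2-D grayscale float array whose sides are multiples of
    `cell_size`. Parameter names and defaults are illustrative only.
    """
    # 1. Intensity gradients via centered differences.
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    # Unsigned gradient orientation in [0, 180) degrees.
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0

    # 2. One orientation histogram per cell on a dense, uniform grid,
    #    with each pixel voting with its gradient magnitude.
    n_cy, n_cx = image.shape[0] // cell_size, image.shape[1] // cell_size
    hist = np.zeros((n_cy, n_cx, bins))
    bin_width = 180.0 / bins
    for i in range(n_cy):
        for j in range(n_cx):
            sl = (slice(i * cell_size, (i + 1) * cell_size),
                  slice(j * cell_size, (j + 1) * cell_size))
            b = (orientation[sl] // bin_width).astype(int) % bins
            np.add.at(hist[i, j], b.ravel(), magnitude[sl].ravel())

    # 3. Contrast-normalize overlapping blocks of cells (L2 norm), then
    #    concatenate all block vectors into the final descriptor.
    blocks = []
    for i in range(n_cy - block_size + 1):
        for j in range(n_cx - block_size + 1):
            v = hist[i:i + block_size, j:j + block_size].ravel()
            blocks.append(v / np.sqrt(np.sum(v ** 2) + eps ** 2))
    return np.concatenate(blocks)
```

With these defaults, a 32 × 32 image yields a 4 × 4 grid of cells and 3 × 3 overlapping 2 × 2 blocks of 36 values each, i.e., a 324-dimensional descriptor that can then be handed to a classifier.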

Although the HOG method is particularly suited for human detection in images [11, 12, 23, 27–29], in this paper we show that it can represent different objects accurately and even performs well in multi-class applications.

On the other hand, artificial neural networks (ANN) have been successfully used in a variety of classification tasks in real industrial, scientific, and business applications [13, 30–33]. Several kinds of neural networks can be used for this purpose, but we decided to use the feedforward multi-layer network, or multi-layer perceptron (MLP), which is the most widely studied and used neural network classifier.

MLP offers features that make it a competitive alternative to various conventional classification methods [34]. First, adaptability: MLP is capable of developing its own feature representation; that is, it can organize data into the vital aspects or features that enable one pattern to be distinguished from another. Second, generalization: MLP has the ability to respond appropriately to input patterns different from those involved in the learning process. Third, universal approximation: MLP can approximate any continuous function with arbitrary accuracy. Fourth, nonlinearity: since MLP uses nonlinear activation functions, it is a nonlinear model, which makes it flexible in modeling complex real-world relationships.
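The nonlinearity property can be illustrated with a toy experiment: the NumPy sketch below trains a minimal one-hidden-layer MLP with sigmoid activations on the XOR problem, which no linear classifier can solve. All layer sizes, the learning rate, and the iteration count are our own illustrative choices, not values used in this chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: four patterns that are not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer; sizes and learning rate are illustrative.
W1, b1 = rng.normal(0.0, 1.0, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0.0, 1.0, (8, 1)), np.zeros(1)
lr = 0.5

# Squared-error loss before any training, for comparison.
loss_init = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)      # hidden activations
    out = sigmoid(h @ W2 + b2)    # network output
    # Backpropagate the squared-error gradient through both layers.
    d_out = (out - y) * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

loss_final = np.mean((sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) - y) ** 2)
pred = (sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2) > 0.5).astype(int)
```

The training loop drives the squared error well below its initial value, showing the network learning a decision boundary that a single linear layer could not form.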

### **1.1 Related work**

Similar works with which our proposal has been compared, because they use descriptors with characteristics similar to ours, are described below. In [35], the object model is generative and probabilistic: appearance, scale, shape, and occlusion are all modeled by probability density functions, which here are Gaussians. Learning is carried out using the expectation-maximization algorithm, presented in [36], which iteratively converges to learn an object category by detecting regions and their scales and then estimating the parameters of the above densities from these regions, such that the model gives a maximum-likelihood description of the training data. Recognition proceeds by first detecting features and then evaluating these features through a Bayesian process, using the model parameters estimated during learning.

Zhang et al. proposed a scheme that represents the local features of an object by means of the PCA-SIFT method and its global features using the shape context method [37]. Both sets of features are presented to a two-layer AdaBoost training network. Boosting refers to the general method of producing a very accurate prediction rule by combining relatively inaccurate rules of thumb; it has been widely used in computer vision, particularly for object recognition. Layer 1 chooses as "good" features those that best discriminate the target object class from the nontarget object class. Then, layer 2 locates the final "good" features based on the distances between the most discriminant local features selected by layer 1. This two-layered boosting method produces two strong classifiers, which can then be used in a cascade for recognition tasks.

On the other hand, in [38] Zhang et al. presented a scheme that represents images as distributions (signatures or histograms) of features extracted with different keypoint detectors and descriptors. The proposed scheme represents an object by the union of a detector with a descriptor. Two complementary local region detectors are used: the Harris-Laplace detector, which responds to corner-like regions, and the Laplacian detector, which extracts blob-like regions. At the most basic level, these two detectors are invariant to scale transformations only. To achieve invariance to other transformations, such as rotation or illumination, the scheme may include the SIFT, SPIN, and RIFT descriptors. For the classification process, the authors use the well-known Support Vector Machines.

#### **Table 1.**

*Main features summary of the mentioned schemes and our proposal.*

In [39], Leibe et al. propose a novel method for detecting and localizing objects of a visual category in cluttered real-world scenes. For object representation, this proposal introduces a model known as the Implicit Shape Model (ISM). ISM consists of a class-specific alphabet of local appearances that are prototypical for the object category, and of a spatial probability distribution that specifies where each codebook entry may be found on the object. Object detection is then implemented as a probabilistic Hough voting procedure, from which hypotheses are found by a scale-adaptive Mean-Shift search.

More recently, Laptev showed how histogram-based image descriptors can be combined with a boosting classifier to provide a robust object detector [40]. Each feature is represented by a histogram of local image measurements within a region. For this purpose, HOG features are adopted, and histograms of alternative image measurements, such as color and second-order image derivatives, are also considered. The HOG features are formed from the orientations of local image gradients at each point, using Gaussian derivatives of the image computed for a defined scale parameter. The histograms are normalized to unit L1 norm. During training, the features for normalized training images are computed, and AdaBoost is applied to select a set of features and the corresponding weak classifiers that optimize classification performance.

The overall organization of the paper is as follows: after this introduction, Section 2 examines theoretical issues of HOG and MLP, Section 3 presents the object recognition system based on HOG and MLP proposed in this paper, experimental results are discussed in Section 4, and Section 5 concludes the paper.

**Table 1** shows a summary of the main features of the mentioned schemes and of the one proposed in this paper.
