**4. Multiple objects recognition**

Multiple objects of the same class can be recognised within a cluttered input image by the G-HONN-type filters, owing to the shift-invariance properties inherited from the correlator unit. Thus, in the M-HONN system all the training set images pass through the NNET unit. This time, instead of multiplying each training image with its corresponding weight connections (mask), as for the C-HONN filter, we keep the weight connection values constant, setting them equal to those of a randomly chosen image from the training set. All the test set images are multiplied with the same randomly chosen image's weight connection values. Then the training set images, after being transformed (masked) through the NNET unit by multiplication with the mask, pass through the correlator unit, where they are correlated with the masked test set images. In effect, the cross-correlation of each masked test set image with the transformed training set images (reference kernel) returns an output correlation plane peak value for each cross-correlation step. Hence, the maximum peak height values of the output correlation plane correspond to the recognised true-class objects.
Thus, the M-HONN system's transfer function is formulated as follows:

$$\mathbf{S}\_{i}\left(m,n\right) = \Gamma\_{ci} \times \mathbf{X}\_{i}\left(m,n\right), \quad i = 1, 2, \dots, N \tag{22}$$

$$\mathbf{M\text{-}HONN} = \sum\_{i=1}^{N} a\_{i}\, \mathbf{S}\_{i}\left(m,n\right) \tag{23}$$

In Eq. (23) we have chosen to constrain the correlation peak height values as we did with the constrained-HONN (C-HONN) system's implementation, but we can also easily re-write the system's transfer equation for the case of the unconstrained peak height values, as with the unconstrained-HONN (U-HONN) system's implementation (Mahalanobis, 1994; Kypraios et al., 2004b).

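The masking-and-correlation procedure described above can be sketched numerically. The snippet below is a minimal illustration only: random arrays stand in for the training images, and the "weight connections" of a randomly chosen training image are modelled by that image's pixel values, which is an assumption for demonstration rather than the system's actual NNET weights. The energy-normalised peak of the circular cross-correlation then picks out the true-class object.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for grey-scale training images; the mask is taken from a
# randomly chosen training image (its pixels model the weight-connection
# values -- an illustrative assumption, not the trained NNET weights).
train = [rng.random((64, 64)) for _ in range(3)]
test = train[1] + 0.05 * rng.random((64, 64))   # noisy view of object 1

mask = train[rng.integers(len(train))]          # fixed, randomly chosen mask
masked_train = [img * mask for img in train]    # transformed (masked) training set
masked_test = test * mask                       # test image through the same mask

def corr_peak(a, b):
    """Energy-normalised peak of the circular cross-correlation of a with b."""
    a = a - a.mean()
    b = b - b.mean()
    plane = np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))).real
    return plane.max() / (np.linalg.norm(a) * np.linalg.norm(b))

# The maximum correlation-plane peak indicates the recognised true-class object.
peaks = [corr_peak(masked_test, img) for img in masked_train]
best = int(np.argmax(peaks))
```

The FFT-based correlation makes the sketch shift invariant in the same sense as the correlator unit: a circularly shifted masked scene yields the same peak height at a shifted location.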
#### **4.1 Modified NNET block architecture for recognition of multiple objects of different classes**

As for all the HONN-type systems (Kypraios et al., 2004; Kypraios et al., 2003; Kypraios et al., 2009), in the M-HONN system's NNET block (unit) there is a single input source used for all the input data. Assuming we have N = 3 input still images or video frames of size 256×256 pixels, the input source consists of 65,536, i.e. 1×(256×256), input neurons, equal to the size of each training image or frame in vector form. Each layer needs, by definition (Hagan et al., 1996), to have the same input connections to each of its hidden neurons. Therefore, the NNET architecture shown is referred to as N+1 = 3+1 = 4, i.e. four-layered, since there are N = 3 hidden neurons (though shown aligned under each other, they do not belong to the same hidden layer but rather form three separate hidden layers, each of a single hidden neuron) and one output layer. Each of the hidden layers consists of only one hidden neuron. The input layer does not contain neurons with activation functions and so is omitted in the numbering of the layers. Though the network is initially fully connected to the input layer during the training stage, only one hidden layer is connected for each training image presented to the NNET. The NNET is thus not a contiguous three-layer network during training, which is why the distinction is made. In effect, neuron 1 is trained with the training still image or video frame x<sup>1</sup>, neuron 2 is trained with the training still image or video frame x<sup>2</sup>, and so on, ending with neuron N being trained with the training still image or video frame x<sup>N</sup>. Thus, the number of the input weights increases proportionally to the size of the training set:

$$\mathbf{N}\_{\rm iw} = \mathbf{N} \times \left[ \mathbf{m} \times \mathbf{n} \right] \tag{24}$$

where N<sub>iw</sub> is the number of the input weights, N is the size of the training set, equal to the number of the training images, and [m×n] is the size of each image of the training set.
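As a quick check of Eq. (24), the input-weight count for the N = 3 images of 256×256 pixels mentioned above can be computed directly (a trivial sketch):

```python
# Eq. (24): N_iw = N x [m x n] input weights.
N, m, n = 3, 256, 256          # three 256 x 256 training images, as in the text
N_iw = N * (m * n)
print(N_iw)                    # 3 x 65,536 = 196,608 input weights
```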

Fig. 4 shows the modified NNET block architecture for accommodating multiple objects of more than one class. As for all the family of G-HONN filters, the NNET is implemented as a feedforward multi-layer architecture trained with a backpropagation algorithm. It has a single input source of input neurons equal to the size of the training image or video frame in vector form. In effect, for a training still image or video frame x<sup>i</sup>, i = 1…N, of size [m×n], there are [m×n] input neurons in the single input source. The input weights are fully connected from the input layer to the hidden layers. There are N<sub>iw</sub> input weights, proportional to the size of the training set. The number of the hidden layers, N<sub>l</sub>, is equal to the number of the images or video frames of the training set, N:

$$\mathbf{i} = \mathbf{1}, \mathbf{2}, \mathbf{3}, \dots, \mathbf{N} \quad \text{and} \quad \mathbf{N}\_{\rm l} = \mathbf{N} \tag{25}$$

We have set each hidden layer to contain a single neuron. The layer weights are fully connected to the output layer. Now, the number of the layer weights, N<sub>lw</sub>, is equal to:

$$\mathbf{N}\_{\rm lw} = \mathbf{N} \times \mathbf{N}\_{\rm opn} \quad \text{and} \quad \mathbf{N}\_{\rm opn} = \mathbf{N}\_{\rm classes} \tag{26}$$

where N<sub>opn</sub> is the number of the output neurons and N<sub>classes</sub> is the number of the different classes. In effect, we have augmented the output layer by adding more output neurons, one for each different class. In Fig. 4 we assume N<sub>classes</sub> = 2. Thus:

$$\mathbf{N}\_{\rm opn} = \mathbf{N}\_{\rm classes} = \mathbf{2} \quad \text{so, there are} \quad \mathbf{N}\_{\rm lw} = \mathbf{N} \times \mathbf{2} \tag{27}$$

and


$$\mathbf{N}\_{\text{class1 lw}} = \mathbf{N} \quad \text{and} \quad \mathbf{N}\_{\text{class2 lw}} = \mathbf{N} \tag{28}$$

where N<sub>class1 lw</sub> and N<sub>class2 lw</sub> are the layer weights corresponding to object class 1 and object class 2, respectively. There are bias connections to each one of the hidden layers:

$$\mathbf{N}\_b = \mathbf{N} \tag{29}$$

where N<sub>b</sub> is the number of the bias connections. There are N<sub>target w</sub> target connections from the N<sub>opn</sub> output neurons of the output layer:

$$\mathbf{N}\_{\text{target w}} = \mathbf{N}\_{\text{opn}} \tag{30}$$
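Taken together, Eqs. (24)–(30) fix every connection count of the modified NNET block once N and N<sub>classes</sub> are chosen. A small bookkeeping sketch (illustrative values: ten training images of 256×256 pixels and two object classes) makes the totals explicit:

```python
# Connection counts of the modified NNET block, following Eqs. (24)-(30).
# Illustrative configuration: N = 10 training images of 256 x 256 pixels,
# N_classes = 2 object classes.
N, m, n = 10, 256, 256
N_classes = 2

N_iw = N * (m * n)          # Eq. (24): input weights
N_l = N                     # Eq. (25): one single-neuron hidden layer per image
N_opn = N_classes           # Eq. (26): one output neuron per class
N_lw = N * N_opn            # Eq. (26): layer weights to the output layer
N_class1_lw = N             # Eq. (28): layer weights serving class 1
N_class2_lw = N             # Eq. (28): layer weights serving class 2
N_b = N                     # Eq. (29): one bias connection per hidden layer
N_target_w = N_opn          # Eq. (30): target connections

print(N_iw, N_l, N_lw, N_b, N_target_w)   # 655360 10 20 10 2
```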

Fig. 4. Modified NNET block of the M-HONN system for multiple objects recognition of different classes

Thus, now for N<sub>classes</sub> = 2 there will be N transformed images created for class 1 and N transformed images created for class 2. Both sets of transformed images are then used for the synthesis of the system's composite image. The M-HONN system for multiple objects recognition of different class objects is written as follows:

$$\mathbf{M\text{-}HONN} = \sum\_{i=1}^{N\_{classes} \times N} a\_{i}^{class}\, \mathbf{S}\_{i}^{class}\left(m,n\right) = a\_{1}^{class}\, \Gamma\_{c1} \times \mathbf{X}\_{1}\left(m,n\right) + a\_{2}^{class}\, \Gamma\_{c2} \times \mathbf{X}\_{2}\left(m,n\right) + \dots + a\_{N}^{class}\, \Gamma\_{cN} \times \mathbf{X}\_{N}\left(m,n\right) \tag{31}$$

or in the frequency domain the above equation is re-written as:

$$\mathbf{M\text{-}HONN} = \sum\_{i=1}^{N\_{classes} \times N} a\_{i}\, \mathbf{S}\_{i}^{class}\left(u,v\right) \tag{32}$$

The above Eq. (31) in the spatial domain and Eq. (32) in the frequency domain describe the M-HONN system's transfer function for multiple objects recognition (where the superscript *class* is used for the class index, i.e. for Fig. 4 we have *class = class1, class2*). Thus, the M-HONN filter (robust object recognition system) is composed of a non-linear space domain superposition of the training set images, or of the video frames of the training set video sequences. As for all the HONN-type systems, the multiplying coefficient now becomes a non-linear function of the input weights and the layer weights, rather than a simple linear multiplying constant as used in a constrained linear combinatorial-type filter synthesis procedure. The non-linear M-HONN system is inherently shift invariant and may be employed in an optical correlator as would a linear superposition constrained-type filter, such as the synthetic discriminant function (SDF)-type filters (Bahri & Kumar, 1988). It may be used as a space domain function in a joint transform correlator architecture, or be Fourier transformed and used as a Fourier domain filter in a 4-f Vander Lugt (Vander Lugt, 1964) type optical correlator.

**5. Performance analysis**

We have constructed a data set of input images of an S-type Jaguar car model at 10° increments of out-of-plane rotation at an elevation angle of approximately 45°, to be used with the M-HONN system. A second set of images was constructed for the Police car model Mazda Efini RX-7 at the same elevation angle to serve as the out-of-class data for discrimination tests (see Fig. 5). A third data set was created from the background images of typical car parks (see Fig. 6), with the images of the S-type car model and the Mazda RX-7 car model added into the background scene. The size of all the images was 256×256 pixels and all the images are in grey-scale bitmap format. All the input training images (and all the input test set images) for the M-HONN system are concatenated row-by-row into a vector of size 1×(256×256) prior to input to the NNET block. Normally, this size of image is impossibly large for processing by any artificial neural network architecture, since it would need to be implemented with enough input and layer weights:

$$\mathbf{N}\_{\rm iw} = 10 \times \left[ 256 \times 256 \right] = 10 \times 65{,}536 = 655{,}360 \tag{33}$$

Thus, for a training set of N = 10 individual vectors of size 256×256, there would, in total, be more than half a million input weight connections needed. To overcome this problem we developed a novel selective weight connection architecture (see Section 2). Also, applying the heuristic training algorithm with momentum and an adaptive learning rate in the NNET training session (Nguyen & Widrow, 1989; Nguyen & Widrow, 1990) has speeded up the learning phase and reduced the memory size needed to complete the training session fully. Here, it is worth mentioning that the NNET block and, overall, the M-HONN system is able to process input still images and video frames for all the test series in a few msec with a Dual Core CPU at 2.4 GHz and 4.0 GB RAM. Additionally, due to the
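The per-class composite synthesis of Eqs. (31)–(32) can be sketched in a few lines of NumPy. This is a shape-level illustration only: the images, masks and coefficients below are random placeholders, whereas in the M-HONN system the coefficients a<sub>i</sub><sup>class</sup> are a non-linear function of the trained input and layer weights.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes only: N = 4 training images of 32 x 32 pixels, 2 classes.
N, m, n = 4, 32, 32
classes = ("class1", "class2")

X = rng.random((N, m, n))                 # training images X_i(m, n)
Gamma = rng.random((N, m, n))             # NNET masks Gamma_ci (placeholders)
a = {c: rng.random(N) for c in classes}   # coefficients a_i^class (placeholders)

# Eq. (31): non-linear space-domain superposition, accumulated per class and
# then over both classes to form the composite filter.
S = {c: sum(a[c][i] * Gamma[i] * X[i] for i in range(N)) for c in classes}
composite = sum(S[c] for c in classes)

# Eq. (32): the same filter carried into the frequency domain, as it would be
# used in a 4-f Vander Lugt-type optical correlator.
composite_f = np.fft.fft2(composite)
```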
