**7. Results**


[Jin et al. (2009); Lai et al. (2007); Theocharides et al. (2006); Wei et al. (2004); Zemčík & Žádník (2007)]. While the object detection algorithms are in principle the same for software and hardware implementations, the hardware platform offers features largely different from the software ones. The optimal methods needed to implement detection in programmable hardware are therefore often different from those used in software and, in many cases, the hardware implementation may be very efficient.

The key features that are important for object detection and that differ between hardware and software, being beneficial for hardware implementation, include:

• simple implementation of bit manipulation and logical functions
• nearly seamless implementation of complex control and data flow
• massive parallelism achievable with a good performance/electrical-power ratio
• variable data path width, adjustable in hardware to exact algorithmic needs

Of course, the hardware implementation also has severe limitations, the most important being:

• expensive computational resources for complex mathematical functions
• limited complexity of the hardware circuits
• relatively limited memory structures
• in most cases, a lower clock speed compared to processors

Taking into account the above advantages and limitations of programmable hardware, it can be considered for object detection designed specifically for the following cases:

• a low-end computational-power embedded system with programmable hardware as a co-processor; in this setup, it is expected that the programmable hardware performs more or less the complete detection task;

• a high-end computational system with programmable hardware as a pre-processing unit; this setup differs from the above one in that the detection does not have to be done completely in programmable hardware; rather, the hardware is considered a resource to relieve the processor of the host system of as much computation as possible, so it is feasible to implement a perhaps incomplete but high-performance pre-processor that reduces the need for computations;

• a complete object detection system in programmable hardware that can be combined with image pre-processing, and where the complete detection task, along with some image data flow considerations, should be implemented.

Based on the above methods of exploitation, the methods of implementing object detection in programmable hardware can be subdivided into complete detection and pre-processing.

Complete object detection in programmable hardware is typically feasible to implement using a sequential engine, possibly microprogrammable, which performs detection location by location and weak classifier by weak classifier until a decision is reached. As the evaluation of each weak classifier is relatively complex, the operation of the sequential unit is pipelined, so that several instances can run in parallel. At the same time, different locations, in general, require different numbers of weak classifiers to be evaluated.
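The location-by-location, weak-classifier-by-weak-classifier evaluation with early termination described above can be sketched in software terms. This is an illustrative sketch only, not the authors' implementation: the weak classifiers are placeholders, and the two decision thresholds (`theta_a`, `theta_b`) for WaldBoost-style early decisions are assumptions.

```python
# Illustrative sketch of a sequential detection engine: each location is
# evaluated weak classifier by weak classifier until a decision is reached.
# Weak classifiers, thresholds and feature extraction are placeholders.

def evaluate_location(weak_classifiers, window, theta_a=2.0, theta_b=-2.0):
    """Accumulate weak responses; stop as soon as a decision can be made."""
    score = 0.0
    for h in weak_classifiers:
        score += h(window)
        if score >= theta_a:       # confident acceptance, stop early
            return True, score
        if score <= theta_b:       # confident rejection, stop early
            return False, score
    return score >= 0.0, score     # full classifier had to be evaluated

def detect(weak_classifiers, windows):
    """Scan all locations; different locations need different numbers of
    weak classifiers, which is what a pipelined hardware unit exploits."""
    return [w for w in windows if evaluate_location(weak_classifiers, w)[0]]
```

Because the loop lengths differ per location, a hardware engine pipelines several such evaluations so that the datapath stays busy even when one location terminates early.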

### **7.1 Classifier cost minimization**

This section gives an example of optimizing classifier performance by balancing the amount of computation between a fast hardware pre-processing unit and a software post-processing unit. The classifiers used in this experiment were face detectors composed of 1000 weak hypotheses with LBP features and different false negative error rates (in a range from 0.02 to 0.2).
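As a reminder of what an LBP weak hypothesis operates on, the following sketch computes a plain 8-bit local binary pattern code at the pixel level. Note this is the textbook pixel variant; the chapter's detectors may well use a block-based LBP over multi-pixel averages.

```python
def lbp_code(img, x, y):
    """8-bit LBP: compare the 8 neighbours of (x, y) against the centre
    pixel and pack the comparison results into one byte."""
    c = img[y][x]
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                  (1, 1), (1, 0), (1, -1), (0, -1)]   # clockwise offsets
    code = 0
    for bit, (dy, dx) in enumerate(neighbours):
        if img[y + dy][x + dx] >= c:
            code |= 1 << bit
    return code
```

A weak hypothesis then maps the resulting code (one of 256 values) to a real-valued response via a lookup table, which is what makes LBP features attractive for hardware: only comparisons, bit packing and a small memory are needed.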

As a baseline, a software implementation working on an integral image was selected, as it is the standard way of implementing the detection. The other implementations used in the experiments were an SSE implementation that evaluates features one by one (SSE-A) and an SSE implementation that evaluates 16 weak hypotheses at a time (SSE-B).
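The integral-image baseline relies on the standard summed-area table, which turns any rectangular sum into four table lookups. A minimal sketch of the construction and lookup:

```python
def integral_image(img):
    """Summed-area table with a zero border: ii[y][x] holds the sum of
    img[0..y-1][0..x-1], so lookups need no boundary checks."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]                      # running sum of this row
            ii[y + 1][x + 1] = ii[y][x + 1] + row  # add sum of rows above
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of img[y0..y1-1][x0..x1-1] in O(1): four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
```

The table is built once per image in a single pass, after which every rectangle-based feature evaluation costs a constant number of memory accesses regardless of rectangle size.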

The cost of the hardware unit was selected according to the area on the chip taken by the design. We set the cost to a constant *c<sub>i</sub>* = 1/*m*, where *m* is the maximal number of hypotheses that can fit in the circuit; in this experiment, we use *m* = 50. In general, setting the cost to a low value says that the cost of the hardware unit is not of much interest to us; conversely, setting it to a large value says that the cost of the hardware is very important. The cost of the post-processing unit was calculated from the measured processing time of the implementations on a standard PC, and it corresponds to microseconds per weak hypothesis. The cost values are summarized in Table 1. With the selected costs, the optimization minimizes circuit area and, at the same time, the amount of computation in software. Because such diverse cost measures are combined, the result given by the optimization can be viewed as a "relative cost", whose absolute value might be somewhat problematic to interpret. This does not, however, matter too much, as we do not care about the absolute value of the cost but about the position of the minima.

| Implementation | Cost per weak hyp. |
|---|---|
| INTEGRAL (ref.) | 0.215 |
| SSE-A | 0.110 |
| SSE-B | 0.070 |
| FPGA | 0.002 |

Table 1. Costs of weak hypotheses evaluation in different implementations of detection runtime used in the experiment.

Fig. 7. Optimization results for classifiers with different false negative rates. Each plot shows the total cost of the composition of an FPGA with a software implementation; the division point is on the horizontal axis and the total cost on the vertical axis.

Figure 7 shows four plots of the total cost for different classifiers. Each plot shows the value of the total cost for different settings of the classifier division point, and each curve corresponds to a particular combination of FPGA and software implementation. The results of optimization for a classifier with *α* = 0.02 are summarized in Table 2. The *Division* column shows the division of the classifier between hardware and software units; the *Best cost* column reflects the relative cost of the best solution; and the *Computations* column shows the fraction of computations performed in hardware and software units.

| Implementation | Division | Best cost | Computations |
|---|---|---|---|
| Integral | 0/1000 | 1.56 | 0/1 |
| SSE-A | 0/1000 | 0.80 | 0/1 |
| SSE-B | 0/1000 | 1.24 | 0/1 |
| FPGA+Integral | 16/977 | 0.51 | 0.87/0.13 |
| FPGA+SSE-A | 11/988 | 0.38 | 0.78/0.22 |
| FPGA+SSE-B | 14/984 | 0.41 | 0.85/0.15 |

Table 2. Summary of results for classifier with LBP features and *α* = 0.02.
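The division-point search itself is simple enough to sketch. The model below is a plausible formalization consistent with the costs above, not the chapter's exact one: the first *u* hypotheses run in hardware at a fixed area cost of 1/*m* each, and the remaining ones run in software at the Table 1 per-hypothesis cost, weighted by the probability that the stage is actually executed. The geometric stage-survival probabilities are synthetic, not the measured ones.

```python
# Brute-force search for the classifier division point u that minimizes
# total cost = hardware area + expected software computation.
# The survival probabilities p[i] are a synthetic assumption here.

def best_division(n, m, c_sw, p):
    """Return u in [0, n] minimizing u/m + c_sw * sum of p[u:]."""
    def total_cost(u):
        area = u / m              # hardware area cost, 1/m per hypothesis
        sw = c_sw * sum(p[u:])    # expected software work beyond stage u
        return area + sw
    return min(range(n + 1), key=total_cost)

n, m = 1000, 50
p = [0.86 ** i for i in range(n)]           # synthetic stage execution prob.
u_integral = best_division(n, m, 0.215, p)  # software cost: integral baseline
u_sse_a = best_division(n, m, 0.110, p)     # software cost: SSE-A
```

The sketch reproduces only the qualitative behaviour reported in Table 2: a faster software implementation shifts the optimum toward a shorter hardware unit, and the optimal division stays far below the hardware capacity *m*.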

Real-Time Algorithms of Object Detection Using Classifiers 243

This example shows that it can be beneficial to use a combination of several implementations of detection instead of one. It turns out that using a hardware pre-processing unit improves the detection performance (in terms of computational effort). Additionally, improving the performance of the software part allows shorter classifiers to be used in hardware. This is important, as FPGAs (especially the cheaper ones) typically have limited resources, and it can be impossible to fit longer classifiers in them. Even higher performance could be achieved by using a neighborhood suppression method, which would affect the stage execution probability *p* in the optimization; this would result in shorter pre-processing units and a lower total cost.

An application of such classifier optimization is, for example, in the field of smart camera design. The pre-processing module can be placed directly in the camera, which then outputs, besides the normal image, an image with the potential occurrences of target objects. Such information, as the above example has shown, dramatically decreases the required computation time in the post-processing module.

### **7.2 Neighborhood suppression results**

The suppression of neighboring positions was tested on the standard frontal face MIT+CMU dataset. Three WaldBoost classifiers with target false positive rates of 0.01, 0.05 and 0.2 were trained for four types of image features: LRD, LRP, LBP and Haar. For each classifier, three neighborhood suppression strategies were trained, again with target false positive rates of 0.01, 0.05 and 0.2. Comparing the results of the combinations allows us to evaluate whether using neighborhood suppression is more effective than simply using a WaldBoost classifier with a higher false positive rate. The results of this experiment in Fig. 8 clearly show that neighborhood suppression is indeed effective: on average, it evaluates fewer weak hypotheses per image position for the same accuracy.

### **7.3 EnMS results**

EnMS was evaluated on a face localization task. The dataset was downloaded from the Flickr groups *portraits* (training) and *just\_faces* (testing). The dataset contains 84,251 training and 6,704 testing near-frontal faces. The images were rescaled to a 100 × 100 pixel resolution, with the face approximately 50 × 50 pixels large and positioned in the middle. Both WaldBoost and

