(e.g. multiple detections in very close image locations), or simply through the sharing of intermediate results of the calculations. These aspects of object detection are addressed in this chapter as well.

The structure of the chapter is as follows. The next section gives a brief introduction to object detection with classifiers. Section 3 discusses properties of features extracted from images and describes feature types often used for rapid object detection. Section 4 describes the ideas behind the AdaBoost and WaldBoost learning procedures. Acceleration methods for WaldBoost-based detection are introduced in Section 5. Implementation of the detection runtime on different platforms is discussed in Section 6. Some results of the detection acceleration are presented in Section 7, and finally we conclude in Section 8 with some ideas for future research.

**2. Object detection with classifiers**

Classifiers are suitable for deciding whether a sub-image is an image of the object of interest or not. Such functionality is obviously of interest for object detection but it is not sufficient on its own. The reason is that for reliable classification, the variability of the objects of interest has to be minimized - the classifiers are trained to detect well-aligned, centered and size-normalized objects in the classified sub-image. Therefore, the actual detection of objects is performed through a classification of the contents of all the sub-windows that can contain the object of interest, or simply through classification of all the possible sub-windows. This is usually performed by *scanning* the image with a moving window of a fixed size where the content of the window is classified for each location and, if the object of interest is found, the location is considered the output of the detection process.

Fig. 1. Scanning the image with a classifier. Individual sub-images of the image are classified by a classifier (Image source: BioID dataset).

The above described approach involves, in fact, an exhaustive search for an object of interest in the image, where all the sub-images are classified in order to decide whether they contain an object of interest or not. While the classification process is in general quite simple (as shown in more detail below), sometimes it might be feasible to pre-process the analyzed image in order to identify the image parts where the object(s) of interest cannot be present; such parts of the image can be excluded from the classification process and the computational effort can be reduced. Good examples of such an approach are color-based pre-processing, where e.g. a flower cannot be present in a part of the image that contains "completely blue sky", or a human face cannot be found in a part of an image that does not contain "skin color".

**3. Efficient feature extraction**

The performance of the object detection is to a large part influenced by the underlying feature extraction methods. Features extracted from an image have two main properties: *a)* descriptive power and *b)* computational complexity. The goal in rapid object detection is to use computationally simple and, at the same time, descriptive features. In the vast majority of cases, these two properties are mutually exclusive and thus there are computationally simple features with low descriptive power (e.g. isolated pixels, sums of area intensity) or complex and hard to compute features with high descriptive power (Gabor wavelets [Lee (1996)], HoG [Dalal & Triggs (2005)], SIFT and SURF [Bay et al. (2008); Lowe (2004)], etc.). A close to ideal approach is that of Viola and Jones [Viola & Jones (2001)] with their Haar features calculated in constant time from an integral representation of the image. The features used in this chapter are Local Binary Patterns (LBP) [Zhang et al. (2007)], Local Rank Patterns (LRP) [Hradiš et al. (2008)] and Local Rank Differences (LRD) [Zemcik et al. (2007)]. Their main properties are as follows.



Fig. 2. Feature samples for LBP (left), LRD and LRP (right)

All presented features are based on the same model and differ only in their evaluation function. First, coefficients *vi* are extracted from a regular 3 × 3 grid (see Fig. 2) by convolution. The coefficients are then processed by an evaluation function producing the feature response.


$$LBP(\mathbf{v}, c) = \sum\_{i=0}^{N} (v\_i > c)2^i \tag{1}$$

$$LRD(\mathbf{v}, a, b) = r(v\_a, \mathbf{v}) - r(v\_b, \mathbf{v}) \tag{2}$$

$$LRP(\mathbf{v}, a, b) = 10\,r(v\_a, \mathbf{v}) + r(v\_b, \mathbf{v}) \tag{3}$$

The evaluation of LBP works such that all samples are compared to the central one. The result of each comparison is treated as a single bit in the 8 bit code (1). The LRD and LRP features are parametrized by indices of two samples whose ranks are calculated (4). The ranks are subtracted in the case of LRD or combined together in LRP (2,3).

$$r(v, \mathbf{v}) = \sum\_{i=1}^{9} \begin{cases} 1, \text{when } v > v\_i \\ 0, \text{otherwise} \end{cases} \tag{4}$$

The response range of the features is ⟨0, 255⟩ for LBP, ⟨−8, 8⟩ for LRD and ⟨0, 99⟩ for LRP. The response is used as an input to a weak classifier which is essentially a look-up table assigning a weak classifier response to a feature response.
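For illustration, the evaluation functions (1)-(4) can be written in C roughly as follows. This is a minimal sketch: the clockwise ordering of the border samples in the LBP loop and all identifiers are illustrative choices, not taken from any particular implementation.

```c
#include <stdint.h>

/* Rank of value v among the nine grid samples, eq. (4). */
static int rank(uint8_t v, const uint8_t g[9])
{
    int r = 0;
    for (int i = 0; i < 9; ++i)
        if (v > g[i]) ++r;
    return r;
}

/* LBP, eq. (1): compare the eight border samples to the central one (g[4]). */
static int lbp(const uint8_t g[9])
{
    static const int border[8] = {0, 1, 2, 5, 8, 7, 6, 3};  /* illustrative clockwise order */
    int code = 0;
    for (int i = 0; i < 8; ++i)
        code |= (g[border[i]] > g[4]) << i;
    return code;                                             /* 0..255 */
}

/* LRD, eq. (2), and LRP, eq. (3), parametrized by sample indices a and b. */
static int lrd(const uint8_t g[9], int a, int b) { return rank(g[a], g) - rank(g[b], g); }
static int lrp(const uint8_t g[9], int a, int b) { return 10 * rank(g[a], g) + rank(g[b], g); }
```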

#### **4. AdaBoost and WaldBoost**

AdaBoost [Freund (1995)] and other boosting algorithms [Friedman et al. (2000); Grove & Schuurmans (1998); Ratsch (2001); Rudin et al. (2004); Schapire et al. (1998)] all combine *weak hypotheses ht* : *χ* → **R** into a *strong classifier Ht*. The combination is a weighted average where responses of the weak hypotheses are multiplied by weights *α* determining their importance:

$$H\_T(\mathbf{x}) = \sum\_{t=1}^{T} \left( h\_t(\mathbf{x}) \right) \tag{5}$$

The weak hypotheses often internally partition the object space *χ* into a set of disjoint areas based on a single feature response. Such weak hypotheses are called space partitioning weak hypotheses [Schapire & Singer (1999)] and the partition functions *f* : *χ* → **N** are referred to in the following text simply as *features*. The weak space partitioning hypotheses are combinations of such features and a *look-up table function l* : **N** → **R**

$$h\_t(\mathbf{x}) = l\_t(f\_t(\mathbf{x})).\tag{6}$$

The real value assigned by *lt* to output *j* of *ft* is denoted as $c\_t^{(j)}$ in the text.

Most of the boosting algorithms order the weak classifiers starting with the most informative one and thus it is reasonable to evaluate them in this order and stop when the classification decision is certain enough. Such classifiers are called *soft cascades* [Bourdev & Brandt (2005)] and can be formalized as a *sequential decision strategy* [Sochman & Matas (2005)] *S* which is a sequence of decision functions *S* = *S*1, *S*2,..., *ST*, where *St* : **R** → {♯, −1}. The evaluation of the strategy is terminated with a negative result when a decision function outputs −1. The decision functions *St* decide based on a tentative sum of the weak hypotheses *Ht*, *t* < *T* which is compared to a threshold *θt*:


$$S\_t(\mathbf{x}) = \begin{cases} \sharp, & \text{if } H\_t(\mathbf{x}) > \theta\_t \\ -1, & \text{if } H\_t(\mathbf{x}) \le \theta\_t \end{cases}. \tag{7}$$
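A soft cascade of this form can be evaluated as in the following C sketch. The `Stage` layout, the callback for *ft* and the final sign-based decision are assumptions made for illustration only.

```c
#include <stdint.h>

/* One stage of a soft cascade: a feature, its look-up table and the
   early termination threshold theta_t (illustrative data layout). */
typedef struct {
    int   (*feature)(const uint8_t *window);   /* f_t: window -> bin index    */
    const float *lut;                          /* l_t: bin index -> response  */
    float  theta;                              /* rejection threshold theta_t */
} Stage;

/* Evaluate the strategy on one window according to eq. (7): return -1 as soon
   as the tentative sum H_t drops to or below theta_t, otherwise decide from
   the full sum H_T. */
static int evaluate_cascade(const Stage *stages, int T, const uint8_t *window)
{
    float H = 0.0f;
    for (int t = 0; t < T; ++t) {
        H += stages[t].lut[stages[t].feature(window)];  /* h_t(x) = l_t(f_t(x)) */
        if (H <= stages[t].theta)
            return -1;                                   /* S_t outputs -1       */
    }
    return H > 0.0f ? +1 : -1;                           /* final decision       */
}
```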

WaldBoost [Sochman & Matas (2005)] is a method which produces an optimal decision strategy for a target false negative rate. The algorithm combines real AdaBoost [Schapire & Singer (1999)] and Wald's *sequential probability ratio test* [Wald (1947)].

Given a weak learner algorithm, training data {(*x*1, *y*1), . . . , (*xm*, *ym*)}, *x* ∈ *χ*, *y* ∈ {−1, +1} and a target false negative rate *α*, the WaldBoost algorithm finds a decision strategy *S*∗ with a miss rate *αS* which is lower than *α* and whose average evaluation time $\bar{T}\_S = E(\arg\min\_i (S\_i \neq \sharp))$ is minimal:

$$S^\* = \underset{S}{\operatorname{arg\,min}}\ \bar{T}\_{S}, \text{ s.t. } \alpha\_{S} < \alpha.$$

WaldBoost uses real AdaBoost to iteratively select the most informative weak hypotheses *ht*. The threshold *θt* is then selected in each iteration so that as many negative training samples as possible are rejected while asserting that the likelihood ratio estimated on the training data

$$\hat{R}\_{t} = \frac{p(H\_{t}(\mathbf{x}) < \theta\_{t} | y = -1)}{p(H\_{t}(\mathbf{x}) < \theta\_{t} | y = +1)}$$

satisfies $\hat{R}\_t \geq \frac{1}{\alpha}$.
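As a rough illustration of this threshold selection, assuming the tentative responses *Ht*(**x**) of all training samples are available, candidate thresholds can be scanned and the largest one still satisfying the constraint kept. The sketch below estimates the two probabilities by simple counting, which is a simplification of the actual WaldBoost estimation.

```c
/* Fraction of responses in r[0..n) lying below theta. */
static double frac_below(const double *r, int n, double theta)
{
    int c = 0;
    for (int i = 0; i < n; ++i)
        if (r[i] < theta) ++c;
    return (double)c / n;
}

/* Pick the largest candidate threshold (taken from the negative responses)
   for which the estimated ratio R = p(H < theta | -1) / p(H < theta | +1)
   still satisfies R >= 1/alpha, i.e. reject as many negatives as allowed. */
static double select_theta(const double *neg, int n_neg,
                           const double *pos, int n_pos, double alpha)
{
    double best = -1e300;                    /* "reject nothing" fallback */
    for (int k = 0; k < n_neg; ++k) {
        double theta = neg[k];
        double p_neg = frac_below(neg, n_neg, theta);
        double p_pos = frac_below(pos, n_pos, theta);
        if (p_neg == 0.0)
            continue;                        /* rejects nothing, skip     */
        double ratio = (p_pos > 0.0) ? p_neg / p_pos : 1e300;
        if (ratio >= 1.0 / alpha && theta > best)
            best = theta;
    }
    return best;
}
```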

#### **5. Acceleration of WaldBoost based object detection**

Acceleration of object detection can be in general based on several principles, the key ones being:

• Implementation on a (more) powerful computational platform – simple general improvement of computational platforms

• exploitation of a structurally different platform compared to the traditional processor platform

• improvement of the AdaBoost/WaldBoost machine learning and/or feature extraction algorithms

• exploitation of redundancy and coherence in results of classification in different (adjacent or close) areas of the image.
General improvement of computational platforms is not of interest in this publication. On the other hand, structurally novel computational platforms are of interest, both because of their rapid growth in computer technology and specifically in object detection, where the pattern of use of the computational resources suggests that the traditional platforms are not ideal and that massively parallel platforms are also not completely suitable.

The general improvements of the AdaBoost/WaldBoost machine learning methods are outside the scope of this publication. However, the algorithmic improvements not connected with the classification itself, but rather with the redundancy due to correlation of the classification results in different sub-images of the same image, are quite important to investigate. Their exploitation can significantly reduce the computational effort needed for object detection.

#### **5.1 Classification cost and its minimization**

The relative cost of classifier evaluation can be measured and used for the reduction of the computational effort by combining two or more different approaches to classifier implementation; for example, a hardware pre-processing unit connected to a post-processing unit on a traditional CPU. The minimization method can be applied to various types of relative cost (computation, memory, hardware price, etc.) as its formulation is general. In this chapter, the interest is in the minimization of the use of computational resources and the relative cost thus roughly corresponds to computational time (except when otherwise noted).

#### **5.1.1 Classifier statistics**

The main property of a classifier is the probability of the evaluation of a weak hypothesis, reflecting how often a weak hypothesis is executed during the detection. This value *p* can be calculated for every stage *i* from statistics obtained on a dataset of images. Due to the rejection nature of WaldBoost classifiers, the sequence of *pi* decreases and the first stage is always evaluated (i.e. *p*<sup>1</sup> = 1). An example of such statistics is shown on the left in Fig. 3. The *pi* captures the computational complexity of the classifier.

Fig. 3. Example of classifier statistics. Left, stage execution probability. Right, number of evaluated weak hypotheses on average for particular length of the classifier.

#### **5.1.2 Cost evaluation**

In the case of AdaBoost/WaldBoost classifiers the total cost *C* is proportional to the number of evaluated weak hypotheses, which can be calculated by (8). The *T* is the length of the classifier. The *k* is the overall classifier cost which symbolizes the evaluation cost on a particular platform on which the classification is implemented. The *p* is the probability of the execution of a particular weak hypothesis (see Section 5.1.1). The *c* is the relative cost of the weak hypothesis evaluation which addresses the possibility that the hypotheses have a different cost (due to the use of different features, for example).

$$\mathbf{C} = k \sum\_{i=1}^{T} p\_i c\_i \tag{8}$$

When analyzing real classifiers, *p* can be obtained from the statistics on input images, *c* by time measurement or other cost estimation, and *k* can be set to a constant value. In Fig. 3, the left plot shows the value of *pi* and the right plot the area under the *pi* curve which is proportional to the amount of computational resources needed for the evaluation of the classifier.

In object detection, the most common are homogeneous classifiers (i.e. those with all weak classifiers and features of the same type). In such cases, the cost of hypothesis evaluation is constant *ci* = *c*. Additionally in AdaBoost, all weak hypotheses are executed every time and the probability of executing all hypotheses is equal to *pi* = 1. The *C* from (8) can thus be simplified to *C*(*AB*) (for AdaBoost) and *C*(*WB*) (for WaldBoost) in (9).

$$\mathbf{C}^{(AB)} = kTc \qquad \mathbf{C}^{(WB)} = kc \sum\_{i=1}^{T} p\_i \tag{9}$$
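As a concrete illustration of (9), both costs can be computed directly from measured stage execution probabilities; the probabilities and unit costs used below are made-up example values.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative stage execution probabilities p_i measured on a dataset
       (p_1 = 1, then decreasing due to early rejections). */
    const double p[] = {1.0, 0.62, 0.41, 0.27, 0.18, 0.12, 0.08, 0.05};
    const int    T   = sizeof p / sizeof p[0];
    const double k   = 1.0, c = 1.0;          /* platform and per-hypothesis cost */

    double sum_p = 0.0;
    for (int i = 0; i < T; ++i)
        sum_p += p[i];

    printf("C(AB) = %.2f\n", k * T * c);      /* AdaBoost: all stages always run */
    printf("C(WB) = %.2f\n", k * c * sum_p);  /* WaldBoost: eq. (9)              */
    return 0;
}
```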

#### **5.1.3 Cost minimization**


The cost of classification is not only a property of the classifier, but also a property of the implementation of the run-time in which the classifier is executed – feature extraction and classifier evaluation. Different implementations with different properties exist. Imagine, for example, an implementation *A* which can evaluate very efficiently *K* > 1 weak hypotheses in a row, but always evaluates *all* of them no matter how many weak hypotheses are actually needed for the evaluation. It could be a pre-processing unit implemented in hardware which rejects areas without an occurrence of the target object. Then, there is implementation *B* in software which can evaluate the classifier in the standard way. The computational cost for one feature in *A* is much less than in *B* but implementing the whole classifier in the hardware is hard to achieve due to limited resources.

$$\mathcal{C} = \operatorname\*{arg\,min}\_{0 \le u \le T} \left( k\_1 \sum\_{i=0}^{u-1} p\_{1,i} c\_{1,i} + k\_2 \sum\_{i=u}^{T-1} p\_{2,i} c\_{2,i} \right) \tag{10}$$

Both implementations can be put together, but the problem is how many weak hypotheses have to be put in a hardware unit and how many are left in the software. The precise position of division of the evaluation is subject to minimization of classification cost (10) in order to find a composition with minimal cost.

The two-phase classifier can be fine-tuned by one parameter. Equation 10 shows the minimization problem and Fig. 4 shows values of *C* for different settings of *u*. The *C* is the total minimal cost of the evaluation; *u* is the point of classifier division; and *k*, *c* and *p* correspond to the parameters of the cost computation from Equation 8. It should be noted that although the properties *p* of the classifier are the same for both parts, the *p* can in general be different for each part. This is due to the structure of the evaluation in a particular implementation, which can *force* different probabilities of feature evaluation (e.g. by evaluating more features in one step; see Section 5.1.1).
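Because *u* can only take *T* + 1 values, the minimization in (10) can be solved by direct search; the following C sketch (with illustrative parameter arrays) does exactly that.

```c
/* Exhaustive search for the division point u minimizing eq. (10).
   p1, c1 describe the first phase (e.g. the hardware unit), p2, c2 the
   second one; all arrays have T entries, k1 and k2 are platform costs. */
static int best_division(const double *p1, const double *c1, double k1,
                         const double *p2, const double *c2, double k2,
                         int T, double *min_cost)
{
    int    best_u = 0;
    double best   = 1e300;
    for (int u = 0; u <= T; ++u) {
        double cost = 0.0;
        for (int i = 0; i < u; ++i) cost += k1 * p1[i] * c1[i];  /* phase one */
        for (int i = u; i < T; ++i) cost += k2 * p2[i] * c2[i];  /* phase two */
        if (cost < best) { best = cost; best_u = u; }
    }
    if (min_cost) *min_cost = best;
    return best_u;
}
```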

Fig. 4. Example of minimization of classification cost for two-phase classifier. The first phase always evaluates all weak hypotheses but the cost for a weak hypothesis is 0.1 of the second phase. The second phase evaluates weak hypotheses one by one. The black dot marks the division between the parts that lead into the minimum cost.

When going beyond the example given above, more than two phases of evaluation can be used and the minimization problem is thus multi-dimensional. In the general case, described by (11), the classifier division is a vector **u** whose values are searched for in order to find the best composition of parts with different properties. Note that $u\_i$ can be equal to $u\_{i+1}$ and some parts could in fact be skipped when they are evaluated as useless in the optimization.

$$\begin{aligned} \mathbf{C} &= \operatorname\*{arg\,min}\_{\mathbf{u}} \left( \sum\_{m=1}^{M} \left( k\_{m} \sum\_{i=u\_{m-1}}^{u\_{m}-1} p\_{m,i} c\_{m,i} \right) \right) \\ \text{s.t.} \quad & u\_{0} = 0 \\ & u\_{M} = T \\ & u\_{i-1} \le u\_{i}, \ 1 \le i \le M \end{aligned} \tag{11}$$

In practical applications, it is easy to get classifier statistics – it reflects classifier behavior on images. On the other hand, it is tricky to identify values of *c* and *k*. It has to be done by careful examination of performance of the particular implementation of the detection (e.g. by the precise measurement of time needed for the execution of weak hypotheses).

#### **5.2 Exploiting neighbors**

In scanning window object detection using a soft cascade detector, each image position is processed independently. However, much information is shared between neighboring positions and utilizing this information has a potential for increasing the speed of detection.

One way to utilize the shared information is to learn *suppression classifiers* [Zemčík et al. (2010)] to predict the responses of the original detection classifier at neighboring positions. Computation of the original detector can then be suppressed at positions for which this prediction is negative and with enough confidence.

In the case of *space partitioning weak hypotheses* (see Section 4), the suppression classifiers can be made computationally very efficient by re-using the features *ft* computed by the original classifier. In that case, adding the suppression classifiers just increases the size of the look-up table *l* : **N** → **R**.
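In implementation terms, this sharing can be sketched as follows: each stage keeps a second look-up table, so a single feature evaluation updates both the detector sum and the suppression sum. The data layout and names are illustrative assumptions, not the actual implementation.

```c
#include <stdint.h>

/* A stage whose feature response feeds two look-up tables: the original
   detector LUT l_t and the suppression-classifier LUT l'_t (illustrative). */
typedef struct {
    int   (*feature)(const uint8_t *window);  /* shared feature f_t     */
    const float *lut_det;                      /* l_t of the detector    */
    const float *lut_sup;                      /* l'_t of the suppressor */
} SharedStage;

/* Evaluate one stage and update both tentative sums H_t and H'_t:
   one feature extraction, two cheap table look-ups. */
static void eval_shared_stage(const SharedStage *s, const uint8_t *window,
                              float *H_det, float *H_sup)
{
    int bin = s->feature(window);
    *H_det += s->lut_det[bin];
    *H_sup += s->lut_sup[bin];
}
```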

The task of learning the suppression classifiers can be formulated as detector emulation [Šochman & Matas (2007); Sochman & Matas (2009)] which allows usage of unlabeled data for training and does not require any modifications in learning the original detection classifier. Moreover, previously created detectors can be used as well.

Fig. 5. Neighborhood suppression - during scanning, positions surrounding the currently evaluated position can be suppressed. On such positions the classifier will not be computed.

In the classifier emulation [Šochman & Matas (2007); Sochman & Matas (2009)] approach, an existing detector is considered a black box and its decisions are used as labels for a new WaldBoost learning problem. The algorithm for learning the suppression classifiers differs from this basic scenario in three distinct aspects discussed below. The whole algorithm for learning a suppression classifier is summarized in Algorithm 1.

The first change, as mentioned earlier, is that the weak hypotheses $h'\_t$ of a suppression classifier reuse the features *ft* of the original detector, and only new look-up table functions $l'\_t$ are learned. By restricting the features, the learning process is very fast as the selection of an optimal weak hypothesis is generally the most time consuming step.

The second difference is that the labels for training the suppression classifier are obtained from a different image position than where the classifier gets information from (the position containing the original features *ft*). This is consistent with the fact that we want to predict responses in the neighborhood of the currently evaluated position.

Finally, the set of training samples is pruned twice in each iteration of the learning algorithm instead of only once as in WaldBoost. The samples rejected by the new suppression classifier are removed from the training set, as well as the samples rejected by the original classifier. This reflects the behavior during scanning, when only those features which are needed by the detector to make a decision are computed and, consequently, the suppression classifiers can only use these computed features to make their own decision.

#### **Algorithm 1** WaldBoost for learning suppression classifiers

**Input**: original soft cascade $H\_T(x) = \sum\_{t=1}^{T} h\_t(x)$, its early termination thresholds $\theta^{(t)}$ and its features *ft*; desired miss rate *α*; training set {(*x*1, *y*1), . . . , (*xm*, *ym*)}, *x* ∈ *χ*, *y* ∈ {−1, +1}, where the labels *yi* are obtained by evaluating the original detector *HT* at an image position with a particular displacement with respect to the position of the corresponding *xi*

**Output**: look-up table functions $l'\_t$ and early termination thresholds $\theta'^{(t)}$ of the new suppression classifier

**Initialize** sample weight distribution $D\_1(i) = \frac{1}{m}$

**for** *t* = 1, . . . , *T*

1. estimate new $l'\_t$ such that its

$$c\_t^{(j)} = -\frac{1}{2} \ln \left( \frac{Pr\_{i \sim D}(f\_t(\mathbf{x}\_i) = j | y\_i = +1)}{Pr\_{i \sim D}(f\_t(\mathbf{x}\_i) = j | y\_i = -1)} \right)$$

2. add $l'\_t$ to the suppression classifier

$$H'\_t(\mathbf{x}) = \sum\_{r=1}^{t} l'\_r(f\_r(\mathbf{x})),$$

3. find the optimal threshold $\theta'^{(t)}$

4. remove from the training set the samples for which $H\_t(x) \le \theta^{(t)}$

5. remove from the training set the samples for which $H'\_t(x) \le \theta'^{(t)}$

6. update the sample weight distribution

$$D\_{t+1}(i) \propto \exp\left(-y\_i H'\_t(x\_i)\right)$$

#### **5.3 Early non-maxima suppression**

Detection of objects by a scanning window technique usually employs some kind of *non-maxima suppression* to select a position with the highest classifier response from a small neighborhood in position, scale and other possible degrees of freedom. The suppressed detections have no influence on the resulting detection and it may not be necessary to compute the detectors completely in these positions. In other applications only the highest response on a number of samples is of interest as well. Examples of such applications are speaker and person recognition where a short utterance or face image is matched by a classifier to templates from a database.

The main idea of *Early non-Maxima Suppression* [Herout et al. (2011)] (EnMS) is to perform non-maxima suppression already during computation of classifiers and to stop computing classifiers for objects having very low probability to reach the best score in the set of the competing objects.

In the context of soft cascades, EnMS can be formalized as the *Conditioned Sequential Probability Ratio Test* (CSPRT) which allows the decision functions *St* (see Equation 7 for the original formulation) to be conditioned by some additional data *zt* ∈ Z:

$$S\_t(\mathbf{x}, z\_t) = \begin{cases} -1, \text{ if } H\_t(\mathbf{x}) < \theta\_t(z\_t)\\ \sharp, \quad \text{if } \theta\_t(z\_t) \le H\_t(\mathbf{x}) \end{cases} \tag{12}$$

Here the threshold becomes a function of the conditioning data.

In order to create an optimal CSPRT strategy, the threshold functions *θt*(*zt*) should be optimized for the same objectives as the thresholds *θt* in WaldBoost (see Equation 13). Parameters of *θt*(*zt*) should be set so that as many negative training samples as possible are rejected while asserting that the likelihood ratio estimated on the training data

$$\hat{R}\_t = \frac{p(H\_t(\mathbf{x}) < \theta\_t(z\_t) | y = -1)}{p(H\_t(\mathbf{x}) < \theta\_t(z\_t) | y = +1)} \tag{13}$$

satisfies $\hat{R}\_t \geq \frac{1}{\alpha}$.


For the EnMS approach to be effective, the conditioning information *zt* has to encode how well the other competing samples are classified and the functional form of the threshold function *θt*(*zt*) has to be simple enough to allow reliable estimation of its optimal parameters.

In our approach, the weak hypothesis *ht* is evaluated for the whole set of competing samples X at a time, and the conditioning information is the maximum tentative classifier response on the competing samples

$$z\_t = \max\_{x \in \mathcal{X}} (H\_t(x)). \tag{14}$$

We choose *θt*(*zt*) as

$$
\theta\_t(z\_t) = z\_t - \lambda\_t. \tag{15}
$$

With this choice of *θt*(*zt*), the EnMS condition for rejecting samples in Equation 12 becomes

$$H\_t(\mathbf{x}) < z\_t - \lambda\_t. \tag{16}$$

With these choices, EnMS introduces only a very small computational overhead. When computed sequentially, a weak hypothesis *ht* can be computed on all active positions; then the maximal responses can be gathered and the samples fulfilling *Ht*(*x*) < *zt* − *λ<sup>t</sup>* can be suppressed. When computing positions in parallel, the process has to be synchronized before the suppression step and gathering the maximal value may require synchronization, atomic instructions or a special value reduction method. However, even in highly parallel environments, the possible issues are not that significant as the potential serial operations are simple. Furthermore, suppression does not have to be performed after each weak hypothesis and the computation does not have to be strictly enforced without any significant performance drawbacks.
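In the sequential case, one round of this process can be sketched in C as follows; the per-position arrays and the `weak_response` callback are illustrative assumptions.

```c
#include <stdbool.h>

/* One EnMS round over a set of competing positions: evaluate weak hypothesis
   t on every active position, gather the maximum tentative response z_t and
   suppress positions falling below z_t - lambda_t, eq. (16). */
static void enms_round(float *H, bool *active, int n,
                       float (*weak_response)(int pos, int t),
                       int t, float lambda_t)
{
    float z = -1e30f;
    for (int i = 0; i < n; ++i) {
        if (!active[i]) continue;
        H[i] += weak_response(i, t);       /* h_t on every active position */
        if (H[i] > z) z = H[i];            /* conditioning information z_t */
    }
    for (int i = 0; i < n; ++i)
        if (active[i] && H[i] < z - lambda_t)
            active[i] = false;             /* early non-maxima suppression */
}
```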

#### **6. Runtime design**

#### **6.1 Exploiting SIMD architectures**

The SIMD (Single Instruction Multiple Data) architectures exploit data level parallelism to accelerate certain operations. Contrary to instruction parallelism, data parallelism works so that the CPU performs the same instruction on vectors of data. This approach is very efficient in tasks where a simple computation is performed on a large amount of data (e.g. stream processing).

Typically, CPUs contain a standard instruction set which processes integers and floats. This set is extended with a set of vector instructions which work over vectors of data stored in the memory. Vector instructions typically include standard arithmetic and logic instructions, instructions for data access and other data manipulation instructions (packing, unpacking, etc.). This is the case of general purpose CPUs like Intel, AMD or PowerPC. Beside the general purpose CPUs, there are GPGPU (General Purpose Graphics Processing Units), successors of traditional GPUs (purposed to process graphics primitives) that can execute parallel kernels over data, and that can be viewed as advanced SIMD processors.

The SIMD architecture can be used especially to accelerate the following parts of detection.

• *Weak classifier evaluation* - the instructions can be used to evaluate multiple weak classifiers.

• *Feature evaluation* - the features like LRD, LRP and LBP can be evaluated in a data-parallel fashion.


When evaluating the weak hypotheses in a one-by-one manner, the evaluation of a feature can be transformed to SIMD processing so that all feature samples are loaded to registers and the response is evaluated by using SIMD instructions instead of a typical implementation by a loop [Herout et al. (2009); Juránek et al. (2010)]. This necessarily needs a pre-processing stage that transforms an image to a SIMD-friendly form which allows for simple access to the data belonging to a feature - a convolution of the image. The speed-up of this method compared to a naive implementation is very high, around 3 to 5, depending on the particular architecture on which it is implemented.
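As an illustration of such per-feature SIMD evaluation, the rank function used by LRD and LRP can be computed with SSE2 intrinsics roughly as follows. The packing of the nine samples into a 16-byte buffer, the pre-shift of pixel values to the signed range (required by the signed byte comparison), and the GCC/Clang-style `__builtin_popcount` are assumptions of this sketch.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Rank of sample grid[a] among the nine grid samples (eq. 4), computed with
   SSE2 instead of a scalar loop.  The nine samples are assumed to occupy the
   first 9 bytes of a 16-byte buffer and to be pre-shifted by -128 so that
   the signed byte comparison below is valid. */
static int rank_sse2(const int8_t grid[16], int a)
{
    __m128i samples = _mm_loadu_si128((const __m128i *)grid);
    __m128i value   = _mm_set1_epi8(grid[a]);
    __m128i greater = _mm_cmpgt_epi8(value, samples);   /* value > sample ?       */
    int mask = _mm_movemask_epi8(greater) & 0x1FF;      /* keep the 9 valid lanes */
    return __builtin_popcount(mask);                    /* number of smaller samples */
}
```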

When evaluating multiple hypotheses, the implementation is pretty much the same as for one weak hypothesis without SIMD instructions. The difference is that the SIMD registers can hold information for more weak hypotheses (16 in the case of Intel SSE). This leads to an efficient implementation of AdaBoost classifiers. WaldBoost classifiers, on the other hand, can be inefficient with this implementation as many weak hypotheses are calculated even when they are not necessarily needed for the classifier evaluation. Pre-processing is needed again to simplify the data access and feature evaluation. The speed-up achieved by this method is very high. In fact, when implementing WaldBoost evaluation, it is comparable to the method in the previous paragraph, even though many weak hypotheses are calculated unnecessarily.

In some cases, the feature response can be pre-calculated for all positions in the image and, during detection, the feature is extracted by only one access to a pre-calculated image. In this case, for each version of a feature, an image with a pre-calculated result must be created. This is only possible when a small number of feature variants exists. For example, LBP with size restricted to 2 × 2 pixels per block has four variants. On the other hand, LRD with the same restriction has 144 variants (as it is additionally parametrized by the A and B indices) and calculation of such a high number of images would be computationally expensive.

To summarize, the benefits brought by SIMD processing are the following: SIMD allows features to be extracted very efficiently and the performance of classifier evaluation can even be increased several times. On the other hand, SIMD comes with the need for pre-processing which, when implemented without care, can reduce performance.

## **6.2 GPU implementation of the detection**

Implementation of object detection on GPUs was historically done using programmable shaders [Polok et al. (2008)]; however, the contemporary state of the art is in GP-GPU programming languages, such as CUDA or OpenCL [Herout et al. (2011)]. GP-GPUs programmed using one of these languages present one of the most powerful and efficient computational devices. When used for object detection, GP-GPUs can be seen as a SIMD device with a high level of parallelism.

Unfortunately, the high level of parallelism is difficult to employ in WaldBoost detection as the amount of computation in adjacent positions in the image is not correlated and in general is quite unpredictable, which fact heavily complicates usage of the ALUs in the SIMD device.

The efficient implementation of object detection using CUDA [Herout et al. (2011)] solves the problems of two main domains: the classifier operating on one fixed-size window, and parallel execution of this classifier on different locations of the input image. The problem of object detection by statistical classifiers can be divided into the following steps:

• loading and representing the classifier data

• image pre-processing

• classifier evaluation

• retrieving results.

The constant data containing the classifier (the image features' parameters, the prediction values of the weak hypotheses summed by the algorithm, and the WaldBoost thresholds) can be accommodated in the texture memory or constant memory of the CUDA architecture. These data are accessed in the evaluation of each feature at each position, so the demands for access speed are critical. Programs run on the graphics hardware using CUDA are executed as *kernels*; each kernel is launched as a number of blocks, and each block is further organized into threads. The code of the threads consumes hardware resources - registers and shared memory - which bounds the number of threads that can be efficiently executed in a block, both from above and from below.
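
A possible layout of these constants is sketched below. This is not the authors' data format: the structure fields, the stage limit and the helper function are illustrative assumptions, and the per-response prediction values are deliberately left out, since for a long classifier they would exceed the 64 KB of constant memory and would rather be bound to a texture.

```cuda
// Illustrative sketch: keeping the classifier constants in CUDA constant memory.
#include <cuda_runtime.h>

#define MAX_WEAK 1024                  // assumed upper bound on weak hypotheses

struct WeakHypothesis {
    unsigned char x, y;                // feature offset inside the scanning window
    unsigned char blockW, blockH;      // LBP/LRD block size
    float         threshold;           // WaldBoost rejection threshold of the stage
};

__constant__ WeakHypothesis d_classifier[MAX_WEAK];
__constant__ int            d_numWeak;

void uploadClassifier(const WeakHypothesis* hostData, int n)
{
    // Copy the classifier description to the device once, before detection starts.
    cudaMemcpyToSymbol(d_classifier, hostData, n * sizeof(WeakHypothesis));
    cudaMemcpyToSymbol(d_numWeak, &n, sizeof(int));
}
```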

One thread computes one or more locations of the scanning window in the image. The image pixels (or, more precisely, the window locations) are therefore divided into rectangular tiles, which are handled by different thread blocks. Experiments showed that a suitable number of threads per block was around 128. Executing a block for only 128 pixels of the image would not be efficient, so each thread calculates more than one position of the window - a whole column of pixels in a rectangular tile. A good consequence of this layout is easy control of the resources used by one block: the number of threads is determined by the width of the tile, and the height controls the total number of window positions processed by the block. The tile can extend over the whole height of the image or just a part of it. In order to avoid collisions of concurrently running threads and blocks, an atomic increment (the atomicInc function) of one shared word in global memory is used for synchronization of the output. This operation is rather costly, but positive detections are so rare that this means of output can be afforded. As a consequence, the results of the whole process are available at the end in one spot of the global memory, from which they can easily be transferred to the host computer.
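
A schematic kernel with this layout might look as follows. It is only a sketch under the assumptions above: `evalWindow` is a placeholder for the WaldBoost evaluation of one window, and the tile dimensions, structures and names are illustrative rather than the authors' code.

```cuda
// Illustrative sketch: one thread walks one column of window positions in a
// rectangular tile; positive detections are appended to a single global array
// whose write position is reserved with atomicInc.
struct Detection { int x, y; float score; };

__device__ bool evalWindow(const unsigned char* img, int pitch, int x, int y, float* score)
{
    // Placeholder for the WaldBoost classifier evaluated on the window at (x, y).
    *score = 0.0f;
    return false;
}

#define TILE_W 128   // threads per block - one thread per column of the tile
#define TILE_H 64    // window positions processed by each thread

__global__ void detectKernel(const unsigned char* img, int pitch, int imgW, int imgH,
                             Detection* out, unsigned int* outCount)
{
    int x  = blockIdx.x * TILE_W + threadIdx.x;   // column handled by this thread
    int y0 = blockIdx.y * TILE_H;                 // first row of the tile
    if (x >= imgW) return;

    // Clipping against the window size at the image border is omitted here.
    for (int y = y0; y < min(y0 + TILE_H, imgH); ++y) {
        float score;
        if (evalWindow(img, pitch, x, y, &score)) {
            unsigned int i = atomicInc(outCount, 0xFFFFFFFFu);  // reserve an output slot
            Detection d; d.x = x; d.y = y; d.score = score;
            out[i] = d;   // detections are rare, so the atomic is affordable
        }
    }
}
```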

The main property of the CUDA implementation is that it outperforms the CPU implementation mainly on high-resolution videos. This can be explained by the extra overhead connected with transferring the image to the GPU, starting the kernel programs, retrieving the results, etc. These overhead operations typically consume a constant time independent of the problem size, so they are better amortized on high-resolution videos.

## **6.3 Programmable hardware**

The runtime for object detection does not necessarily need to be implemented only in software; programmable hardware, namely field programmable gate arrays (FPGAs), is an option as well [Jin et al. (2009); Lai et al. (2007); Theocharides et al. (2006); Wei et al. (2004); Zemčík & Žádník (2007)]. While the algorithms of object detection are in principle the same for software and hardware implementations, the hardware platform offers features largely different from software; the optimal methods needed to implement detection in programmable hardware therefore often differ from those used in software and, in many cases, the hardware implementation may be very efficient.


The key features of the hardware platform that are important for object detection, that differ greatly from software, and that make a hardware implementation beneficial, include:


Of course, the hardware implementation also has severe limitations, the most important being:


Taking into account the above advantages and limitations of programmable hardware, it can be considered for object detection especially in the following cases:


Based on the above ways of exploitation, the implementations of object detection in programmable hardware can be subdivided into complete detection and pre-processing.

Complete object detection in programmable hardware is typically implemented using a sequential engine, possibly microprogrammable, which performs detection location by location and weak classifier by weak classifier until a decision is reached. As the evaluation of each weak classifier is relatively complex, the operation of the sequential unit is pipelined, so that several instances can run in parallel. At the same time, different locations in general require different numbers of weak classifiers to be evaluated. These facts lead to relatively complex timing and synchronization of the processing; however, very good performance can be achieved [Zemčík & Žádník (2007)].

Fig. 6. Block structure of the object detector in programmable hardware (source Zemčík & Žádník (2007)).

In a situation where a complete evaluation of the detection is not required (e.g. when a powerful CPU is available) and the programmable hardware can be exploited for pre-processing, the best approach is probably the synthesis of fixed-function circuits, generated "on demand" for each classifier from the results of the machine learning process. Such a synthesized circuit is most efficient when it processes a (small) fixed number of weak classifiers for every evaluated position. While some of the weak classifiers are in such cases evaluated unnecessarily (assuming the WaldBoost algorithm), the average cost of a weak classifier implementation is still often much lower than in the sequential machine described above. The main advantage of this approach is that all weak classifiers can be evaluated in parallel. However, as each weak classifier consumes chip resources, only a very small number of weak classifiers can be implemented in this way.

**7. Results**

**7.1 Classifier cost minimization**

This section gives an example of optimizing classifier performance by balancing the amount of computation between a fast hardware pre-processing unit and a software post-processing unit. The classifiers used in this experiment were face detectors composed of 1000 weak hypotheses with LBP features and different false negative error rates (in a range from 0.02 to 0.2).

As a baseline, a software implementation working on an integral image was selected, as it is the standard way of implementing the detection. The other implementations used in the experiments were an SSE implementation that evaluates features one by one (SSE-A) and an SSE implementation that evaluates 16 weak hypotheses in a row (SSE-B).

The cost of the hardware unit was selected according to the area on the chip taken by the design. We set the cost constantly to *c<sub>i</sub>* = 1/*m*, where *m* is the maximal number of hypotheses that can fit in the circuit; in this experiment, we use *m* = 50. In general, by setting the cost to a low value we simply say that the cost of the hardware unit is not of much interest to us and, conversely, by setting the cost to a large value we say that the cost of the hardware is very important. The cost of the post-processing unit was calculated from the measurement of
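
For illustration, under this cost model a pre-processing unit that implements *k* weak hypotheses in hardware has a cost of *k* · *c<sub>i</sub>* = *k*/*m*; with *m* = 50, a unit holding 25 weak hypotheses thus has a cost of 25/50 = 0.5, i.e. half of the cost assigned to a fully occupied circuit.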
