*3.1.2.2. Parts‐based deformable models*

Parts‐based deformable models represent a face as a collection of landmark points, similar to PAMs. The difference is that the most likely locations of the parts are calculated within a probabilistic framework: the landmarks are represented as a mixture of trees over the landmark points on the face [39]. Let Φ be the set of landmark points on the face. A facial configuration *L* is modeled as *L* = {*p<sub>i</sub>* : *i* ∈ Φ}. Alignment of the landmark points is achieved by maximizing the posterior likelihood of appearance and shape. The objective function is formulated as follows:

$$\epsilon(I, L, j) = \sum_{i \in \Phi} u_{ij} \cdot s(p_i) + \sum_{(i, k)} \left( b_1^{ijk} \tilde{x}^2 + b_2^{ijk} \tilde{x} + b_3^{ijk} \tilde{y}^2 + b_4^{ijk} \tilde{y} \right) \tag{3}$$

where ϵ is the objective function to be maximized; *I* is the video frame; *j* is the mixture index; *i* and *k* are landmark point indexes; *u<sub>ij</sub>* is the template of mixture *j* at point *i*; *s* is an appearance feature; *b*<sub>1</sub>, *b*<sub>2</sub>, *b*<sub>3</sub>, and *b*<sub>4</sub> are the spring rest and rigidity parameters of the model's shape; and *x̃* and *ỹ* are the horizontal and vertical displacements between points *i* and *k*:

$$\tilde{x} = x_i - x_k \tag{4}$$

$$\tilde{y} = y_i - y_k \tag{5}$$

Inference is carried out by maximizing the following:

$$\max_{j} \left( \max_{L} \epsilon(I, L, j) \right) \tag{6}$$

which enumerates over all mixtures and configurations. The maximum likelihood of the model which best fits the parameters is computed with the Chow‐Liu algorithm [45].
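As a minimal sketch of how the objective in Eq. (3) could be evaluated for a fixed configuration under one mixture, consider the following. The function name and data layout (`configuration_score`, dictionaries keyed by landmark index) are illustrative choices, not the chapter's implementation:

```python
import numpy as np

def configuration_score(appearance, templates, edges, springs, points):
    """Score a facial configuration L = {p_i} under one mixture j (Eq. (3)).

    appearance: dict i -> appearance feature s(p_i) extracted at point i
    templates:  dict i -> template u_ij of mixture j at point i
    edges:      list of (i, k) pairs forming the tree of mixture j
    springs:    dict (i, k) -> (b1, b2, b3, b4) spring parameters
    points:     dict i -> (x, y) landmark locations
    """
    # Appearance term: how well each local patch matches its template.
    score = sum(float(np.dot(templates[i], appearance[i])) for i in points)
    # Shape term: quadratic spring energy on the displacement between
    # connected landmarks (Eqs. (4) and (5)).
    for (i, k) in edges:
        b1, b2, b3, b4 = springs[(i, k)]
        dx = points[i][0] - points[k][0]   # x-tilde
        dy = points[i][1] - points[k][1]   # y-tilde
        score += b1 * dx**2 + b2 * dx + b3 * dy**2 + b4 * dy
    return score
```

Inference per Eq. (6) then amounts to taking the maximum of this score over mixtures and configurations, which is tractable because each mixture is a tree.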

#### *3.1.2.3. Least square selection*

We compare the results of all three pipelines to check whether a face has been properly detected. The problem is posed as quantifying the accuracy of each extraction pipeline: we minimize the distance between the candidate face ROI *I<sub>k</sub>* and the face reference *R<sub>ARI</sub>* in the least-squares sense:

$$\min_{k} \sqrt{\sum_{(x, y)} \left( I_k(x, y) - R_{ARI}(x, y) \right)^2} \tag{7}$$

where *I<sub>k</sub>* is a candidate face ROI from one of the face extraction pipelines *k*. It is possible that Eq. (7) fails to generate a candidate face. There are two causes for this: (A) no candidate face ROIs are generated, or (B) the selected face is a false alarm, e.g., it is not a face, or the bounding box is poorly centered. To prevent (B), the face selected in Eq. (7) must have a distance to the reference no greater than a parameter *T*, which is empirically selected in training. If the detector fails because of (A) or the distance exceeds *T*, the last extracted face is used for further processing in the recognition pipeline. Note that when comparing the proposed method to other detectors in **Table 1**, we count both (A) and (B) as failures of the method.
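A minimal sketch of this selection-and-fallback logic (function and variable names are ours, and the candidates are assumed to be grayscale arrays of the same size as the reference):

```python
import numpy as np

def select_face(candidates, reference, T, last_face):
    """Pick the candidate ROI closest to the reference face (Eq. (7)).

    candidates: list of candidate ROIs, one per extraction pipeline
    reference:  reference face image R
    T:          empirically chosen distance threshold
    last_face:  fallback ROI from the previous frame
    """
    if not candidates:                      # failure case (A): no candidates
        return last_face
    # Root-sum-of-squares distance of each candidate to the reference.
    dists = [np.sqrt(np.sum((c.astype(float) - reference.astype(float)) ** 2))
             for c in candidates]
    best = int(np.argmin(dists))
    if dists[best] > T:                     # failure case (B): likely false alarm
        return last_face
    return candidates[best]
```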


| % | Viola and Jones (VJ) | Constrained local models (CLM) | Supervised descent method (SDM) | Proposed face detector |
|---|---|---|---|---|
| True positive rate | 60.27 ± 10.53 | 68.36 ± 9.80 | <u>81.37 ± 17.60</u> | **86.29 ± 8.90** |
| F1‐score | 74.52 ± 19.67 | 80.81 ± 7.17 | <u>89.47 ± 11.22</u> | **92.43 ± 5.07** |

Viola and Jones is the worst performer, with the highest variance. Constrained local models and the supervised descent method are acceptable but have high variance. The proposed method is the best performer. Higher is better for both metrics. Bold: best performer. Underline: second-best performer.

**Table 1.** Face detection rates for the Motor Trend Magazine's Best Driver Car of the Year.

#### **3.2. Local anisotropic inhibited binary patterns in three orthogonal planes**

#### *3.2.1. Gabor filter*

A Gabor filter is a bandpass filter used for edge detection at a specific orientation and scale. Images are typically filtered by many Gabor filters with different parameters, called a bank. The filter is a Gaussian envelope modulated by a sinusoid: when modulated by a cosine, the Gabor filter finds symmetric edges; when modulated by a sine, it finds antisymmetric edges. According to Grigorescu et al. [36], a Gabor filter at a specific orientation and magnitude is:

$$g(x, y; \gamma, \theta, \lambda, \sigma, \phi) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(\frac{2\pi x'}{\lambda} + \phi\right) \tag{8}$$

where *γ* is the spatial aspect ratio that affects the eccentricity of the filter; *θ* is the angle parameter that tunes the orientation; and *λ* is the wavelength parameter that tunes the filter to a specific spatial frequency, or magnitude (in pattern recognition this is also referred to as scale). *σ* is the standard deviation of the Gaussian envelope and determines the size of the filter. *φ* is the phase offset, taken at 0 and *π*. *x*' and *y*' are defined as follows:

$$x' = x\cos\theta + y\sin\theta \tag{9}$$

$$y' = -x\sin\theta + y\cos\theta\tag{10}$$

The Gabor filter can be used as a local appearance filter by tuning it to a local neighborhood while still varying the orientation: we fix *σ*/*λ* = 0.56 and vary *θ*. For the rest of the chapter, *g*(*x*, *y*; *θ*, *φ*) denotes *g* with *γ* = 0.5, *λ* = 7.14, and *σ* = 3, with varying *θ* and *φ*. Given an image *I*, the Gabor energy filter is given by:

$$E(x, y; \theta) = \sqrt{\left( (I * g)(x, y; \theta, 0) \right)^2 + \left( (I * g)(x, y; \theta, \pi) \right)^2} \tag{11}$$

which corresponds to the magnitude of filtering the image at the phase values of 0 and *π*.
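As a sketch, Eqs. (8)–(11) can be implemented directly with NumPy/SciPy. The kernel size and function names here are our choices; the phase pair (0, *π*) follows Eq. (11):

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(size, gamma, theta, lam, sigma, phi):
    """Gabor kernel of Eq. (8) sampled on a (size x size) grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)       # Eq. (9)
    yp = -x * np.sin(theta) + y * np.cos(theta)      # Eq. (10)
    return (np.exp(-(xp**2 + gamma**2 * yp**2) / (2.0 * sigma**2))
            * np.cos(2.0 * np.pi * xp / lam + phi))

def gabor_energy(image, theta, gamma=0.5, lam=7.14, sigma=3.0, size=21):
    """Gabor energy of Eq. (11): magnitude over the two phase offsets."""
    r0 = convolve2d(image, gabor_kernel(size, gamma, theta, lam, sigma, 0.0),
                    mode='same', boundary='symm')
    r1 = convolve2d(image, gabor_kernel(size, gamma, theta, lam, sigma, np.pi),
                    mode='same', boundary='symm')
    return np.sqrt(r0**2 + r1**2)
```

The chapter's fixed values *γ* = 0.5, *λ* = 7.14, and *σ* = 3 are used as defaults, so a bank is obtained by calling `gabor_energy` over a set of orientations *θ*.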

#### *3.2.2. Anisotropic‐inhibited Gabor filter*


The original formulation of the Gabor energy filter does not generalize well: it captures all edges and magnitudes within the image, including edges due to noisy background texture, for example, MPEG block-encoding artifacts that present as a grid-like repeating pattern. In facial expression recognition, face morphology causes creases along the face that are not part of the background texture, so a better contour map can be extracted by removing the background texture of the face. To eliminate the background texture detected by the Gabor filter, we build upon the anisotropic Gabor energy filter. To suppress the background texture, we take a weighted Gabor filter:

$$\tilde{g}(x, y; \theta) = (E * w)(x, y; \theta) \tag{12}$$

where the weighting function *w* is:

$$w(x, y) = \frac{1}{\| \mathrm{DoG}(x, y) \|} h\left( \mathrm{DoG}(x, y) \right) \tag{13}$$

where *h*(*x*) = *H*(*x*) *x*, with *H*(*x*) the Heaviside step function, and DoG(·) is the difference of Gaussians:

$$\mathrm{DoG}(x, y; \sigma) = \frac{1}{2\pi K^2 \sigma^2} e^{-\frac{x^2 + y^2}{2 K^2 \sigma^2}} - \frac{1}{2\pi \sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}} \tag{14}$$

*w* resembles a ring. Eq. (12) retrieves the background texture of (*x*, *y*) without the texture of (*x*, *y*) itself by weighting *E* by the ring‐like filter *w*. The resulting anisotropic‐inhibited Gabor filter is described as follows:

$$\hat{g}(x, y; \theta) = h\left( E(x, y; \theta) - \alpha \, \tilde{g}(x, y; \theta) \right) \tag{15}$$

where *α* is a parameter that controls how much of the background texture is removed. *α* ranges from 0 to 1, where 0 indicates no background texture removal and 1 indicates complete background texture removal. The first term of Eq. (15) is the original Gabor energy filter, which captures all edges including background edges. The second term subtracts the weighted Gabor filter scaled by *α*, depending on how much background suppression is needed. We follow [46], where a value of *α* = 1 was empirically selected.

To obtain an image that contains only the strongest edges and corresponding orientations, we take the edges with the strongest magnitude across *N* different orientations:

$$\mathrm{AIGF}(x, y) = \max_{\theta} \hat{g}(x, y; \theta) \tag{16}$$

The resulting output of the anisotropic-inhibited Gabor filter is an image that is *M* × *N*. Results are given in **Figure 2**.

**Figure 2.** (a) Original frame, (b) result of Gabor energy filter (Eq. (15) with *α* = 0), and (c) result of Anisotropic Gabor Energy Filtering.
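Putting Eqs. (12)–(16) together, the pipeline can be sketched as below. The function names are ours, and where the chapter does not pin down values we make explicit assumptions: the DoG scale ratio *K* = 4 and the L1 normalization of the rectified DoG ring:

```python
import numpy as np
from scipy.signal import convolve2d

def _gabor(size, gamma, theta, lam, sigma, phi):
    # Gabor kernel of Eq. (8).
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)
    yp = -x * np.sin(theta) + y * np.cos(theta)
    return (np.exp(-(xp**2 + gamma**2 * yp**2) / (2.0 * sigma**2))
            * np.cos(2.0 * np.pi * xp / lam + phi))

def _energy(image, theta, gamma=0.5, lam=7.14, sigma=3.0, size=21):
    # Gabor energy of Eq. (11), phases 0 and pi per the chapter.
    r0 = convolve2d(image, _gabor(size, gamma, theta, lam, sigma, 0.0),
                    mode='same', boundary='symm')
    r1 = convolve2d(image, _gabor(size, gamma, theta, lam, sigma, np.pi),
                    mode='same', boundary='symm')
    return np.sqrt(r0**2 + r1**2)

def dog_ring(size, sigma, K=4.0):
    # Half-wave-rectified, normalized DoG ring: Eqs. (13)-(14).
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    r2 = x**2 + y**2
    dog = (np.exp(-r2 / (2.0 * K**2 * sigma**2)) / (2.0 * np.pi * K**2 * sigma**2)
           - np.exp(-r2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2))
    w = np.maximum(dog, 0.0)                 # h(x) = H(x) x
    return w / (np.abs(w).sum() + 1e-12)     # normalization (assumed L1)

def aigf(image, thetas, alpha=1.0, sigma=3.0, size=21):
    """Anisotropic-inhibited Gabor filter: Eqs. (12), (15), (16)."""
    w = dog_ring(size, sigma)
    best = None
    for theta in thetas:
        E = _energy(image, theta)
        g_tilde = convolve2d(E, w, mode='same', boundary='symm')   # Eq. (12)
        g_hat = np.maximum(E - alpha * g_tilde, 0.0)               # Eq. (15)
        best = g_hat if best is None else np.maximum(best, g_hat)  # Eq. (16)
    return best
```

With `alpha=1.0`, as selected empirically in [46], the surround response is fully subtracted before the half-wave rectification keeps only the surviving (foreground) edges.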

We build upon the work in Ref. [46], but the proposed approach is significantly different. In addition to the anisotropic-inhibited Gabor filter (AIGF), the method of Ref. [46] computes the orientations corresponding to the maximal edge responses as follows:

$$\Theta(x, y) = \operatorname*{argmax}_{\theta} \hat{g}(x, y; \theta) \tag{17}$$

A soft histogram is computed from Θ with votes weighted by the maximal edge response *AIGF*. For the proposed approach, we use *AIGF* and do not compute a soft histogram.

## *3.2.3. Local binary patterns*

Local binary patterns (LBP) encode local appearance as a microtexture code. The code is a function of comparisons with the intensity values of neighboring pixels. Some formulations are invariant to rotation and to monotonic grayscale transformations [31]. At present, LBP and its many variations are among the most widely used feature descriptors for facial expression recognition. LBP results in a texture descriptor with dimensionality 2<sup>*n*</sup>, where *n* is a parameter that controls the number of pixel neighbors. The LBP code of a pixel at (*x*, *y*) is given as follows:

$$\mathrm{LBP}(x, y) = \sum_{(u, v) \in N_{x,y}^{LBP}} \mathrm{sign}\left( I(u, v) - I(x, y) \right) \times 2^{q} \tag{18}$$

where (*u*, *v*) iterates over the points in the neighborhood *N<sub>x,y</sub><sup>LBP</sup>* of (*x*, *y*) (see **Figure 3A**); sign(·) is the sign of the expression; and *q* is a counter starting from 0 that increments on each iteration. 2<sup>*q*</sup> encodes the result of the intensity difference in a specific bit. A histogram is taken for further compactness and tolerance to registration errors: each pixel in *I* is encoded with an LBP code from Eq. (18), then an *n*-level histogram is extracted from the LBP image. Typically, the image is segmented into nonoverlapping regions and a histogram is extracted from each region [47]. While powerful and effective for static images, LBP lacks the ability to capture temporal changes in continuous video data.

**Figure 3.** (A) In LBP, microtexture is encoded in the XY‐plane. (B) In VLBP, this is extended to the spatiotemporal domain by including neighbors in the three planes parallel to the current frame. (C) In LBP‐TOP, local binary patterns are separately extracted in three orthogonal planes and the resultant histograms are concatenated. This greatly reduces feature vector size over treating the volume as a 3D microtexture.
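A minimal sketch of Eq. (18) for the 8-neighbor case follows. The neighbor ordering and border handling here are our choices, and, as in standard LBP, a nonnegative intensity difference maps to bit value 1:

```python
import numpy as np

def lbp_image(I):
    """8-neighbor LBP code (Eq. (18)) for each interior pixel of a 2D image.

    Each neighbor comparison sets bit q of the code; border pixels are
    skipped by cropping to the interior.
    """
    I = I.astype(int)
    # Fixed ordering of the 8 neighbors around (x, y); q increments per neighbor.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = I[1:-1, 1:-1]
    code = np.zeros_like(center)
    for q, (du, dv) in enumerate(offsets):
        neighbor = I[1 + du: I.shape[0] - 1 + du, 1 + dv: I.shape[1] - 1 + dv]
        code += ((neighbor - center) >= 0).astype(int) << q   # sign(.) * 2^q
    return code

def lbp_histogram(I, bins=256):
    """Histogram of LBP codes, used as the region texture descriptor."""
    codes = lbp_image(I)
    hist, _ = np.histogram(codes, bins=bins, range=(0, bins))
    return hist
```

In practice the image is split into nonoverlapping regions and `lbp_histogram` is applied per region, with the per-region histograms concatenated into the final descriptor.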
