10 Emotion and Attention Recognition Based on Biological Signals and Images

*3.1.1. Reference‐based face detection in training*

The avatar reference image concept generates a reference image of an expressionless face. It was previously used for registration [40] and learning [41]. A proof of optimality of the avatar image concept is given in previous work [42]. Let *I* be an image in the training data *D*. To estimate the avatar reference image *R<sub>ARI</sub>*(*x*, *y*), take the mean across all face images:

$$R_{ARI}(x, y) = \frac{1}{N_D} \sum_{i \in D} I_i(x, y) \tag{1}$$

where *N<sub>D</sub>* is the number of training images; (*x*, *y*) is a pixel location; and *I<sub>i</sub>* is the *i*‐th image in the dataset *D*. The process iterates by rewarping *D* to *R<sub>ARI</sub>* to create a more refined estimate of the reference face. The procedure is as follows: (1) compute the reference from all training ROIs *D* using Eq. (1); (2) warp all images in *D* to the reference; and (3) recompute Eq. (1) using the warped images from the previous step. Steps (2) and (3) are iterated three times, a number selected empirically in Ref. [40]. Results of the reference face at different iterations are shown in **Figure 1**. SIFT‐Flow performs the warping in step (2); the reader is referred to [43] for a full description. In short, a dense, per‐pixel SIFT feature warp is computed with loopy belief propagation. After this point, *R<sub>ARI</sub>* represents a well‐extracted reference face.

**Figure 1.** Iterative refinement of the avatar reference face. It represents a well‐extracted face.

*3.1.2. Reference‐based face detection in testing*

To robustly detect a face, three different pipelines simultaneously extract the ROI. We fuse a discriminative parameterized appearance model, a part‐based deformable model, and the reference‐based approach of Section 3.1.1, since there are portions of video where one extractor will succeed when others fail. Therefore, better performance can be achieved by unifying these three methods to generate multiple candidate face ROIs and quantitatively determine which candidate is the best ROI. Note that Refs. [38, 39] use VJ for an initial bounding box, so running more than one face detector is not excessive for state‐of‐the‐art approaches.

*3.1.2.1. Discriminative parameterized appearance model*

Consider a sparse appearance model of the face. The face detection problem can be framed as an optimization problem that fits the landmark points representing the face. A face is successfully detected when the gradient descent in the fitness space of the optimization problem is complete. Traversing the fitness space can be viewed as a supervised learning problem [38], rather than carrying out gradient descent with the Gauss‐Newton algorithm [44]. In the training phase, the following objective is minimized:

$$\min_{w} \left\| s(p + w(p)) - s(p^*) \right\| \tag{2}$$

where *s* is a function that computes SIFT features; *w* is a flow vector to be optimized; *p*\* denotes the manually labeled landmark points; and the vector *p* has horizontal and vertical components **p** = (*x*, *y*). Computing the Hessian of the model is computationally undesirable, and supervised learning of the descent toward *p*\* avoids computing it directly. In testing, face alignment is carried out with linear least squares.
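As an illustrative sketch only (not the implementation of Refs. [38, 44]), the supervised descent idea behind Eq. (2) can be demonstrated on synthetic data: a linear regressor, fit by least squares rather than Gauss‐Newton, maps features sampled at the current landmark guess to an update that moves the landmarks toward *p*\*. The feature function `s` below is a toy stand‐in for SIFT, and all names and data here are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32

# Two synthetic "feature channels" standing in for SIFT responses:
# channel 0 encodes the x coordinate, channel 1 the y coordinate, so a
# linear regressor can recover a landmark's position from its features.
gx, gy = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
chan = np.stack([gx, gy])

def s(pts):
    """Toy stand-in for the SIFT feature function s(p): sample each
    channel at every (rounded, clipped) landmark, append a bias term."""
    out = []
    for x, y in pts.reshape(-1, 2):
        xi = int(np.clip(np.rint(x), 0, W - 1))
        yi = int(np.clip(np.rint(y), 0, H - 1))
        out += [chan[0, yi, xi], chan[1, yi, xi]]
    return np.array(out + [1.0])

n_pts = 5
p_star = rng.uniform(8.0, 24.0, size=2 * n_pts)  # "ground-truth" landmarks

# Training pairs: a perturbed guess p and the correction p* - p it needs.
X, Y = [], []
for _ in range(200):
    p = p_star + rng.normal(0.0, 3.0, size=2 * n_pts)
    X.append(s(p))
    Y.append(p_star - p)

# One linear descent step, delta_p ~= s(p) @ R, learned by least squares
# -- the spirit of minimizing Eq. (2) in a supervised fashion.
R, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)

# Testing: the learned step pulls a new perturbed guess toward p*.
p0 = p_star + rng.normal(0.0, 3.0, size=2 * n_pts)
p1 = p0 + s(p0) @ R
```

The toy channels merely make the feature‐to‐update mapping exactly linear; the method of Ref. [38] learns a cascade of such regression steps on real SIFT features.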

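For concreteness, the training‐time construction of Section 3.1.1 — Eq. (1) followed by the warp/recompute loop — can be sketched as below. This is an assumption‐laden sketch, not the chapter's pipeline: SIFT‐Flow is replaced by a hypothetical `warp_to` that only searches integer translations, and the "face" ROIs are synthetic blobs.

```python
import numpy as np

rng = np.random.default_rng(1)

def warp_to(img, ref, max_shift=4):
    """Hypothetical stand-in for SIFT-Flow: brute-force the integer
    translation (within +/- max_shift pixels) that best matches `ref`.
    The real method computes a dense per-pixel SIFT warp via loopy BP."""
    best, best_err = img, np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            cand = np.roll(img, (dy, dx), axis=(0, 1))
            err = np.sum((cand - ref) ** 2)
            if err < best_err:
                best, best_err = cand, err
    return best

def avatar_reference(D, n_iter=3):
    """Eq. (1) plus iterative refinement: (1) average all training ROIs,
    (2) warp each ROI to the current reference, (3) re-average.
    Steps (2)-(3) repeat n_iter times (three, as chosen in Ref. [40])."""
    ref = D.mean(axis=0)                           # Eq. (1)
    for _ in range(n_iter):
        warped = [warp_to(img, ref) for img in D]  # step (2)
        ref = np.mean(warped, axis=0)              # step (3)
    return ref

# Synthetic "face" ROIs: the same bright blob, randomly translated.
base = np.zeros((16, 16))
base[6:10, 6:10] = 1.0
D = np.stack([np.roll(base, (rng.integers(-2, 3), rng.integers(-2, 3)),
                      axis=(0, 1)) for _ in range(10)])

ref = avatar_reference(D)  # sharper than the plain mean of D
```

Because every synthetic ROI here is the same blob up to translation, one refinement pass should already align them all; real faces need the dense per‐pixel warp and the three iterations illustrated in Figure 1.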