**3.1. Reference‐based face detection**

Reference‐based face detection consists of two phases: (1) in the training phase, a reference face is computed with the avatar reference image concept. This face represents a well‐extracted face and is used to quantify the quality of detection results in the next step. (2) In testing, multiple candidate face ROIs are detected, and the candidate ROI that best matches the reference face in the least‐squares sense is selected for further processing. Three different methodologies for finding the face ROI are combined: a boosted cascade of Haar‐like features (Viola and Jones (VJ)) [3], a discriminative parameterized appearance model (SIFT landmark points matched with iterative least squares), and a parts‐based deformable model. VJ was selected because of its ubiquitous use in the field of face analysis. Discriminative parameterized appearance models were recently deployed in commercial software [38]. Parts‐based deformable models have shown promise for face ROI extraction in the wild [39]. Despite the success of currently used methods, there is still much room for improvement. In the Motor Trend data, there are segments of video where one extractor succeeds when the others fail. Therefore, better performance can be achieved by unifying these three methods to generate multiple candidate face ROIs and quantitatively determining which candidate is the best ROI. Note that Refs. [38, 39] use VJ for an initial bounding box, so running more than one face detector is not excessive for state‐of‐the‐art approaches.

#### *3.1.1. Reference‐based face detection in training*

The avatar reference image concept generates a reference image of an expressionless face. It was previously used for registration [40] and learning [41]. A proof of optimality of the avatar image concept is given in previous work [42]. Let *I* be an image in the training data *D*. To estimate the avatar reference image *R*ARI(*x*, *y*), take the mean across all face images:

$$R\_{\rm ARI}(x, y) = \frac{1}{N\_{\rm D}} \sum\_{i=1}^{N\_{\rm D}} I\_{i}(x, y) \tag{1}$$

where *N*D is the number of training images, (*x*, *y*) is a pixel location, and *I*i is the *i*‐th image in the dataset *D*. The process iterates by rewarping *D* to *R*ARI to create a more refined estimate of the reference face. The procedure is as follows: (1) compute the reference using Eq. (1) from all training ROIs in *D*, (2) warp all images in *D* to the reference, and (3) recompute Eq. (1) using the warped images from the previous step. Steps (2) and (3) are iterated three times, a count empirically selected in Ref. [40]. Results of the reference face at different iterations are shown in **Figure 1**. SIFT‐Flow warps the images in step (2); the reader is referred to [43] for a full description of SIFT‐Flow. In short, a dense, per‐pixel SIFT feature warp is computed with loopy belief propagation. After this point, *R*ARI represents a well‐extracted reference face.

**Figure 1.** Iterative refinement of the avatar reference face. It represents a well‐extracted face.
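The iterative refinement described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `avatar_reference` and the `warp` callback are hypothetical, and the SIFT‐Flow warp of step (2) is stood in for by a pluggable callable (identity by default), since reimplementing SIFT‐Flow is out of scope here.

```python
import numpy as np

def avatar_reference(faces, n_iters=3, warp=None):
    """Estimate the avatar reference face R_ARI by iterated averaging.

    faces   : list of equally sized grayscale face ROIs (2-D float arrays)
    n_iters : number of warp/average refinement rounds (3, per the text)
    warp    : callable (image, reference) -> warped image; a stand-in for
              SIFT-Flow, which is not reimplemented in this sketch
    """
    faces = [np.asarray(f, dtype=np.float64) for f in faces]
    if warp is None:
        # identity warp as a placeholder for SIFT-Flow
        warp = lambda img, ref: img
    # Eq. (1): initial reference is the pixelwise mean over the training set
    reference = np.mean(faces, axis=0)
    for _ in range(n_iters):
        warped = [warp(f, reference) for f in faces]   # step (2): warp D to reference
        reference = np.mean(warped, axis=0)            # step (3): recompute Eq. (1)
    return reference
```

With a real SIFT‐Flow warp plugged in, each round aligns the training ROIs to the current mean before re‐averaging, which progressively sharpens the reference face as in **Figure 1**.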

#### *3.1.2. Reference‐based face detection in testing*

To robustly detect a face, three different pipelines simultaneously extract the ROI. We fuse a discriminative parameterized appearance model, a parts‐based deformable model, and the Viola and Jones framework. In Viola and Jones (VJ), detection of the face is carried out with a boosted cascade of Haar‐like features. Because of the near‐standard use of VJ, we omit an in‐depth explanation of the method; the reader is referred to [3] for the details of the algorithm.
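Once the three pipelines produce candidate ROIs, the testing phase reduces to picking the candidate closest to the reference face in the least‐squares sense. The sketch below illustrates that selection step only; the function name `select_best_roi` is hypothetical, and it assumes the candidates have already been resized to the reference's dimensions.

```python
import numpy as np

def select_best_roi(candidates, reference):
    """Pick the candidate face ROI closest to the reference face
    in the least-squares sense (sum of squared pixel differences).

    candidates : list of candidate ROIs (one per detector), already
                 resized to the reference's shape
    reference  : avatar reference face R_ARI from the training phase
    """
    reference = np.asarray(reference, dtype=np.float64)
    errors = [np.sum((np.asarray(c, dtype=np.float64) - reference) ** 2)
              for c in candidates]
    best = int(np.argmin(errors))          # smallest squared error wins
    return best, candidates[best]
```

In practice the candidate list would hold the outputs of the VJ, appearance‐model, and deformable‐model detectors, so segments where one extractor fails are covered by whichever candidate remains closest to the reference.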
