**Audio‐Visual Speaker Tracking**

Volkan Kılıç and Wenwu Wang

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.68146

#### Abstract

[33] Memsic website. Available from: http://www.memsic.com/wireless-sensor-networks/TPR2420 [34] Sensing and Processing Across Networks (SPAN). Lab Spin website. Available from:

[35] Huan T, Liao Q, Zhang J. An improved algorithm of multiple robots cooperation in obstacle existing environment. In: Proceedings of 2007 IEEE International Conference on

[36] Huan T. Imitation learning and behavior generation in a robot team. In: Ani Hsieh M, Chirikjian G, editors. Distributed Autonomous Robotic Systems, the 11th International Symposium, Springer Tracts in Advanced Robotics (STAR). Vol. 104; Springer Berlin

Robotics and Biomimetics (ROBIO); Sanya; 2007. pp. 1001–1006

http://span.ece.utah.edu/spin

44 Motion Tracking and Gesture Recognition

Heidelberg; 2014. pp. 423–434

Target motion tracking found its application in interdisciplinary fields, including but not limited to surveillance and security, forensic science, intelligent transportation system, driving assistance, monitoring prohibited area, medical science, robotics, action and expression recognition, individual speaker discrimination in multi-speaker environments and video conferencing in the fields of computer vision and signal processing. Among these applications, speaker tracking in enclosed spaces has been gaining relevance due to the widespread advances of devices and technologies and the necessity for seamless solutions in real-time tracking and localization of speakers. However, speaker tracking is a challenging task in real-life scenarios as several distinctive issues influence the tracking process, such as occlusions and an unknown number of speakers. One approach to overcome these issues is to use multi-modal information, as it conveys complementary information about the state of the speakers compared to single-modal tracking. To use multi-modal information, several approaches have been proposed which can be classified into two categories, namely deterministic and stochastic. This chapter aims at providing multimedia researchers with a state-of-the-art overview of tracking methods, which are used for combining multiple modalities to accomplish various multimedia analysis tasks, classifying them into different categories and listing new and future trends in this field.

Keywords: audio-visual tracking, multi-speaker tracking, deterministic, stochastic approaches

#### 1. Introduction

Speaker tracking aims at localizing the moving speakers in a scene by analysing the data sequences captured by sensors or arrays of sensors. It gained relevance in the past decades due to its widespread applications such as automatic camera steering in video conferencing

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

[1], individual speaker discriminating in multi-speaker environments [2], acoustic beamforming [3], audio-visual speech recognition [4], video indexing and retrieval [5], humancomputer interaction [6], and surveillance and monitoring [7] in security applications. There are numerous challenges, which make speaker tracking a difficult task including, but not limited to, the estimation of the variable number of speakers and their states, and dealing with various conditions such as occlusions, limited view of cameras, illumination change and room reverberations [8–10].

Using multi-modal information is one way to address these challenges since more comprehensive observations for the state of the speakers can be collected in multi-modal tracking as compared to the single-modal case, and the collection of the multi-modal information can be achieved by sensors such as audio, video, thermal vision, laser-range finders and radio-frequency identification (RFID) [11–13]. Among these sensors, audio and video sensors are commonly used in speaker tracking compared to others, because of their easier installation, cheaper cost and more data-processing tools [9, 14, 15].

Earlier methods in speaker tracking employ either visual-only or audio-only data, and each modality offers some advantages but is also limited by some weaknesses [16, 17]. Tracking with only video [16–18] offers robust and accurate performance when the camera field of view covers the speakers. However, it degrades when the occlusion between speakers happens, when the speakers go out of the camera field of view, or any changes on illumination or target appearance have occurred. Although audio tracking [19–21] is not restricted by these limitations, it has a tendency to non-negligible-tracking errors because of intermittency of audio data. In addition, audio data may be corrupted by background noise and room reverberations. Nevertheless, the combination of audio and video data may improve the tracking performance when one of the modalities is missing or neither provides accurate measurements, as audio and visual modalities are often complementary to each other which can be exploited to further enhance their respective strengths and mitigate their weaknesses in tracking.

Previous techniques were focused on tracking a single person in a static and controlled environment. However, theoretical and algorithmic advances together with the increasing capability in computer processing have led to the emergence of more sophisticated techniques for tracking multiple speakers in dynamic and less controlled (or natural) environments [22–24]. In addition, the type of sensors used to collect the measurements is advanced from single- to multi-modal.

In the literature, there are many approaches for speaker tracking using multi-modal information, which can be categorized into two methods as one is deterministic and data-driven while the other is stochastic and model-driven [25, 26]. Deterministic approaches are considered as an optimization problem by minimizing a cost function, which needs to be defined appropriately. A representative method in this category is the mean-shift method [27, 28], which defines the cost function in terms of colour similarity measured by Bhattacharyya distance. The stochastic and model-driven approaches use a state-space approach based on the Bayesian framework as it is suitable for processing of multi-modal information [29]. Representative methods are the Kalman filter (KF) [30], extended KF (EKF) and particle filter (PF) [31]. The PF approach is more robust for non-linear and non-Gaussian models as compared with the KF and EKF approaches since it easily approaches the Bayesian optimal estimate with a sufficiently large number of particles [11].

[1], individual speaker discriminating in multi-speaker environments [2], acoustic beamforming [3], audio-visual speech recognition [4], video indexing and retrieval [5], humancomputer interaction [6], and surveillance and monitoring [7] in security applications. There are numerous challenges, which make speaker tracking a difficult task including, but not limited to, the estimation of the variable number of speakers and their states, and dealing with various conditions such as occlusions, limited view of cameras, illumination change and room

Using multi-modal information is one way to address these challenges since more comprehensive observations for the state of the speakers can be collected in multi-modal tracking as compared to the single-modal case, and the collection of the multi-modal information can be achieved by sensors such as audio, video, thermal vision, laser-range finders and radio-frequency identification (RFID) [11–13]. Among these sensors, audio and video sensors are commonly used in speaker tracking compared to others, because of their easier installation,

Earlier methods in speaker tracking employ either visual-only or audio-only data, and each modality offers some advantages but is also limited by some weaknesses [16, 17]. Tracking with only video [16–18] offers robust and accurate performance when the camera field of view covers the speakers. However, it degrades when the occlusion between speakers happens, when the speakers go out of the camera field of view, or any changes on illumination or target appearance have occurred. Although audio tracking [19–21] is not restricted by these limitations, it has a tendency to non-negligible-tracking errors because of intermittency of audio data. In addition, audio data may be corrupted by background noise and room reverberations. Nevertheless, the combination of audio and video data may improve the tracking performance when one of the modalities is missing or neither provides accurate measurements, as audio and visual modalities are often complementary to each other which can be exploited to further

Previous techniques were focused on tracking a single person in a static and controlled environment. However, theoretical and algorithmic advances together with the increasing capability in computer processing have led to the emergence of more sophisticated techniques for tracking multiple speakers in dynamic and less controlled (or natural) environments [22–24]. In addition, the type of sensors used to collect the measurements is advanced from single- to

In the literature, there are many approaches for speaker tracking using multi-modal information, which can be categorized into two methods as one is deterministic and data-driven while the other is stochastic and model-driven [25, 26]. Deterministic approaches are considered as an optimization problem by minimizing a cost function, which needs to be defined appropriately. A representative method in this category is the mean-shift method [27, 28], which defines the cost function in terms of colour similarity measured by Bhattacharyya distance. The stochastic and model-driven approaches use a state-space approach based on the Bayesian framework as it is suitable for processing of multi-modal information [29]. Representative methods are the Kalman filter (KF) [30], extended KF (EKF) and particle filter (PF) [31]. The PF approach is more robust for non-linear and non-Gaussian models as compared with the KF

enhance their respective strengths and mitigate their weaknesses in tracking.

reverberations [8–10].

46 Motion Tracking and Gesture Recognition

multi-modal.

cheaper cost and more data-processing tools [9, 14, 15].

One challenge in the implementation of the PF to tracking problem is to choose an optimal number of particles [9, 32]. An insufficient number may introduce a particle impoverishment, while a larger number (than required) will lead to extra computational cost. Therefore, choosing the optimal number of particles is one of the issues that affect the performance of the tracker. To address this issue and to find the optimal number of particles for the PF to use, adaptive particle filtering (A-PF) approaches have been proposed in Refs. [9, 32–35]. Fox [34] proposed KLD sampling, which aims to bind the error introduced by the sample-based representations of the PF using the Kullback-Leibler divergence between maximum likelihood estimates (MLEs) of the states and the underlying distribution to optimize the number of particles. The KLD-sampling criterion is improved in Ref. [35] for the estimation of the number of particles, leading to an approach for adaptive propagation of the samples. Subsequent work [33] introduces the innovation error to estimate the number of particles by employing a twofold metric. The particles are removed by the first metric in case their distance to a neighbouring particle is smaller than a predefined threshold. The second metric is used to set the threshold on the innovation error in order to control the birth of the particles. These two thresholds need to be set before the algorithm is run. A new approach is proposed in Refs. [9, 32], which estimates noise variance besides the number of particles in an adaptive manner. Different from other existing adaptive approaches, adaptive noise variance is employed in this method for the estimation of the optimal number of particles based on tracking error and the area occupied by the particles in the image.

One assumption in the traditional PF used in multi-speaker tracking is that the number of speakers is known and invariant during the tracking. In practice, the presence of the speakers may change in a random manner, resulting in time-varying number of speakers. To deal with the unknown and variable number of speakers, the theory of random finite sets (RFSs) has been introduced, which allows multi-speaker filtering by propagation of the multi-speaker posterior [36–39]. However, the computational complexity of RFS grows exponentially as the number of speakers increases since the complexity order of the RFS is <sup>O</sup>ðM<sup>Ξ</sup><sup>Þ</sup> where <sup>M</sup> is the number of measurements and Ξ is the number of speakers. The PHD filtering [40] approach is proposed to overcome this problem, as the first-order approximation of the RFS whose complexity scales linearly with the number of speakers since the complexity order of the PHD is OðMΞÞ. This framework has been found to be promising for multi-speaker tracking [36]. However, the PHD recursion involves multiple integrals that need to have closed-form solutions for implementation. So far, two analytic solutions have been proposed: Gaussian mixture PHD (GM-PHD) filter [41, 42] and sequential Monte Carlo PHD (SMC-PHD) filter [43, 44]. Applications of GM-PHD filter are limited by linear Gaussian systems, which lead us to consider SMC-PHD filter to handle non-linear/non-Gaussian problems in audio-visual tracking [15, 45].

Apart from the stochastic methodologies mentioned above, the mean-shift [28] is a deterministic and data-driven method, which focuses on target localization using representation of the target. The mean-shift easily convergences to peak of the function with a high speed and a small computational load. Moreover, as a non-parametric method, the solution of the mean shift is independent from the features used to represent the targets. On the other hand, the performance of the mean-shift is degraded by occlusion or clutter as it searches the densest (most similar) region starting from the initial position in the region of interest. In this sense, the mean-shift trackers may fail easily in tracking small- and fast-moving targets as the region of interest may not cover the targets, which results in a track being lost after a complete occlusion. Also, it is formulated for single-target tracking, so it cannot handle a variable number of targets. Therefore, several methods [14, 15, 46–49] have been proposed by integrating both deterministic and stochastic approaches to benefit their respective strengths which will be discussed in Section 4.

#### 2. Tracking modalities

#### 2.1. Visual cues

Visual tracking is a challenging task in real-life scenarios, as the performance of a tracker is affected by the illumination conditions, occlusion by background objects and fast and complicated movements of the target [50, 51]. To address these problems, several visual features, that is, colour, texture, contour and motion [52], are employed in existing tracking systems.

Using colour feature is a very intuitive approach and commonly applied in target tracking as the information provided by colour helps to distinguish between targets and other objects. Several approaches can be found in the literature which employs colour information to track the target. In Ref. [53], a colour mixture model based on a Gaussian distribution is used for tracking and segmentation, while in Ref. [58], an adaptive mixture model is developed. Target detection and tracking can be easily maintained using colour information if the colour of the target is distinct from those of the background or other objects.

Another approach for tracking is contour-based where shape matching or contour-evolution techniques [54] are used to track the target contour. Active models like snakes, geodesic-active contours, B-splines or meshes [55] can be employed to represent the contours. Occlusion of the target by other objects is the common problem in tracking. This problem can be addressed by detecting and tracking the contour of the upper body [56] rather than tracking the contour of the whole bodies, which leads to the detection of a new person as the upper bodies are often distinguishable from back and front view for different people.

Texture is another cue defined as a measure for surface intensity variation. Properties like smoothness and regularity can be quantified by the texture [57–59]. The texture feature is used with Gabor wavelet in Ref. [60]. The Gabor filters can be employed as orientation and scaletunable edge and line detectors, and the statistics of these micro-features are mostly used to characterize the underlying texture information in a given region [61]. For improved detection and recognition, local patterns of image have gained attention recently. Local patterns are used in several application areas such as image classification and face detection since they offer promising results. In Ref. [62], the local binary patterns (LBPs) method is used to create a type of texture descriptor based on a grey-scale-invariant texture measure. Such a measure is tolerant to illumination changes.

Another cue used in tracking, particularly in indoor environments, is motion which is an explicit cue of human presence. One way to extract this cue is to apply foreground detection algorithms. A simple method for foreground detection is to compute the difference of two consecutive frames which gives the moving part of the image. Although it has been used in multi-modal-tracking systems [63], it fails when the person remains stationary since the person is considered part of background after some time.

The scale-invariant feature transform (SIFT) proposed in Ref. [64] has found wide use in tracking applications. SIFT uses local features to transform the image data into scale-invariant coordinates. Distinctive invariant features are extracted from images to provide matching between several views of an object. The SIFT feature is invariant to scaling, translation, clutter, rotation, occlusion and lighting which makes it robust to changes in three-dimensional (3D) viewpoint and illumination, and the presence of noise. Even a single feature has high matching rate in a large database because the SIFT features are generally distinctive. On the other hand, non-rigid targets [65] in noisy environments degrade the SIFT matching rate and recognition performance.

So far, several visual cues were introduced, and among them colour cues have been used more commonly in tracking applications due to their easy implementation and low complexity. Colour information can be used in the calculation of the histogram of possible targets at the initialization step as reference images which can be used in detection and tracking of the target. There are two common colour histogram models, RGB or HSV [66] in the literature and HSV is more preferable since it is observed to be more robust to illumination variation [9].

#### 2.2. Audio cues

shift is independent from the features used to represent the targets. On the other hand, the performance of the mean-shift is degraded by occlusion or clutter as it searches the densest (most similar) region starting from the initial position in the region of interest. In this sense, the mean-shift trackers may fail easily in tracking small- and fast-moving targets as the region of interest may not cover the targets, which results in a track being lost after a complete occlusion. Also, it is formulated for single-target tracking, so it cannot handle a variable number of targets. Therefore, several methods [14, 15, 46–49] have been proposed by integrating both deterministic and stochastic approaches to benefit their respective strengths which will be

Visual tracking is a challenging task in real-life scenarios, as the performance of a tracker is affected by the illumination conditions, occlusion by background objects and fast and complicated movements of the target [50, 51]. To address these problems, several visual features, that

Using colour feature is a very intuitive approach and commonly applied in target tracking as the information provided by colour helps to distinguish between targets and other objects. Several approaches can be found in the literature which employs colour information to track the target. In Ref. [53], a colour mixture model based on a Gaussian distribution is used for tracking and segmentation, while in Ref. [58], an adaptive mixture model is developed. Target detection and tracking can be easily maintained using colour information if the colour of the

Another approach for tracking is contour-based where shape matching or contour-evolution techniques [54] are used to track the target contour. Active models like snakes, geodesic-active contours, B-splines or meshes [55] can be employed to represent the contours. Occlusion of the target by other objects is the common problem in tracking. This problem can be addressed by detecting and tracking the contour of the upper body [56] rather than tracking the contour of the whole bodies, which leads to the detection of a new person as the upper bodies are often

Texture is another cue defined as a measure for surface intensity variation. Properties like smoothness and regularity can be quantified by the texture [57–59]. The texture feature is used with Gabor wavelet in Ref. [60]. The Gabor filters can be employed as orientation and scaletunable edge and line detectors, and the statistics of these micro-features are mostly used to characterize the underlying texture information in a given region [61]. For improved detection and recognition, local patterns of image have gained attention recently. Local patterns are used in several application areas such as image classification and face detection since they offer promising results. In Ref. [62], the local binary patterns (LBPs) method is used to create a type of texture descriptor based on a grey-scale-invariant texture measure. Such a measure is

is, colour, texture, contour and motion [52], are employed in existing tracking systems.

target is distinct from those of the background or other objects.

distinguishable from back and front view for different people.

tolerant to illumination changes.

discussed in Section 4.

2.1. Visual cues

2. Tracking modalities

48 Motion Tracking and Gesture Recognition

There are a variety of audio information that could be used in audio tracking such as sound source localization (SSL), time-delay estimation (TDE) and the direction of arrival (DOA) angle.

The audio source localization methods can be divided into three categories [67], namely steered beamforming, super-resolution spectral estimation and time-delay estimation. Beamformer-based source localization offers comparatively low resolution and needs a search over a highly non-linear surface [20]. Also, it is computationally expensive which may be limited in real-time applications. Super-resolution spectral estimation methods are not well suited for locating a moving speaker since it is under the assumption that the speaker location is fixed for a number of frames [68]. However, the location of a moving speaker may change considerably over time. In addition, these methods are not robust to modelling errors caused by room reverberation and mostly have high computational cost [20, 69]. The time-delay of arrival (TDOA)-based location estimators use the relative time delay between the wave-front arrivals at microphone positions in order to estimate the location of the speaker. As compared with the other two methods, the TDOA-based approach has advantages in the following two aspects. The first one is its computational efficiency and the second one its direct connection to the speaker location.

The problem of DOA estimation is similar to that of the TDOA estimation. To estimate the DOA, the TDOA needs to be determined between the sensor elements of the microphone array. Estimation of source locations mainly depends on the quality of the DOA measurements. In the literature, several DOA estimation techniques such as the MUSIC algorithm [70] and the coherent signal subspace (CSS) [71] have been proposed. The main differences between them are the way of dealing with reverberation, background noise and movement of the sources [20]. The following three factors influence the quality of the DOA estimation. The spectral content of the speech segment is considered as the first one which is used for derivation of the DOAs. The reverberation level of the room is the second one which causes outlier in the measurements because of the reflections from the objects and walls. The positions of the microphone array to the speakers and the number of simultaneous sources in the field are considered the third factor.

#### 3. Audio-visual speaker tracking

Speaker tracking is a fundamental part of multimedia applications which plays a critical role to determine the speaker trajectories and analyse the behaviour of speakers. Speaker tracking can be accomplished with the use of audio-only, visual-only or audio-visual information.

Audio-only information based approaches for speaker tracking have been presented in [19, 20, 37, 72–74]. An audio-based fusion scheme was proposed in Ref. [20] to detect multiple speakers where the locations from multiple microphone arrays are estimated and fused to determine the state of the same speaker. Separate KFs are employed for all the individual microphone arrays for the location estimation. To deal with motion of the speaker and measurement uncertainty, the probabilistic data association technique is used with an interacting model.

One issue in Ref. [20] is that it cannot deal with the tracking problem for a time-varying number of speakers. Ma et al. [37, 72] proposed an approach based on random finite set to track an unknown and time-varying number of speakers. The RFS theory and SMC implementation are used to develop the Bayesian RFS filter, which tracks the time-varying number of speakers and their states. The random finite set theory can deal with a time-varying number of speakers; however, the maximum number of speakers that can be handled is limited as its computational complexity increases exponentially with the number of speakers. In that sense, a cardinalized PHD (CPHD) filter is proposed in Ref. [74], which is the first-order approximation of the RFS, to reduce the computational cost caused by the number of speakers. The positions of the speakers are estimated using TDOA measurements from microphone pairs by asynchronous sensor fusion with the CPHD filter.

A time-frequency method and the PHD filter are used in Ref. [73] to localize and track simultaneous speakers. The location of multiple speakers is estimated based on the timefrequency method, which uses an array of three microphones, then the PHD filter is employed to the localization results as post-processing to handle miss-detection and clutters.

Speaker tracking with multi-modal information has also gained attention, and many approaches have been proposed in the past decade using audio-visual information [2, 6, 23, 29, 75–81], providing the complementary characteristics of each modality. The differences among these existing works arise from the overall objective such as tracking either single or multiple speakers and the specific detection/tracking framework.

array. Estimation of source locations mainly depends on the quality of the DOA measurements. In the literature, several DOA estimation techniques such as the MUSIC algorithm [70] and the coherent signal subspace (CSS) [71] have been proposed. The main differences between them are the way of dealing with reverberation, background noise and movement of the sources [20]. The following three factors influence the quality of the DOA estimation. The spectral content of the speech segment is considered as the first one which is used for derivation of the DOAs. The reverberation level of the room is the second one which causes outlier in the measurements because of the reflections from the objects and walls. The positions of the microphone array to the speakers and the number of simultaneous sources in the field are

Speaker tracking is a fundamental part of multimedia applications which plays a critical role to determine the speaker trajectories and analyse the behaviour of speakers. Speaker tracking can

Audio-only information based approaches for speaker tracking have been presented in [19, 20, 37, 72–74]. An audio-based fusion scheme was proposed in Ref. [20] to detect multiple speakers where the locations from multiple microphone arrays are estimated and fused to determine the state of the same speaker. Separate KFs are employed for all the individual microphone arrays for the location estimation. To deal with motion of the speaker and measurement uncertainty, the probabilistic data association technique is used with an interacting

One issue in Ref. [20] is that it cannot deal with the tracking problem for a time-varying number of speakers. Ma et al. [37, 72] proposed an approach based on random finite set to track an unknown and time-varying number of speakers. The RFS theory and SMC implementation are used to develop the Bayesian RFS filter, which tracks the time-varying number of speakers and their states. The random finite set theory can deal with a time-varying number of speakers; however, the maximum number of speakers that can be handled is limited as its computational complexity increases exponentially with the number of speakers. In that sense, a cardinalized PHD (CPHD) filter is proposed in Ref. [74], which is the first-order approximation of the RFS, to reduce the computational cost caused by the number of speakers. The positions of the speakers are estimated using TDOA measurements from microphone pairs

A time-frequency method and the PHD filter are used in Ref. [73] to localize and track simultaneous speakers. The location of multiple speakers is estimated based on the timefrequency method, which uses an array of three microphones, then the PHD filter is employed

Speaker tracking with multi-modal information has also gained attention, and many approaches have been proposed in the past decade using audio-visual information [2, 6, 23, 29, 75–81],

to the localization results as post-processing to handle miss-detection and clutters.

be accomplished with the use of audio-only, visual-only or audio-visual information.

considered the third factor.

50 Motion Tracking and Gesture Recognition

model.

3. Audio-visual speaker tracking

by asynchronous sensor fusion with the CPHD filter.

Audio-visual measurements are fused by graphical models in Ref. [23] to track a moving speaker in a cluttered and noisy environment. Audio and video observations are used jointly by computing their mutual dependencies. The model parameters are learnt using the expectation-maximization algorithm from a sequence of audio-visual data.

A hierarchical Kalman filter structure was proposed in Refs. [2, 80] to track people in a threedimensional space using multiple microphones and cameras. Two independent local Kalman filters are employed for audio and video streams, and then the outputs of these two local filters are combined under one global Kalman filter.

Unlike [2, 80], particle filters are used in Ref. [81] to estimate the predictions from audio- and video-based measurements and audio-visual information fusion is performed at the feature level. In other words, the independent particle coordinates from the features of both modalities are fused for speaker tracking. These works [2, 23, 80, 81] have focused on the single-speaker case which cannot directly address the tracking problem for multiple speakers.

Two multi-modal systems are introduced in Ref. [75] for the tracking of multiple persons. A joint probabilistic data association filter is employed to detect speech and determine active speaker positions. Two systems are performed for visual features where a particle filter is applied first using foreground, colour, upper body detection and person region cues from multiple camera images and the latter is a blob tracker using only a wide-angle overhead view. Then, acoustic and visual tracks are integrated using a finite state machine. Unlike [75], a particle filtering framework is proposed in Ref. [29, 77] which incorporates the audio and visual detections into the particle filtering framework using an observation model. It has the capability to track multiple people jointly with their speaking activity based on a mixed-state dynamic graphical model defined on a multi-person state space. Another particle filter based multi-modal fusion approach is proposed in Ref. [78] where a single speaker can be identified in the presence of multiple visual observations. Gaussian mixtures model was adopted to fuse multiple observations and modalities. Compared to [29, 75, 77, 78], particle filtering framework is not used in Ref. [6]; instead, hidden Markov model based iterating decoding scheme is used to fuse audio and visual cues for localization and tracking of persons.

In Refs. [14, 76, 79], the Bayesian framework is used to handle the tracking problem for a varying number of speakers. The particle filter is used in Ref. [76], and observation likelihoods based on both audio and video measurements are formulated to use in the estimation of the weights of the particles, and then the number of people is calculated using the weights of these particles. The RFS theory based on multi-Bernoulli approximations is employed in Ref. [79] to integrate audio and visual cues with sequential Monte Carlo implementation. The nature of the random finite set formulation allows their framework to deal with the tracking problem for a varying number of targets. Sequential Monte Carlo implementation (or particle filter) of PHD filter is used in Ref. [14] where audio and visual modalities are fused in the steps of particle filter rather than using any data fusion algorithms. Their work substantially differs from existing works in AV multi-speaker tracking with respect to the capabilities for dealing with multiple speakers, simultaneous speakers, and unknown and time-varying number of speakers.

#### 4. Tracking algorithms

In this section, a brief review of tracking algorithms is presented which covers the following topics: Bayesian statistical methods, visual and audio-visual algorithms and non-linear filtering approaches.

Recall that in Section 1, tracking methods are either stochastic and model-driven or deterministic and data-driven [25].

The stochastic approaches are based on the Bayesian framework which uses a state-space approach [82]. Representative methods in this category are the Kalman filter (KF) [30], extended Kalman filter (EKF) [83, 84] and particle filter (PF) [11]. The PF approach is more robust as compared to the KF and EKF approaches as it can approach the Bayesian optimal estimate with a sufficiently large number of particles [11]. It has been widely applied to speaker tracking problems [29, 76, 81]. The PF is used to fuse object shapes and audio information in Refs. [29, 81]. In Ref. [76], independent audio and video observation models are fused for simultaneous tracking and detection of multiple speakers. However, one challenge in PF is to choose an appropriate number of particles. While an insufficient number may lead to particle impoverishment (i.e. loss of diversity among the particles), a larger number (than required) will induce additional computational cost. Therefore, the performance of the tracker depends on the number of particles that are estimated as an optimal value.

The PHD filter [85] is another stochastic method based on the finite-set statistics (FISST) theory, which propagates the first-order moment of a dynamic point process. The PHD filter is used in many application areas after its proposal and some applications with speaker tracking are reported in Refs. [37, 73]. It has an advantage over other Bayesian approaches such as Kalman and PF filters, in that the number of targets does not need to be known in advance since it is estimated in each iteration. The issue in the PHD filter is that it is prone to estimation error in the number of speakers in the case of low signal-to-noise ratio [36]. The reason is that the PHD filter restricts the propagation of multi-target posterior to the first-order distribution moment, resulting in loss of information for higher order cardinality. To address this issue, the cardinality distribution is also propagated with PHD distribution in the cardinalized PHD (CPHD) filter which improves the estimation of the target number [36, 86] and state of the speakers [74]. However, additional distribution for cardinality requires extra computational load, which makes the CPHD computationally more expensive than the PHD filter. Moreover, the spawning of new targets is not modelled explicitly in the CPHD filter.

As a deterministic and data-driven method, the mean-shift [28] uses representation of the target for localization, which is based on minimizing an appropriate cost function. In that sense, a similarity function is defined in Ref. [32] to reduce the state estimation problem to a search in the region of interest. To obtain fast localization, a gradient optimization method is performed. The mean-shift works under the assumption that the representation of the target is sufficiently distinct from the background which may not be always true. Although the meanshift is an efficient and robust approach, in occlusion and rapid motion scenarios [87, 88], it may fail when the target is out of the region of interest, in other words, the search area.

multi-speaker tracking with respect to the capabilities for dealing with multiple speakers, simul-

In this section, a brief review of tracking algorithms is presented which covers the following topics: Bayesian statistical methods, visual and audio-visual algorithms and non-linear filter-

Recall that in Section 1, tracking methods are either stochastic and model-driven or determin-

The stochastic approaches are based on the Bayesian framework which uses a state-space approach [82]. Representative methods in this category are the Kalman filter (KF) [30], extended Kalman filter (EKF) [83, 84] and particle filter (PF) [11]. The PF approach is more robust as compared to the KF and EKF approaches as it can approach the Bayesian optimal estimate with a sufficiently large number of particles [11]. It has been widely applied to speaker tracking problems [29, 76, 81]. The PF is used to fuse object shapes and audio information in Refs. [29, 81]. In Ref. [76], independent audio and video observation models are fused for simultaneous tracking and detection of multiple speakers. However, one challenge in PF is to choose an appropriate number of particles. While an insufficient number may lead to particle impoverishment (i.e. loss of diversity among the particles), a larger number (than required) will induce additional computational cost. Therefore, the performance of the tracker

The PHD filter [85] is another stochastic method based on the finite-set statistics (FISST) theory, which propagates the first-order moment of a dynamic point process. The PHD filter is used in many application areas after its proposal and some applications with speaker tracking are reported in Refs. [37, 73]. It has an advantage over other Bayesian approaches such as Kalman and PF filters, in that the number of targets does not need to be known in advance since it is estimated in each iteration. The issue in the PHD filter is that it is prone to estimation error in the number of speakers in the case of low signal-to-noise ratio [36]. The reason is that the PHD filter restricts the propagation of multi-target posterior to the first-order distribution moment, resulting in loss of information for higher order cardinality. To address this issue, the cardinality distribution is also propagated with PHD distribution in the cardinalized PHD (CPHD) filter which improves the estimation of the target number [36, 86] and state of the speakers [74]. However, additional distribution for cardinality requires extra computational load, which makes the CPHD computationally more expensive than the PHD filter. Moreover, the

As a deterministic and data-driven method, the mean-shift [28] uses representation of the target for localization, which is based on minimizing an appropriate cost function. In that sense, a similarity function is defined in Ref. [32] to reduce the state estimation problem to a search in the region of interest. To obtain fast localization, a gradient optimization method is

depends on the number of particles that are estimated as an optimal value.

spawning of new targets is not modelled explicitly in the CPHD filter.

taneous speakers, and unknown and time-varying number of speakers.

4. Tracking algorithms

52 Motion Tracking and Gesture Recognition

istic and data-driven [25].

ing approaches.

Many approaches have been proposed in the literature to address these problems in mean-shift tracking, which can be categorized into two groups. One group [87, 89–91] improves the meanshift tracking by, for example, introducing adaptive estimation of the search area, iteration number and bin number. In the other group, the mean-shift algorithm is combined with other methods such as particle filter [46–49]. The stochastic and deterministic approaches are integrated under the same framework in many studies. Particle filtering (stochastic) is integrated with a variation approach (deterministic) in Ref. [25] where the 'switching search' algorithm is run for all the particles. In this algorithm, the momentum of the particles is compared with a pre-determined threshold value, and if it is smaller than the threshold, the deterministic search is run; otherwise, the particles are propagated in terms of a stochastic motion model.

The particle filtering and mean-shift are combined in Ref. [48] under the name of mean-shift embedded particle filter (MSEPF). It is inspired by Sullivan and Rittscher [25], but the mean shift is used as a variational method. It is aimed to integrate the advantages of the particle filtering and mean-shift method. The MSEPF has a capability to track the target with a small number of particles as the mean-shift search concentrates on the particles around local modes (maxima) of the observation. To deal with the possible changes in illuminations, a skin colour model is used and updated for every frame. As an observation model, colour and motion cues are employed. To use a multi-cue observation model, the mean-shift analysis is modified and applied to all the particles. Resampling (selective resampling) is, then, applied when the effective sample size is too small. The mean-shift and particle filtering methods are used independently in Ref. [46]. The estimated positions of the target obtained by these two methods are compared using the Bhattacharyya distance at every iteration and the best value is chosen as the estimated position of the target to avoid the algorithm from being trapped to a local maximum, and thus finding the true maximum beyond the local one.

A hybrid particle with a mean-shift tracker is proposed in Ref. [92] which works in a similar manner to that in Ref. [48]. Alternatively, [92] uses the original application of the mean-shift and performs the mean-shift process on all the particles to reach the local maxima. Moreover, an adaptive motion model is used to deal with manoeuvring targets, which have a high speed of movement. The kernel particle filter is proposed in Ref. [93] where small perturbations are added to the states of the particles after the mean-shift iteration to prevent the gradient ascent from being stopped too early in the density. Kernel radius is calculated adaptively every iteration and this method is applied to multiple target tracking using multiple hypotheses which are then evaluated and assigned to possible targets. An adaptive mean-shift tracking with auxiliary particles is proposed in Ref. [49]. As long as the conditions are met, such as the target remaining in the region of interest, and there are no serious distractions, the mean-shift is used as the main tracker. When sudden motions or distractions are detected by the motion estimator, auxiliary particles are introduced to support the mean-shift tracker. As the mean shift may diverge from the target and converge on the background, background/foreground feature selection is applied to minimize the tracking error. Even though this study is inspired by Sullivan and Rittscher [25], where the main tracker is a particle filter, in Ref. [49], the main tracker is the mean-shift. In addition, the switched trackers are used to handle sudden movements, occlusion and distractions. Moreover, to maintain tracking even when the target appearance is affected by illumination or view point, the target model is updated online.

In the literature, several frameworks have been proposed to combine the mean-shift and particle filters. However, it is still required to have an explicitly designed framework for a variable number of targets. Both the mean-shift and particle filter were derived for tracking only a single target. To address this issue, the PHD filter is found as a promising solution as it is originally designed for multi-target tracking. However, the PHD filter does not have closedform solutions as the recursion of the PHD filter includes multi-dimensional integrals. To derive analytical solution of the PHD filter, the particle filter or sequential Monte Carlo (SMC) implementation [44] is introduced which leads to SMC-PHD filtering. In Ref. [14], the meanshift is integrated with standard SMC-PHD filtering, aiming at improving computational efficiency and estimation accuracy of the tracker for a variable number of targets.

Besides the tracking methods explained so far, speaker tracking with multi-modal usage introduces a problem which is known as data association. Each measurement coming from multi-modality needs to be associated with an appropriate target. Data association methods are divided into two classes [94]. Unique neighbour is the first data association, and a representative method in this class is multiple hypothesis tracking (MHT). Here, each existing track is associated with one of the measurements. All-neighbours data association belongs to the second class which uses all the measurements for updating the entire track estimate, for example, the joint probabilistic data association (JPDA). In MHT, the association between a target state and the measurements is maintained by multiple hypotheses. However, the required number of hypotheses increases exponentially over time [95]. In JPDA, separate Gaussian distributions for each target [96] are used to approximate the posterior target distribution which results in an extra computational cost. Data association algorithms in targettracking applications with Bayesian methods and the PHD filter can be found in [20, 97–100]. However, it is found that classical data association algorithms are computationally expensive which lead to the fusion of multi-modal measurements inside the proposed framework [8, 9, 29, 73, 80, 81, 83]. As in Refs. [8, 9], audio and visual modalities are fused in the steps of the visual particle filter.

Among the methods explained above, the PF, RFS, PHD filter and mean-shift are the main methods discussed throughout this chapter and the main concepts of the methods are presented below.

#### 4.1. Particle filtering

The PF became widely used tools in tracking after being proposed by Isard et al. [31] due to its ability to handle non-linear and non-Gaussian problems. The main idea of the PF is to represent a posterior density by a set of random particles with associated weights, and then compute estimates based on these samples and weights [101]. The principle of the particle filter is illustrated in Figure 1. Ten particles are initialized with equal weights in the first step.

Figure 1. Steps of the particle filter. The first step is particle initialization with equal weights. The particles are weighted in the second step. After a resampling step is performed in the third step, the particles are distributed to predict the next state in the fourth step. This figure is adapted from Ref. [102].

In the second step, the particles are weighted based on given measurements, and as a result, some particles require small weights while others require larger weights represented by the size of the particles. The state distribution is represented by these weighted particles. Then, a resampling step is performed which selects the particles with large weights to generate a set of new particles with equal weights in the third step. In step four, these new particles are distributed again to predict the next state. This loop continues from steps two through four until all the observations are exhausted.

Although there are various extensions of the PF in the literature, the basic concept is the same and based on the idea of representing the posterior distribution by a set of particles.

#### 4.2. Random finite set and PHD filtering

feature selection is applied to minimize the tracking error. Even though this study is inspired by Sullivan and Rittscher [25], where the main tracker is a particle filter, in Ref. [49], the main tracker is the mean-shift. In addition, the switched trackers are used to handle sudden movements, occlusion and distractions. Moreover, to maintain tracking even when the target appearance is affected by illumination or view point, the target model is updated online.

In the literature, several frameworks have been proposed to combine the mean-shift and particle filters. However, it is still required to have an explicitly designed framework for a variable number of targets. Both the mean-shift and particle filter were derived for tracking only a single target. To address this issue, the PHD filter is found as a promising solution as it is originally designed for multi-target tracking. However, the PHD filter does not have closedform solutions as the recursion of the PHD filter includes multi-dimensional integrals. To derive analytical solution of the PHD filter, the particle filter or sequential Monte Carlo (SMC) implementation [44] is introduced which leads to SMC-PHD filtering. In Ref. [14], the meanshift is integrated with standard SMC-PHD filtering, aiming at improving computational

Besides the tracking methods explained so far, speaker tracking with multi-modal usage introduces a problem which is known as data association. Each measurement coming from multi-modality needs to be associated with an appropriate target. Data association methods are divided into two classes [94]. Unique neighbour is the first data association, and a representative method in this class is multiple hypothesis tracking (MHT). Here, each existing track is associated with one of the measurements. All-neighbours data association belongs to the second class which uses all the measurements for updating the entire track estimate, for example, the joint probabilistic data association (JPDA). In MHT, the association between a target state and the measurements is maintained by multiple hypotheses. However, the required number of hypotheses increases exponentially over time [95]. In JPDA, separate Gaussian distributions for each target [96] are used to approximate the posterior target distribution which results in an extra computational cost. Data association algorithms in targettracking applications with Bayesian methods and the PHD filter can be found in [20, 97–100]. However, it is found that classical data association algorithms are computationally expensive which lead to the fusion of multi-modal measurements inside the proposed framework [8, 9, 29, 73, 80, 81, 83]. As in Refs. [8, 9], audio and visual modalities are fused in the steps of

Among the methods explained above, the PF, RFS, PHD filter and mean-shift are the main methods discussed throughout this chapter and the main concepts of the methods are

The PF became widely used tools in tracking after being proposed by Isard et al. [31] due to its ability to handle non-linear and non-Gaussian problems. The main idea of the PF is to represent a posterior density by a set of random particles with associated weights, and then compute estimates based on these samples and weights [101]. The principle of the particle filter is illustrated in Figure 1. Ten particles are initialized with equal weights in the first step.

efficiency and estimation accuracy of the tracker for a variable number of targets.

the visual particle filter.

54 Motion Tracking and Gesture Recognition

presented below.

4.1. Particle filtering

The generic PF is designed for single-target tracking. Multi-target tracking is more complicated than single-target tracking as it is necessary to jointly estimate the number of targets and the state of the targets. One multi-target tracking scenario is illustrated in Figure 2a, where five targets exist in state space (bottom plane) given at the previous time with eight measurements in observation space (upper plane). In this scenario, the number of measurements is larger than the number of targets due to clutter or noise. When the targets are passed to the current time, the number of targets becomes three and two targets no longer exist.

In that sense, the variable number of targets and noisy measurements need to be handled for reliable tracking in multi-target case. The RFS approach [36] is an elegant solution to address this issue. The basic idea behind the RFS approach is to treat the collection of targets as a setvalued state called the multi-target state and the collection of measurements as a set-valued observation, called multi-observation. So, the problem of estimating multiple targets in the presence of clutter and uncertainty is handled by modelling these set-valued entities as

Figure 2. An illustration of the RFS theory in a multi-target tracking application. One possible multi-target tracking scenario is given in (a), and (b) represents the RFS approach to multi-target tracking. The figures are adapted from Ref. [103].

random finite sets [41]. The point here is to generalize the tracking problem from single target to multiple targets.

Figure 2b illustrates the RFS approach where all the targets are collected in one target set and all the measurements are considered as one measurement set. The RFS propagates the full multi-target posterior for multi-target filtering. The state model of the RFS incorporates individual target dynamics which are target birth, target spawn and target death. In addition, the observation model of the RFS incorporates the measurement likelihood as target detection uncertainty (miss-detection) and clutter (false alarm). These incorporations are implemented by assigning hypotheses, and all possible associations between hypotheses and measurement/ targets need to be repeated at every time step, resulting in increased computational cost in the case of a high number of targets and measurements.

To alleviate the computational cost, the PHD filter is introduced which is a computationally cheaper alternative to the RFS. The PHD filter is the first-order approximation of the RFS and propagates only the first-order moments instead of the full multi-target posterior [44, 104]. The PHD filter function is denoted as the intensity vðxÞ whose integral on any region of the state space gives the expected number of targets. The peaks of the PHD function point the highest local concentration of the expected number of targets, which can be used to provide estimates of individual targets [36]. The PHD filter is illustrated in Figure 3 by a simple example [36] which corresponds to Eq. (1)

Figure 3. A simple example for the PHD filter. This figure is adapted from Ref. [36].

$$w(\mathbf{x}) = N\_{o^2}(\mathbf{x} - a) + N\_{o^2}(\mathbf{x} - b) = \frac{1}{\sqrt{2\pi\sigma}} \left[ \exp\left(-\frac{(\mathbf{x} - a)^2}{2\sigma^2}\right) + \exp\left(-\frac{(\mathbf{x} - b)^2}{2\sigma^2}\right) \right] \tag{1}$$

Figure 3 is plotted for Eq. (1) with σ ¼ 1, a ¼ 1 and b ¼ 4. The peaks of vðxÞ is near the target locations x ¼ 1 and x ¼ 4.

The integral of vðxÞ computes the actual number of targets Ξ:

$$\Xi = \int \mathbf{v}(\mathbf{x})d\mathbf{x} = \int \mathcal{N}\_{(\boldsymbol{\sigma})^2}(\mathbf{x} - \mathbf{a})d\mathbf{x} + \int \mathcal{N}\_{(\boldsymbol{\sigma})^2}(\mathbf{x} - \mathbf{b})d\mathbf{x} = 1 + 1 = 2\tag{2}$$

#### 4.3. Mean-shift tracking

random finite sets [41]. The point here is to generalize the tracking problem from single target

(b)

Figure 2. An illustration of the RFS theory in a multi-target tracking application. One possible multi-target tracking scenario is given in (a), and (b) represents the RFS approach to multi-target tracking. The figures are adapted from Ref.

(a)

Figure 2b illustrates the RFS approach where all the targets are collected in one target set and all the measurements are considered as one measurement set. The RFS propagates the full multi-target posterior for multi-target filtering. The state model of the RFS incorporates individual target dynamics which are target birth, target spawn and target death. In addition, the observation model of the RFS incorporates the measurement likelihood as target detection uncertainty (miss-detection) and clutter (false alarm). These incorporations are implemented by assigning hypotheses, and all possible associations between hypotheses and measurement/ targets need to be repeated at every time step, resulting in increased computational cost in the

To alleviate the computational cost, the PHD filter is introduced which is a computationally cheaper alternative to the RFS. The PHD filter is the first-order approximation of the RFS and propagates only the first-order moments instead of the full multi-target posterior [44, 104]. The PHD filter function is denoted as the intensity vðxÞ whose integral on any region of the state space gives the expected number of targets. The peaks of the PHD function point the highest local concentration of the expected number of targets, which can be used to provide estimates of individual targets [36]. The PHD filter is illustrated in Figure 3 by a simple example [36]

to multiple targets.

56 Motion Tracking and Gesture Recognition

[103].

which corresponds to Eq. (1)

case of a high number of targets and measurements.

Different from stochastic approaches such as the PF, RFS and PHD filter, the mean-shift is a deterministic method [28]. The mean-shift can be defined as a simple iterative procedure that shifts each data point to the average of data points in its neighbourhood [105].

Common application areas are clustering [106], mode seeking [107], image segmentation [108] and tracking [109]. Simple implementation of the mean-shift method is illustrated in Figure 4 where the purpose is to find the densest region of the distributed balls. The first step is to select an initial point with the region of interest as shown in Figure 4a where the circle indicates the region of interest centred on the initial point. In Figure 4b, the centre of the mass is calculated using the balls inside the region of interest. To get the distance and direction for shifting the initial point, the mean-shift vector is calculated in Figure 4c. The initial point is shifted to a new point together with the region of interest in Figure 4d. The centre of the mass is calculated again using the balls inside the region of interest which leads to new mass point in Figure 4e. The mean-shift vector is calculated to obtain the direction and distance for shifting and the region of interest is shifted to a new point as illustrated in Figures 4f and g, respectively. This iteration continues until the mean-shift method reaches the densest point in Figure 4h.

Figure 4. Simple descriptions of the mean-shift process. These figures are adapted from Ref. [110].

#### 5. Relevant datasets

(a) (b)

58 Motion Tracking and Gesture Recognition

(c) (d)

(e) (f)

(g) (h)

Figure 4. Simple descriptions of the mean-shift process. These figures are adapted from Ref. [110].

In order to perform a quantitative evaluation of the audio-visual tracker, both audio and video sequences are required. In that sense, several datasets are presented in the literature that combine multiple audio and video sources for tracking.

The augmented multi-party interaction (AMI) [111] corpus includes 100 h of meetings, which were recorded in English using three different rooms. Natural conversations are included in some of the meetings, and many others, in particular those using a scenario in which the participants play different roles in a design team, are also reasonably natural. The number of speakers in the natural conversations varies from three to five. In one artificial meeting, four speakers are involved, taking four pre-arranged roles (as industrial designer, interface designer, marketing and project manager). Other artificial meetings also appear in the AMI corpus, such as a film club scenario. Generally, the speakers are mostly static or with small movements. In addition, calibration information is not available which is required for 3D tracking as it is needed to project the coordinates from the two-dimensional (2D) image into 3D space.

CLEAR (CLassification of Events, Activities and Relationships) is the next dataset created for people identification, activities, human-human interaction and relevant scenarios [112]. Recordings are captured with multiple users in realistic meeting rooms equipped with a multitude of audio-visual sensors. The rooms have five calibrated cameras, and four of them are mounted to the corners of the room while the last panoramic camera is mounted to the ceiling of the room. All cameras are synchronized with the audio streams collected by the linear microphone array placed on the walls. In most scenarios, the speakers are generally still and seated around the table. They speak one by one.

Another dataset is SPEVI (Surveillance performance evaluation initiative) [113] created for single- and multi-modal people detection and tracking. Sequences are captured by a video camera and two linear microphone arrays. The SPEVI dataset has three sequences. The sequences motinas\_Room160 and motinas\_Room105 are captured in rooms with reverberation. The sequence motinas\_Chamber is captured in a reduced reverberation room. In this dataset, audio signals were recorded with linear microphone arrays and the calibration information is not available.

One of the most challenging datasets that can be used for the evaluation of audio-visual tracking algorithm is AV16.3 corpus which is developed by the IDIAP research institute [114]. The corpus AV16.3 involves various scenarios where subjects are moving and speaking at the same time whilst being recorded by three calibrated video cameras and two circular eightelement microphone arrays.

Recordings in the AV16.3 involve challenging scenarios such as object initialization, partial and total occlusion, overlapped speech, illumination change, close and far locations, variable number of objects, and small and large angular separations. Circular microphone arrays were used to record the audio signals at 16 kHz and video sequences were captured at 25 Hz. The recordings of audio and video were performed independently from each other. Each video frame is a colour image of 288 · 360 pixels and some sequences are annotated to get the ground truth (GT) speaker position which allows one to measure the accuracy of each tracker and to compare the performance of the algorithms. In addition, it provides calibration information of the cameras and challenging scenarios like occlusions and moving speakers.

The most recently released dataset is 'S3A speaker tracking with Kinect2' [115, 116] which uses a Kinect for Windows V2.0 for recording the visual data and dummy head for recording the audio data. It contains four sequences in a studio where people are talking and walking slowly around a dummy head which is located at the centre of the room. Different from other cameras, Kinect sensor provides in-depth information besides the colour which helps to extract the 3D position of the speaker without using additional view of the scene. In addition, annotated data are provided which can be used as ground truth data to estimate the performance of the tracker.

#### 6. Performance metrics

Several metrics have been proposed to evaluate the performance of tracking methods in the literature. In this section, four metrics are introduced.

The first one is the mean absolute error (MAE), which is computed as the Euclidean distance in pixels between the estimated and the ground truth positions, and then divided by the number of frames. This metric offers simplicity and explicit output for the performance comparison.

The multiple object tracking (MOT) metric is the next metric which was proposed in Ref. [117]. It is defined with MOT precision (MOTP) and MOT accuracy (MOTA) quantities. The precision is measured with the MOTP using a pre-defined threshold value

$$MOTP = \frac{\sum\_{i,k} d\_k^i}{\sum\_k c\_k} \tag{3}$$

where di <sup>k</sup> is the distance between the ith object and its corresponding hypothesis and ck is the number of matches between the objects and hypotheses for time frame k.

Tracking errors are measured with the MOTA which covers the false positives, false negatives and mismatches. If the error is greater than the threshold value, it is assumed that the false positive and false negative count if the speaker is not tracked with the accuracy measured by the threshold. Mismatches are the case where the speaker identity is switched [117]

$$MOTA = 1 - \frac{\sum\_{k} (m\_k + fp\_k + mm\_k)}{\sum\_{k} \mathbf{g}\_k} \tag{4}$$

where mk, f pk, mmk and gk define the number of misses (false negatives), false positives, mismatches and objects present, respectively, for the time frame k.

The next metric is the trajectory-based measures (TBMs) proposed in Refs. [118, 119], where the performance is measured based on trajectory. It categorizes the trajectories as mostly tracked (MT), mostly lost (ML) and partially tracked (PT). MT is defined as if the tracker follows at least 80% of its ground truth (GT) trajectory. If the tracker follows less than 20% of its GT, it is called ML. If the followed trajectory is between 20 and 80% of the GT trajectory, it is called PT. Also, track fragmentation (Frag) is defined as the total number of times that GT is interrupted. Identity switches (IDs) are computed by calculating change in GT identity.

OSPA-T (Optimal Subpattern Assignment for Tracks) [120] is the last performance metric designed for the evaluation of multi-speaker tracking systems. It is an improved version of the OSPA metric [121] by extending it for tracking management evaluation. To transfer the cardinality error into the state error, a penalty value is used in the OSPA. So its performance evaluation includes both source number estimation and speaker position estimation:

$$\mathfrak{e}\_{\text{OSPA}}(\hat{\mathcal{X}}\_{k}, \mathcal{X}\_{k}) = \min\_{\pi \in \Pi\_{\hat{\Xi}\_{k}, \Sigma\_{k}}} \sqrt{\frac{1}{\Xi\_{k}} \left( \sum\_{i=1}^{\hat{\Xi}\_{k}} \overline{d}^{(c)}(\hat{\mathbf{x}}\_{i,k}, \mathbf{x}\_{\pi\_{i},k})^{a} + c^{a}(\Xi\_{k} - \hat{\Xi}\_{k}) \right)} \tag{5}$$

where <sup>X</sup>^ <sup>k</sup> ¼ fx^1, k,…, <sup>x</sup>^Ξ^ <sup>k</sup> , <sup>k</sup><sup>g</sup> is an estimation of the ground-truth state set <sup>X</sup><sup>k</sup> ¼ fx1, k, …, <sup>x</sup>Ξk, <sup>k</sup><sup>g</sup> and ΠΞ^ <sup>k</sup> ,Ξ<sup>k</sup> is the set of maps <sup>π</sup> : <sup>1</sup>, …,Ξ^<sup>k</sup> ! <sup>1</sup>, …,Ξk. The state cardinality estimation <sup>Ξ</sup>^<sup>k</sup> may not be the same as the ground truth Ξk. The OSPA error defined in Eq. (5) is for Ξ^<sup>k</sup> ≤Ξk. If <sup>Ξ</sup><sup>k</sup> <sup>&</sup>lt; <sup>Ξ</sup>^k, then <sup>e</sup>OSPAðX^ k, <sup>X</sup>kÞ ¼ <sup>e</sup>OSPAðXk, <sup>X</sup>^ <sup>k</sup>Þ. The function <sup>d</sup> ðcÞ is denoted as min ðc, dð�ÞÞ. Here, c is defined as the cut-off value in order to weight the penalties for cardinality and localization errors. Additionally, the metric order is defined by a which determines the sensitivity to outliers. The OSPA-T metric differs from other metrics since it considers not only the position estimation of the speaker but also the estimation of the number of speakers in the evaluation of the tracking results. As OSPA-T measures the error based on these two terms, state (position estimation) and cardinality (number of speaker estimation), it causes ambiguities about how much error is contributed from each term to the final error. In addition to the x<sup>1</sup> and x<sup>2</sup> variables of the state vector, the scale variable, s, may be considered in the evaluation. However, this will cause more ambiguities in the contributions of the terms to the final error and deteriorate the reliability of the metric.

As a summary, four metrics are introduced which evaluate the methods from their own perspectives. To see how well the tracker follows its trajectory, the TBM can be used to measure its performance. If the tracking error needs to be estimated, the MAE or the more advanced option MOT can be used to see how accurately the tracker follows the target. If an unknown and variable number of targets need to be tracked, then the OSPA-T metric is more suitable than the others as it considers both position estimation and the estimated number of targets in the performance evaluation.

#### 7. Experimental results and analysis

frame is a colour image of 288 · 360 pixels and some sequences are annotated to get the ground truth (GT) speaker position which allows one to measure the accuracy of each tracker and to compare the performance of the algorithms. In addition, it provides calibration infor-

The most recently released dataset is 'S3A speaker tracking with Kinect2' [115, 116] which uses a Kinect for Windows V2.0 for recording the visual data and dummy head for recording the audio data. It contains four sequences in a studio where people are talking and walking slowly around a dummy head which is located at the centre of the room. Different from other cameras, Kinect sensor provides in-depth information besides the colour which helps to extract the 3D position of the speaker without using additional view of the scene. In addition, annotated data are provided

Several metrics have been proposed to evaluate the performance of tracking methods in the

The first one is the mean absolute error (MAE), which is computed as the Euclidean distance in pixels between the estimated and the ground truth positions, and then divided by the number of frames. This metric offers simplicity and explicit output for the performance comparison.

The multiple object tracking (MOT) metric is the next metric which was proposed in Ref. [117]. It is defined with MOT precision (MOTP) and MOT accuracy (MOTA) quantities. The precision

> X i, k di

X

<sup>k</sup> is the distance between the ith object and its corresponding hypothesis and ck is the

Tracking errors are measured with the MOTA which covers the false positives, false negatives and mismatches. If the error is greater than the threshold value, it is assumed that the false positive and false negative count if the speaker is not tracked with the accuracy measured by

k ck

k

ðmk þ f pk þ mmkÞ X k gk

ð3Þ

ð4Þ

MOTP ¼

the threshold. Mismatches are the case where the speaker identity is switched [117]

X k

where mk, f pk, mmk and gk define the number of misses (false negatives), false positives,

The next metric is the trajectory-based measures (TBMs) proposed in Refs. [118, 119], where the performance is measured based on trajectory. It categorizes the trajectories as mostly

mation of the cameras and challenging scenarios like occlusions and moving speakers.

which can be used as ground truth data to estimate the performance of the tracker.

6. Performance metrics

60 Motion Tracking and Gesture Recognition

where di

literature. In this section, four metrics are introduced.

is measured with the MOTP using a pre-defined threshold value

number of matches between the objects and hypotheses for time frame k.

MOTA ¼ 1 �

mismatches and objects present, respectively, for the time frame k.

In this chapter, six trackers are included to cover the recent paradigms. The trackers are restricted to the ones either for which access to the source code has been permitted or tracker performance has been reported on commonly used datasets.

To deal with the tracking problem for unknown and time-varying number of speakers, Kılıç et al. [14] propose to use particle PHD (SMC-PHD) filter. DOA information is employed as an audio cue and it is integrated with video data under SMC-PHD filter framework. Audio data are used to determine when to propagate and re-allocate surviving, spawned and born particles based on their types. The particles are concentrated around the DOA line, which is drawn from the centre of the microphone array to the estimated speaker position by audio information.

As a baseline algorithm, the visual SMC-PHD (V-SMC-PHD) filter, which uses colour information as a visual cue, is compared with the audio-visual SMC-PHD (AV-SMC-PHD) to see the advantage of using multi-modal information in challenging tracking scenarios like occlusion. Sequence 24 from AV16.3 dataset is run for V-SMC-PHD and AV-SMC-PHD, and tracking results are given in Figure 5.

The first row shows the results of V-SMC-PHD filter which fails to track after occlusion. Also, it shows poor performance before the occlusion in terms of the detection of the speakers. It is reported in Ref. [14] that the AV-SMC-PHD filter tracks the speakers more accurately and shows better performance than the V-SMC-PHD filter in terms of accuracy and ability for redetection of the speakers after lost.

The same experiments are repeated for three-speaker case using Sequence 45 camera #3 from the AV16.3 dataset and the results are given in Figure 6. It is reported in Ref. [14] that AV-SMC-

Figure 5. AV16.3, sequence 24 camera #1: occlusions with two speakers [14]. Performance of the V-SMC-PHD filter is shown in the first row. The second row is given for the AV-SMC-PHD filter.

Figure 6. AV16.3, sequence 45 camera #3: occlusions with three speakers [14]. The tracking results of the V-SMC-PHD and the AV-SMC-PHD filters are shown in the first and second rows, respectively.

PHD filter has better capability in detecting and following all the speakers even after the occlusions.

To deal with the tracking problem for unknown and time-varying number of speakers, Kılıç et al. [14] propose to use particle PHD (SMC-PHD) filter. DOA information is employed as an audio cue and it is integrated with video data under SMC-PHD filter framework. Audio data are used to determine when to propagate and re-allocate surviving, spawned and born particles based on their types. The particles are concentrated around the DOA line, which is drawn from the centre

As a baseline algorithm, the visual SMC-PHD (V-SMC-PHD) filter, which uses colour information as a visual cue, is compared with the audio-visual SMC-PHD (AV-SMC-PHD) to see the advantage of using multi-modal information in challenging tracking scenarios like occlusion. Sequence 24 from AV16.3 dataset is run for V-SMC-PHD and AV-SMC-PHD, and tracking

The first row shows the results of V-SMC-PHD filter which fails to track after occlusion. Also, it shows poor performance before the occlusion in terms of the detection of the speakers. It is reported in Ref. [14] that the AV-SMC-PHD filter tracks the speakers more accurately and shows better performance than the V-SMC-PHD filter in terms of accuracy and ability for re-

The same experiments are repeated for three-speaker case using Sequence 45 camera #3 from the AV16.3 dataset and the results are given in Figure 6. It is reported in Ref. [14] that AV-SMC-

Figure 5. AV16.3, sequence 24 camera #1: occlusions with two speakers [14]. Performance of the V-SMC-PHD filter is

Figure 6. AV16.3, sequence 45 camera #3: occlusions with three speakers [14]. The tracking results of the V-SMC-PHD and

shown in the first row. The second row is given for the AV-SMC-PHD filter.

the AV-SMC-PHD filters are shown in the first and second rows, respectively.

of the microphone array to the estimated speaker position by audio information.

results are given in Figure 5.

62 Motion Tracking and Gesture Recognition

detection of the speakers after lost.

To improve the estimation accuracy of the AV-SMC-PHD filter, [12] integrates the mean-shift method in order to shift the particles to a local maximum of the distribution function which drives particles closer the speaker position. The generic mean-shift algorithm is modified for multiple-speaker case and applied after the audio contribution to the particles, and this algorithm is named as AVMS-SMC-PHD filter.

Even though the integration of the mean-shift improves the estimation accuracy, applying the mean-shift process to all the particles introduces extra computational cost [12]. To address this problem, [12] proposes a sparse sampling scheme which chooses sparse particles and runs the mean-shift method only on those particles rather than all the particles which results in a significant reduction in computational cost. This method is named as sparse-AVMS-SMC-PHD filter. Another tracking algorithm is given in Ref. [122], which uses the merits of dictionary learning for multi-speaker tracking. It is tested using some sequences (seq24, seq25 and seq30) of the AV16.3 dataset.

The results of these five trackers on sequences of AV16.3 are given in Table 1 and the OSPA-T metric is used for comparison. The tracker in Ref. [122] outperforms the V-SMC-PHD; however, the AVMS-SMC-PHD shows better performance than the others.

These tracking results are compared with those of [123] which uses the PHD filter for tracking and reports the results only for seq24 cam1 and cam2 in terms of Wasserstein distance. Table 2 shows the results of six trackers.


Table 1. Comparison results of the tracking algorithms for the AV16.3 dataset using the OSPA-T metric [14].


Table 2. Tracking algorithms are compared in terms of mean Wasserstein distance (in pixel) [14].

Among six trackers, the AVMS-SMC-PHD outperforms the other trackers in terms of the average accuracy.

The trackers of [14] are also tested in different datasets. One sequence from each AMI and CLEAR dataset is used to test the trackers. Figure 7 shows the results of V-SMC-PHD and AV-SMC-PHD for a sequence of the AMI dataset. In this dataset, the speakers talk one by one. Hence, one DOA line is drawn per time instance. Since the speakers remain still, the visual trackers do not fail to track the speakers.

Other sequence is UKA\_20060726 from the CLEAR dataset where the speakers talk one by one and mostly sit around the table. The performance of visual and audio-visual trackers is given in Figure 8.

The average error of the trackers for sequences IS1001a and UKA\_20060726 is given in Table 3 in terms of the OSPA-T metric. It is reported in Ref. [14] that there is no significant difference on the performance of the visual and audio-visual trackers since the speakers talk one by one. The audio-visual tracker runs as a visual tracker for the silent speakers, while it is more

Figure 7. AMI dataset, sequence IS1001a. The first and second rows show the results of the V-SMC-PHD and the AV-SMC-PHD filter, respectively [14].

Figure 8. CLEAR dataset, sequence UKA\_20060726. The first and second rows show the results of the V-SMC-PHD and the AV-SMC-PHD filters, respectively [14].


Table 3. Comparison results of the tracking algorithms for the AMI and CLEAR dataset.

effective for the talking speakers because of the additional information coming from audio modality.

#### 8. Chapter summary

Among six trackers, the AVMS-SMC-PHD outperforms the other trackers in terms of the

V SMC-PHD [14]

cam1 9.02 7.20 16.96 7.94 6.67 7.45 cam2 6.40 4.80 19.17 7.59 5.24 5.73 Average 7.71 6.00 18.06 7.76 5.96 6.59

Table 2. Tracking algorithms are compared in terms of mean Wasserstein distance (in pixel) [14].

AV SMC-PHD [14]

AVMS SMC-PHD

Sparse AVMS SMC-PHD [14]

[14]

The trackers of [14] are also tested in different datasets. One sequence from each AMI and CLEAR dataset is used to test the trackers. Figure 7 shows the results of V-SMC-PHD and AV-SMC-PHD for a sequence of the AMI dataset. In this dataset, the speakers talk one by one. Hence, one DOA line is drawn per time instance. Since the speakers remain still, the visual

Other sequence is UKA\_20060726 from the CLEAR dataset where the speakers talk one by one and mostly sit around the table. The performance of visual and audio-visual trackers is given

The average error of the trackers for sequences IS1001a and UKA\_20060726 is given in Table 3 in terms of the OSPA-T metric. It is reported in Ref. [14] that there is no significant difference on the performance of the visual and audio-visual trackers since the speakers talk one by one. The audio-visual tracker runs as a visual tracker for the silent speakers, while it is more

Figure 7. AMI dataset, sequence IS1001a. The first and second rows show the results of the V-SMC-PHD and the AV-

average accuracy.

seq24 Tracking

algorithm #1 [122]

64 Motion Tracking and Gesture Recognition

Tracking algorithm #2 [123]

in Figure 8.

SMC-PHD filter, respectively [14].

trackers do not fail to track the speakers.

In this chapter, a review of multi-speaker tracking has been provided on modalities, existing tracking techniques, datasets and performance metrics that have been developed over the past few decades.

After a broad survey of the tracking methods, a technical background of the methods such as particle filtering, random finite set, PHD filter and mean-shift, which are commonly used as baseline methods in the literature, is introduced with their basic mathematical, statistical concepts and definitions, which are required for understanding the mathematics and techniques behind the proposed tracking algorithms.

In order to perform a quantitative evaluation of the proposed algorithms, both audio and video sequences are required. Publicly available datasets such as AV16.3, CLEAR, AMI, SPEVI and S3A were introduced with the fundamental differences including physical setup, scenarios and challenges.

Moreover, performance metrics were analysed in order to see which aspects are considered more in the evaluation and impacts of these perspectives on the evaluation results.

#### Acknowledgements

This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) Grant number EP/K014307/2 and the MOD University Defence Research Collaboration in Signal Processing.

#### Author details

Volkan Kılıç 1 \* and Wenwu Wang<sup>2</sup>


#### References


[6] Shivappa ST, Rao BD, Trivedi MM. Audio-visual fusion and tracking with multilevel iterative decoding: Framework and experimental evaluation. IEEE Journal of Selected Topics in Signal Processing. 2010;4(5):882–894

Moreover, performance metrics were analysed in order to see which aspects are considered

This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) Grant number EP/K014307/2 and the MOD University Defence Research Collabora-

[1] Liu Q, et al. Automating camera management for lecture room environments. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM;

[2] Talantzis F, Pnevmatikakis A, Constantinides AG. Audio–visual active speaker tracking in cluttered indoors environments. IEEE Transactions on Systems, Man, and Cybernet-

[3] Wölfel M, McDonough JW. Combining multi-source far distance speech recognition strategies: beamforming, blind channel and confusion network combination. In: INTERSPEECH;

[4] Potamianos G, Neti C, Deligne S. Joint audio-visual speech processing for recognition and enhancement. In: AVSP 2003-International Conference on Audio-Visual Speech

[5] Naphade MR, et al. Probabilistic multimedia objects (multijects): A novel approach to video indexing and retrieval in multimedia systems. In: ICIP 98. Proceedings of the 1998

more in the evaluation and impacts of these perspectives on the evaluation results.

Acknowledgements

66 Motion Tracking and Gesture Recognition

tion in Signal Processing.

1

\* and Wenwu Wang<sup>2</sup>

1 Izmir Katip Celebi University, Izmir, Turkey

2 University of Surrey, Guildford, UK

\*Address all correspondence to: volkan.kilic@ikc.edu.tr

ics, Part B (Cybernetics). 2008;38(3):799–807

International Conference on Image Processing. IEEE; 1998

Author details

Volkan Kılıç

References

2001

2005

Processing; 2003


[38] Kılıç V, et al. Audio-visual tracking of a variable number of speakers with a random finite set approach. In: 2014 17th International Conference on Information Fusion (FUSION). IEEE; 2014

[21] Barnard M, Wang W. Audio head pose estimation using the direct to reverberant speech

[22] Lanz O. Approximate Bayesian multibody tracking. IEEE Transactions on Pattern Anal-

[23] Beal MJ, Jojic N, Attias H. A graphical model for audiovisual object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2003;25(7): 828–836

[24] Walia GS, Kapoor R. Recent advances on multicue object tracking: A survey. Artificial

[25] Sullivan J, Rittscher J. Guiding random particles by deterministic search. In: ICCV 2001. Proceedings of the Eighth IEEE International Conference on Computer Vision. IEEE;

[26] Zhou SK, Chellappa R, Moghaddam B. Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Transactions on Image Processing. 2004;13

[27] Comaniciu D, Meer P. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002;24(5):603–619 [28] Comaniciu D, Ramesh V, Meer P. Kernel-based object tracking. IEEE Transactions on

[29] Gatica-Perez D, et al. Audio-visual speaker tracking with importance particle filters. In: ICIP 2003. Proceedings of the 2003 International Conference on Image Processing. IEEE;

[30] Bar-Shalom Y. Tracking and Data Association. Academic Press Professional, Inc.; 1987 [31] Isard M, Blake A. Condensation—Conditional density propagation for visual tracking.

[32] Kılıç V, et al. Adaptive particle filtering approach to audio-visual tracking. In: Signal Processing Conference (EUSIPCO), 2013 Proceedings of the 21st European. IEEE;

[33] Closas P, Fernández-Prades C. Particle filtering with adaptive number of particles. In:

[34] Fox D. Adapting the sample size in particle filters through KLD-sampling. The Interna-

[36] Mahler RP. Statistical Multisource-Multitarget Information Fusion. Boston, MA, USA:

[37] Vo B.-N, Singh SS, Ma W-K. Tracking multiple speakers using random sets. In: ICASSP

Pattern Analysis and Machine Intelligence. 2003;25(5):564–577

International Journal of Computer Vision. 1998;29(1):5–28

tional Journal of Robotics Research. 2003;22(12):985–1003

[35] Soto A. Self Adaptive Particle Filter. In: IJCAI; 2005

Aerospace Conference. IEEE; 2011

Artech House, Inc.; 2007

ratio. Speech Communication. 2016;85:98–108

Intelligence Review. 2016;46(1):1–39

2001

2003

2013

(2); 2004

(11):1491–1506

68 Motion Tracking and Gesture Recognition

ysis and Machine Intelligence. 2006;28(9):1436–1449


[68] DiBiase JH, Silverman HF, Brandstein MS. Robust localization in reverberant rooms. In: Microphone Arrays. Springer; 2001. pp. 157–180

[53] Raja Y, McKenna SJ, Gong S. Segmentation and tracking using colour mixture models.

[54] Yilmaz A, Li X, Shah M. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. IEEE Transactions on Pattern Analysis and Machine

[55] Wang Y, Lee O. Active mesh-a feature seeking and tracking image sequence representa-

[56] Micilotta AS, Ong E-J, Bowden R. Real-time upper body detection and 3D pose estimation in monoscopic images. In: European Conference on Computer Vision. Springer; 2006 [57] Fergus R, Perona P, Zisserman A. Object class recognition by unsupervised scale-invariant learning. In: Proceedings of the 2003 IEEE Computer Society Conference on Com-

[58] Shotton J, et al. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of

[59] Winn J, Criminisi A, Minka T. Object categorization by learned universal visual dictionary. In: Tenth IEEE International Conference on Computer Vision (ICCV'05). Vol. 1.

[60] Manjunath BS, Ma W-Y. Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1996;18(8):837–842

[61] Yang H, et al. Recent advances and trends in visual tracking: A review. Neurocomp-

[62] Ojala T, Pietikainen M, Maenpaa T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis

[63] Perez P, Vermaak J, Blake A. Data fusion for visual tracking with particles. Proceedings

[64] Lowe DG. Distinctive image features from scale-invariant keypoints. International Jour-

[65] Ke Y, Sukthankar R. PCA-SIFT: A more distinctive representation for local image descriptors. In: CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference

[66] Sigal L, Sclaroff S, Athitsos V. Estimation and prediction of evolving color distributions for skin segmentation under varying illumination. In: Proceedings of the IEEE Confer-

[67] Brandstein MS, Silverman HF. A practical methodology for speech source localization

with microphone arrays. Computer Speech & Language. 1997;11(2):91–126

tion scheme. IEEE Transactions on Image Processing. 1994;3(5):610–624

In: Asian Conference on Computer Vision. Springer; 1998

puter Vision and Pattern Recognition. IEEE; 2003

Intelligence. 2004;26(11):1531–1536

70 Motion Tracking and Gesture Recognition

Computer Vision. 2009;81(1):2–23

uting. 2011;74(18):3823–3831

of the IEEE. 2004;92(3):495–513

and Machine Intelligence. 2002;24(7):971–987

nal of Computer Vision. 2004;60(2):91–110

on Computer Vision and Pattern Recognition. IEEE; 2004

ence on Computer Vision and Pattern Recognition. IEEE; 2000

IEEE; 2005


[100] Wang Y, Jing Z, Hu S. Data association for PHD filter based on MHT. In: 2008 11th International Conference on Information Fusion. IEEE; 2008

[83] Gehrig T, et al. Kalman filters for audio-video source localization. In: IEEE Workshop on

[84] McGee LA, Schmidt SF. Discovery of the Kalman filter as a practical tool for aerospace

[85] Mahler RP. "Statistics 101" for multisensor, multitarget data fusion. IEEE Aerospace and

[86] Mahler R. PHD filters of higher order in target number. IEEE Transactions on Aerospace

[87] Beyan C, Temizel A. Adaptive mean-shift for automated multi object tracking. IET

[88] Leichter I, Lindenbaum M, Rivlin E. Mean shift tracking with multiple reference color histograms. Computer Vision and Image Understanding. 2010;114(3):400–408

[89] Collins RT. Mean-shift blob tracking through scale space. In: Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE; 2003

[90] Li P. An adaptive binning color model for mean shift tracking. IEEE Transactions on

[91] Li Z, Tang Q, Sang, N. Improved mean shift algorithm for occlusion pedestrian tracking.

[92] Maggio E, Cavallaro A. Hybrid particle filter and mean shift tracker with adaptive

[93] Chang C, Ansari R, Khokhar A. Multiple object tracking with kernel particle filter. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition

[94] Blackman S, Popoli R. Design and Analysis of Modern Tracking Systems (Book). Nor-

[95] Panta K, et al. Probability hypothesis density filter versus multiple hypothesis tracking. In: Defense and Security. International Society for Optics and Photonics; 2004

[96] Bar-Shalom Y, Li X-R. Multitarget-Multisensor Tracking: Principles and Techniques.

[97] Chakravorty R, Challa S. Multitarget tracking algorithm-Joint IPDA and Gaussian mixture PHD filter. In: FUSION'09. 12th International Conference on Information Fusion. IEEE; 2009

[98] Jaward M, et al. Multiple object tracking using particle filters. In: 2006 IEEE Aerospace

[99] Kim K, Davis LS. Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering. In: European Conference on Com-

Circuits and Systems for Video Technology. 2008;18(9):1293–1299

Applications of Signal Processing to Audio and Acoustics. IEEE; 2005

and industry. NASA, Moffett Field, CA, USA, NASATM-86847, 1985

Electronic Systems Magazine. 2004;19(1):53–64

and Electronic Systems. 2007;43(4):1523–1543

Computer Vision. 2012;6(1):1–12

72 Motion Tracking and Gesture Recognition

Electronics Letters. 2008;44(10):622–623

(CVPR'05). IEEE; 2005

Conference. IEEE; 2006

puter Vision. Springer; 2006

wood, MA: Artech House, 1999

Storrs, CT: University of Connecticut; 1995

transition model. In: ICASSP (2). Citeseer; 2005

