**Gesture Recognition by Using Depth Data: Comparison of Different Methodologies**

Grazia Cicirelli and Tiziana D'Orazio

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/68118

#### Abstract

In this chapter, the problem of gesture recognition in the context of human-computer interaction is considered. Several classifiers based on different approaches such as neural network (NN), support vector machine (SVM), hidden Markov model (HMM), deep neural network (DNN), and dynamic time warping (DTW) are used to build the gesture models. The performance of each methodology is evaluated considering different users performing the gestures. This performance analysis is required as the users perform gestures in a personalized way and with different velocities. So the problems concerning the different lengths of the gestures in terms of number of frames, the variability in their representation, and the generalization ability of the classifiers have been analyzed.

Keywords: gesture recognition, feature extraction, model learning, gesture segmentation, human-robot interface, Kinect camera

#### 1. Introduction

In the last decade, gesture recognition has been attracting a lot of attention as a natural way to interact with computer and/or robots through intentional movements of hands, arms, face, or body. A number of approaches have been proposed giving particular emphasis on hand gestures and facial expressions by the analysis of images acquired by conventional RGB cameras [1, 2].

The recent introduction of low-cost depth sensors, such as the Kinect camera, allowed the spreading of new gesture recognition approaches and the possibility of developing personalized human-computer interfaces [3, 4]. The Kinect camera provides RGB images together with depth information, so the 3D structure of the scene is immediately available. This allows us to easily manage many tasks such as people segmentation and tracking, body part recognition, motion estimation, and so on. Recently, human activity recognition and motion analysis from 3D data have been reviewed in a number of interesting works [5–8].

At present, Gesture Recognition through visual and depth information is one of the main active research topics in the computer vision community. The launch on the market of the popular Kinect, by Microsoft, influenced video-based recognition tasks such as object detection and classification and, in particular, increased the research interest in gesture/activity recognition. The Kinect provides synchronized depth and color (RGB) images, where each pixel corresponds to an estimate of the distance between the sensor and the closest object in the scene together with the RGB values at each pixel location. Together with the sensor, some software libraries are also available that make it possible to detect and track one or more people in the scene and to extract the corresponding human skeleton in real time. The availability of information about joint coordinates and orientations has given a great impulse to research on gesture and activity recognition [9–14].

Many papers presented in the literature in the last years use normalized coordinates of a proper subset of skeleton joints, which are able to characterize the movements of the body parts involved in the gestures [15, 16]. Angular information between joint vectors has been used as a feature to eliminate the need of normalization in Ref. [17].

Different methods have been used to generate gesture models. Hidden Markov models (HMMs) are a common choice for gesture recognition as they are able to model sequential data over time [18, 19]. Usually HMMs require sequences of discrete symbols, so different quantization schemes are first used to quantize the features which characterize the gestures. Support vector machines (SVMs) reduce the classification problem into multiple binary classifications, either by applying a one-versus-all (OVA-SVM) strategy (with a total of N classifiers for N classes) [20, 21] or a one-versus-one (OVO-SVM) strategy (with a total of N(N−1)/2 classifiers for N classes) [22, 23]. Artificial neural networks (ANNs) represent another alternative methodology to solve classification problems in the context of gesture recognition [24]. The choice of the network topology, the number of nodes/layers and the node activation functions depends on the problem complexity and can be fixed by using iterative processes which run until the optimal parameters are found [25].

Distance-based approaches are also used in gesture recognition problems. They use distance metrics for measuring the similarity between samples and gesture models. In order to apply any metric for making comparisons, these methods have to manage the problem related to the different lengths of feature sequences. Several solutions have been proposed in the literature: the Dynamic Time Warping (DTW) technique [26] is the most commonly used. It calculates an optimal match between two sequences that are nonlinearly aligned. A frame-filling algorithm is proposed in Ref. [27] to first align gesture data; then an eigenspace-based method (called Eigen3Dgesture) is applied for recognizing human gestures.

In recent years, the growing interest in automatically learning the specific representation needed for recognition or classification has fostered the emergence of deep learning architectures [28]. Rather than using handcrafted features as in conventional machine learning techniques, deep neural architectures are applied to learn representations of data at multiple levels of abstraction, in order to reduce the dimensionality of feature vectors and to extract relevant features at a higher level. Recently, several approaches have been proposed, such as in Refs. [29, 30]. In Ref. [29], a method for gesture detection and localization based on multiscale and multimodal deep learning is presented. Both temporal and spatial scales are managed by employing a multimodal convolutional neural network. Similarly, in Ref. [30], a multimodal gesture segmentation and recognition method, called deep dynamic neural networks, is presented. A semisupervised hierarchical dynamic framework based on a hidden Markov model is proposed for simultaneous gesture segmentation and recognition.


In this chapter, we compare different methodologies to approach the problem of Gesture Recognition in order to develop a natural human-robot interface with good generalization ability. Ten gestures performed by one user in front of a Kinect camera are used to train several classifiers based on different approaches such as dynamic time warping (DTW), neural network (NN), support vector machine (SVM), hidden Markov model (HMM), and deep neural network (DNN).

The performance of each methodology is evaluated considering several tests carried out on depth video streams of gestures performed by different users (different from the one used for the training phase). This performance analysis is required as users perform gestures in a personalized way and with different velocities. Even the same user executes gestures differently in separate video acquisition sessions. Furthermore, contrary to the case of static gesture recognition, in the case of depth videos captured live the problem of gesture segmentation must be addressed. During the test phase, we apply a sliding window approach to extract sequences of frames to be processed and recognized as gestures. Notice that the training set contains gestures which are accompanied by the relative ground truth labels and are well defined by their start and end points. Testing live video streams, instead, involves several challenging problems such as the identification of the starting/ending frames of a gesture, the different lengths of the different types of gestures and, finally, the different speeds of execution. The analysis of the performance of the different methodologies allows us to select, among the set of available gestures, the ones which are best recognized together with the best classifier, in order to construct a robust human-robot interface.

In this chapter, we consider all the mentioned challenging problems. In particular, the fundamental steps that characterize an automatic gesture recognition system will be analyzed: (1) feature extraction, which involves the definition of the features that better and distinctively characterize a specific movement or posture; (2) gesture recognition, which is seen as a classification problem in which examples of gestures are used in supervised and semisupervised learning schemes to model the gestures; (3) spatiotemporal segmentation, which is necessary for determining, in a video sequence, where the dynamic gestures are located, i.e., when they start and end.

The rest of the chapter is organized as follows. The overall description of the problem and the definition of the gestures are given in Section 2. The definition of the features is provided in Section 3. The methodologies selected for the gesture model generation are described in Section 4, whereas the prediction stage is described in Section 5. Section 6 presents the experiments carried out both in the learning and prediction stages. Furthermore, details on gesture segmentation are given in the same section. Finally, Section 7 presents the final conclusions and delineates some future works.

#### 2. Problem definition

In this chapter, we consider the problems related to the development of a gesture recognition interface, giving a panoramic view and comparing the most commonly used methodologies of machine learning theory. To this aim, the Kinect camera is used to record video sequences of different users while they perform predefined gestures in front of it. The OpenNI library is used to detect and segment the user in the scene in order to obtain the information of the joints of the user's body. Ten different gestures have been defined. They are pictured in Figure 1. Throughout the chapter the gestures will be referred to by using the following symbols: G1, G2, G3, …, GN, where N = 10. Some gestures are quite similar in terms of variations of joint orientations; the only difference is the plane in which the bones of the arm rotate. This is the case, for example, of gestures G9 and G4 or G1 and G8. Furthermore, some gestures involve movements in a plane parallel to the camera (G1, G3, G4, G7), while others involve a forward motion in a plane perpendicular to the camera (G2, G5, G6, G8, G9, G10). In the latter case, instability in detecting some joints can occur due to auto-occlusions.

The proposed approaches for gesture recognition involve three main stages: a feature selection stage, a learning stage, and a prediction stage. Firstly, the human skeleton information, captured and returned by the depth camera, is converted into representative and discriminant features. These features are used during the learning stage to learn the gesture model. In this chapter, different methodologies are applied and compared in order to construct the gesture model. Some methodologies are based on a supervised or semisupervised process, such as neural network (NN), support vector machine (SVM), hidden Markov model (HMM), and deep neural network (DNN); dynamic time warping (DTW) is a distance-based approach, instead. Finally, during the prediction stage, new video sequences of gestures are tested by using the learned models. The following sections describe in detail each stage previously introduced.

Figure 1. Ten different gestures are shown. Gestures G1, G3, G4, and G7 involve movements in a plane parallel to the camera. Gestures G2, G5, G6, G8, G9, and G10 involve a forward motion in a plane perpendicular to the camera.

#### 3. Feature selection


The complexity of the gestures strictly affects the feature selection and the choice of the methodology for the construction of the gesture model. If gestures are distinct enough, the recognition can be easy and reliable. So the coordinates of the joints, which are immediately available from the Kinect software platform, could be sufficient. In this case a preliminary normalization is required in order to guarantee invariance with respect to the height of the users and to their distance and orientation with respect to the camera. On the other hand, the angular information of joint vectors has the great advantage of maximizing the invariance of the skeletal representation with respect to the camera position. In Ref. [31], the angles between the vectors generated by the elbow-wrist joints and the shoulder-elbow joints are used to generate the models of the gestures. The experimental results, however, prove that these features are not discriminant enough to distinguish all the gestures.

In our approach, we use more complex features that represent orientations and rotations of a rigid body in three dimensions. The quaternions of two joints (shoulder and elbow) of the left arm are used. A quaternion comprises a scalar component and a vector component in complex space and is generally represented in the following form:

$$
q = a + b\mathbf{i} + c\mathbf{j} + d\mathbf{k} \tag{1}
$$

where the coefficients a, b, c, d are real numbers and i, j, k are the fundamental quaternion units. The quaternions are extremely efficient to represent three-dimensional rotations as they combine the rotation angles together with the rotation axes. In this work, the quaternions of the shoulder and elbow joints are used to define a feature vector Vi for each frame i:

$$V_i = [a_i^s, b_i^s, c_i^s, d_i^s, a_i^e, b_i^e, c_i^e, d_i^e] \tag{2}$$

where the index s stands for shoulder and e stands for elbow. The sequence of vectors of a whole gesture execution is defined by the following vector:

$$\overline{V} = [V_1, V_2, \dots, V_n] \tag{3}$$

where n is the number of frames during which the gesture is entirely performed.
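As an illustrative sketch, the following Python/NumPy code builds the per-frame vector of Eq. (2) and stacks the frames into the sequence of Eq. (3). It assumes that the shoulder and elbow quaternions have already been extracted from the skeleton stream; the function names are ours and do not refer to any specific SDK.

```python
import numpy as np

def frame_feature(shoulder_q, elbow_q):
    """Build the 8-dimensional feature vector V_i of Eq. (2) for one frame.

    shoulder_q, elbow_q: iterables (a, b, c, d) with the quaternion
    coefficients of the left shoulder and left elbow joints."""
    return np.concatenate([np.asarray(shoulder_q, dtype=float),
                           np.asarray(elbow_q, dtype=float)])

def gesture_feature_sequence(shoulder_quats, elbow_quats):
    """Stack the per-frame vectors into the sequence V of Eq. (3).

    shoulder_quats, elbow_quats: lists of length n (one quaternion per frame).
    Returns an (n, 8) array; n varies with the duration of the execution."""
    return np.stack([frame_feature(s, e)
                     for s, e in zip(shoulder_quats, elbow_quats)])
```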

#### 4. Learning stage: gesture model construction

The learning stage regards the construction of the gesture model. As introduced in Section 1, machine learning algorithms are largely and successfully applied to gesture recognition. In this context, gesture recognition is considered as a classification problem. So, under this perspective, a number of gesture templates are collected, opportunely labeled with the class labels (supervised learning) and used to train a learning scheme in order to learn a classification model. The constructed model is afterwards used to predict the class label of unknown templates of gestures.

In this chapter, different learning methodologies are applied to learn the gesture model. For each of them, the best parameter configuration and the best architecture topology which assure the convergence of each methodology are selected. Artificial neural networks (ANNs), support vector machines (SVMs), hidden Markov models (HMMs), and deep neural networks (DNNs) are the machine learning algorithms compared in this chapter. Furthermore a distance-based method, the dynamic time warping (DTW), is also applied and compared with the aforementioned algorithms. The following subsections will give a brief introduction of each algorithm and some details on how they are applied to solve the proposed gesture recognition problem.

#### 4.1. Neural network

A neural network is a computational system that simulates the way biological neural systems process information [32]. It consists of a large number of highly interconnected processing units (neurons), typically distributed on multiple layers. The learning process involves successive adjustments of the connection weights, through an iterative training procedure, until no further improvement occurs or until the error drops below some predefined reasonable threshold. Training is accomplished by presenting pairs of input/output examples to the network (supervised learning).

In this work, 10 different neural networks have been used to learn the models of the defined gestures. The architecture of each NN consists of an input layer, one hidden layer, and an output layer with a single neuron. The back-propagation algorithm is applied during the learning process. Each training set contains the templates of one gesture as positive examples and those of all the others as negative ones. As each gesture execution lasts a different number of frames, a preliminary normalization of the feature vectors has been carried out by using linear interpolation. Linear interpolation to resample the number of features is a good compromise between computational burden and quality of results. The length of a feature vector V, which describes one single gesture, has been fixed to n = 60. This length has been chosen considering the average time of execution of each type of gesture, which is about 2 seconds, and the sample rate of the Kinect camera, which is 30 Hz.
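A minimal sketch of this normalization step is given below, assuming each gesture is stored as an (n, 8) array of quaternion features as in Section 3; the resampled sequence is flattened into a single vector of 60 × 8 = 480 components.

```python
import numpy as np

N_FRAMES = 60  # fixed length: ~2 s of gesture at the Kinect rate of 30 Hz

def resample_gesture(features, n_frames=N_FRAMES):
    """Resample an (n, 8) gesture sequence to (n_frames, 8) by linear
    interpolation along the time axis, then flatten it into a single
    feature vector of n_frames * 8 = 480 components."""
    features = np.asarray(features, dtype=float)
    t_old = np.linspace(0.0, 1.0, num=len(features))
    t_new = np.linspace(0.0, 1.0, num=n_frames)
    resampled = np.column_stack(
        [np.interp(t_new, t_old, features[:, c]) for c in range(features.shape[1])])
    return resampled.ravel()
```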

#### 4.2. Support vector machine

Support vector machine is a supervised learning algorithm widely used in classification problems [33]. The peculiarity of SVM is that of finding the optimal separating hyperplane between the negative and positive examples of the training set. The optimal hyperplane is defined as the maximum margin hyperplane, i.e., the one for which the distance between the hyperplane (decision surface) and the closest data points is maximum. It can be shown that the optimal hyperplane is fully specified by a subset of data called support vectors which lie nearest to it, exactly on the margin.

In this work, SVMs have been applied considering the one-versus-one strategy. This strategy builds a two-class classifier for each pair of gesture classes. In our case, the total number of SVMs is defined by:

$$M = \frac{N(N-1)}{2} \tag{4}$$

where N is the number of gesture classes. The training set of each SVM contains the examples of the two gesture classes for which the current classifier is built. As in the case of NNs, the feature vectors are preliminarily normalized to the same length n.

#### 4.3. Hidden Markov model


Hidden Markov model is a statistical model which assumes that the system to be modeled is a Markov process. Even if the theory of HMMs dates back to the late 1960s, their widespread application occurred only within the past several years [34, 35]. Their successful application to speech recognition problems motivated their diffusion in gesture recognition as well. An HMM consists of a set of unobserved (hidden) states, a state transition probability matrix defining the transition probabilities among states, and an observation or emission probability matrix which defines the output model. The goal is to learn the best set of state transition and emission probabilities, given a set of observations. These probabilities completely define the model.

In this work, one discrete hidden Markov model is learnt for each gesture class. The feature vectors of each training set, which represent the observations, are firstly normalized and then discretized by applying a K-means algorithm. A fully connected HMM topology and the Baum-Welch algorithm have been applied to learn the optimal transition and emission probabilities.
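The quantization step can be sketched as follows, here with scikit-learn's K-means as the clustering routine; the codebook size is an assumption, since the chapter does not report the number of symbols actually used. The Baum-Welch training of the discrete HMM itself is left to any standard HMM library.

```python
import numpy as np
from sklearn.cluster import KMeans

N_SYMBOLS = 16  # assumed codebook size (not reported in the chapter)

def build_codebook(training_frames, n_symbols=N_SYMBOLS):
    """Cluster the 8-dimensional frame features of a gesture class and
    return the fitted K-means model used as quantizer."""
    return KMeans(n_clusters=n_symbols, n_init=10, random_state=0).fit(training_frames)

def quantize_sequence(kmeans, gesture_frames):
    """Map an (n, 8) gesture sequence to a sequence of n discrete symbols,
    i.e., the observation sequence fed to the discrete HMM."""
    return kmeans.predict(np.asarray(gesture_frames, dtype=float))
```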

#### 4.4. Deep neural network

Deep learning is a relatively new branch of machine learning research [28]. Its objective is to learn features automatically at multiple levels of abstraction exploiting an unsupervised learning algorithm at each layer [36]. At each level a new data representation is learnt and used as input to the successive level. Once a good representation of data has been found, a supervised stage is performed to train the top level. A final supervised fine-tuning stage of the entire architecture completes the training phase and improves the results. The number of levels defines the deepness of the architecture.

In this work, a deep neural network with 10 output nodes (one for each class of gesture) is constructed. It comprises two levels of unsupervised autoencoders and a supervised top level. The autoencoders are used to learn a lower dimensional representation of the feature vectors at a higher level of abstraction. An autoencoder is a neural network which is trained to reconstruct its own input. It comprises an encoder, which maps the input to the new representation of the data, and a decoder, which reconstructs the original input. We use two autoencoders with one hidden layer each. The number of hidden neurons represents the dimension of the new data representation. The feature vectors of the training set are firstly normalized, as described in Section 4.1, and fed into the first autoencoder. Then the features generated by the first autoencoder are used as input to the second one. The size of the hidden layer for both the first and second autoencoder has been fixed to half the size of the input vector. The features learnt by the last autoencoder are given as input to the supervised top level, implemented by using a softmax function trained with a scaled conjugate gradient algorithm [37]. Finally, the different levels are stacked to form the deep network and its parameters are fine-tuned by performing back-propagation using the training data in a supervised fashion.
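A compact sketch of this architecture is shown below, using Keras as an illustrative framework (not the toolchain used in the chapter). The sigmoid activations, the Adam optimizer, and the interpretation of "half the size of the input vector" as 480 → 240 → 120 are our assumptions; in particular, a standard gradient-based optimizer replaces the scaled conjugate gradient used for the top level.

```python
from tensorflow import keras
from tensorflow.keras import layers

INPUT_DIM = 480   # 60 frames x 8 quaternion components
N_CLASSES = 10

def train_autoencoder(data, hidden_dim, epochs=200):
    """Unsupervised pretraining: learn a compressed representation of `data`."""
    x = keras.Input(shape=(data.shape[1],))
    code = layers.Dense(hidden_dim, activation="sigmoid")(x)
    rec = layers.Dense(data.shape[1])(code)
    ae = keras.Model(x, rec)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(data, data, epochs=epochs, verbose=0)
    return keras.Model(x, code)          # keep only the encoder part

def build_deep_network(X, y):
    """Two stacked autoencoders + softmax top level, then supervised fine-tuning."""
    enc1 = train_autoencoder(X, INPUT_DIM // 2)                       # 480 -> 240
    h1 = enc1.predict(X, verbose=0)
    enc2 = train_autoencoder(h1, INPUT_DIM // 4)                      # 240 -> 120
    # Stack the encoder layers and a softmax output, reusing the pretrained weights.
    model = keras.Sequential([
        keras.Input(shape=(INPUT_DIM,)),
        layers.Dense(INPUT_DIM // 2, activation="sigmoid"),
        layers.Dense(INPUT_DIM // 4, activation="sigmoid"),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.layers[0].set_weights(enc1.layers[1].get_weights())
    model.layers[1].set_weights(enc2.layers[1].get_weights())
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=100, verbose=0)                            # fine-tuning
    return model
```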

#### 4.5. Dynamic time warping

DTW is a different technique with respect to the previously described ones, as it is a distance-based algorithm. Its peculiarity is to find the ideal alignment (warping) of two time-dependent sequences considering their synchronization. For each pair of elements of the sequences, a cost matrix, also referred to as the local distance matrix, is computed by using a distance measure. Then the goal is to find the minimal cost path through this matrix. This optimal path defines the ideal alignment of the two sequences [38]. DTW is successfully applied to compare sequences that are altered by noise or by speed variations. Originally, the main application field of DTW was automatic speech processing [39], where variation in speed appears concretely. Successively, DTW found its application in movement recognition, where variation in speed is of major importance, too.

In this work, DTW is applied to compare the feature vectors in order to measure how different they are for solving the classification problem. Differently from the previously described methodologies, the preliminary normalization of feature vectors is not required due to the warping peculiarity of DTW algorithm. For each class of gesture, one target feature vector is selected. This is accomplished by applying DTW to the set of training samples inside each gesture class. The one with the minimum distance from all the other samples of the same class is chosen as target gesture. Each target gesture will be used in the successive prediction stage for classification.
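The distance computation and the selection of the target gesture can be sketched as follows; the quadratic dynamic-programming formulation of DTW is used here, with the Euclidean distance between frame features as the local cost.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Classical DTW between two gesture sequences of shape (n, 8) and (m, 8)."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

def select_target_gesture(samples):
    """Pick, inside one gesture class, the training sample with the minimum
    total DTW distance from all the other samples of the same class."""
    totals = [sum(dtw_distance(s, t) for t in samples) for s in samples]
    return samples[int(np.argmin(totals))]
```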

#### 5. Prediction stage: gesture model testing

In the prediction stage, also referred to as the testing stage, video sequences with unknown gestures are classified by using the learnt gesture models. This stage allows us to compare the recognition performance of the methodologies introduced in the learning stage. These methodologies have been applied by using different strategies, as described in the following.

In the case of NN, 10 classifiers have been trained, one for each class. So the feature vector of a new gesture sample is fed into all the classifiers and the sample is assigned to the class with the maximum output value.

In the case of SVM, instead, a max-win voting strategy has been applied. The trained SVMs are 45 two-class classifiers. When a classifier receives a gesture sample as input, it classifies it into one of its two classes, and the winning class gets one vote. When all the 45 votes have been assigned, the instance of the gesture is classified into the class with the maximum number of votes.
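A sketch of the max-win voting rule is given below; the dictionary of pairwise classifiers and its predict() interface are illustrative assumptions.

```python
from itertools import combinations
import numpy as np

def ovo_predict(sample, classifiers, n_classes=10):
    """Max-win voting over the N(N-1)/2 = 45 pairwise SVMs.

    `classifiers` is assumed to map each class pair (i, j), i < j, to a
    trained two-class classifier whose predict() returns either i or j."""
    votes = np.zeros(n_classes, dtype=int)
    for (i, j) in combinations(range(n_classes), 2):
        winner = classifiers[(i, j)].predict([sample])[0]
        votes[winner] += 1
    return int(np.argmax(votes))   # class with the maximum number of votes
```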

In the case of HMM, 10 HMMs have been learnt during the learning stage, one for each class of gesture. As introduced in Section 4.3 the model of each class is specified by the transition and emission probabilities learnt in the learning stage. When a gesture instance is given as input to the HMM, this computes the probability of that instance given the model. The class of the HMM returning the maximum probability is the winning class.

In the case of DNN, as described in Section 4.4, the deep architecture, constructed in the learning stage, has 10 output nodes. So, when a gesture sample is inputted in the network for prediction, the winning class is simply the one relative to the node with the maximum output value.

Finally, for what concerns the DTW case, the target gestures, found during the learning stage, are used to predict the class of new gesture instances. The distances between the unknown gesture sample and the 10 target gestures are computed. The winning class is that of the target gesture with minimum distance.

#### 6. Experiments


In this section, the experiments carried out in order to evaluate the performance of the analyzed methodologies will be described and the obtained results will be shown and compared. In particular, the experiments conducted in the learning stage and in the prediction stage will be detailed separately for greater clarity of presentation.

Several video sequences of gestures performed by different users have been acquired by using a Kinect camera. Sequences of the same users in different sessions (e.g., on different days) have also been acquired in order to have a wide variety of data. The length of each sequence is about 1000 frames. The users have been requested to execute gestures standing in front of the Kinect, by using the left arm and without pauses between one gesture execution and the next. The distance between the Kinect and the user is not fixed. The only constraint is that the whole user's body has to be seen by the sensor, so that the skeleton data can be detected by using the OpenNI library. These data are recorded for each frame of the sequence.

#### 6.1. Learning stage

As described in Section 4, the objective of the learning stage is to construct or, more specifically, to learn a gesture model. In order to reach this goal, the first step is the construction of the training datasets. The idea of using public datasets has been discarded as they do not assure that real situations are managed. Furthermore, they contain sample gestures which are acquired mainly in the same conditions. We have decided to use a set of gestures chosen by us (see Figure 1), which have been selected from the "Arm-and-Hand Signals for Ground Forces" [40].

The video sequences of only one user (afterward referred to as the Training User) are considered for building the training sets. Each sequence contains several executions of the same gesture without idle frames between one instance and the next. In this stage, we manually segment the training streams into gesture instances in order to guarantee that each extracted subsequence contains exactly one gesture execution. Then each instance is converted into a feature vector by using the skeleton data as described in Section 3. Notice that feature vectors V can have different lengths, because each gesture execution lasts a different number of frames and users execute gestures with different speeds. Part of the obtained feature vectors are used for training and the rest for validation.

The second step of the learning stage is the construction of the gesture model by using the methodologies described in Section 4. A preliminary normalization of feature vectors to the same length is needed in the cases of NN, SVM, HMM, and DNN. As described in Section 4.1, n has been fixed to 60. So each normalized feature vector V has 480 components which have been defined by using the quaternion coefficients of shoulder and elbow joints (see Eqs. (2) and (3)). In the case of DTW this normalization is not required.

For each methodology, different models can be learnt depending on the parameters of the methodology. These parameters can be structural, such as the number of hidden nodes in the NN architecture or in the autoencoder or the number of hidden states in an HMM, or they can be tuning parameters, as in the case of SVM. So different experiments have been carried out for selecting the optimal parameters inside each methodology. Optimal parameters are intended as those which provide a good compromise between over-fitting and prediction error over the validation set.

#### 6.2. Prediction stage

The prediction stage represents the recognition phase which allows us to compare the performance of each methodology. In this phase the class labels of feature vectors are predicted based on the learnt gesture model. Differently from the training phase, which can be defined as an off-line phase, the prediction stage can be defined as an on-line stage. In this case the video sequences of six different users (excluding the Training User) have been properly processed by using an approach that works when live video sequences have to be tested. Differently from the learning stage, where gesture instances were manually selected from the sequences and were directly available for training the classifiers, in the prediction stage the sequences need to be opportunely processed by applying a gesture segmentation approach. This process involves several challenging problems such as the identification of the starting/ending points of a gesture instance, the different lengths of the different classes of gestures and, finally, the different speeds of execution.

In this work, the sequences are processed by using a sliding window approach, where a window slides forward over the sequence one frame at a time in order to extract subsequences. First, the dimension of the sliding window must be defined. As there are no idle frames among successive gesture executions, an algorithm based on the Fast Fourier Transform (FFT) has been applied in order to estimate the duration of each gesture execution [41]. As each sequence contains several repetitions of the same gesture, it is possible to approximate the sequence of features as a periodic signal. By applying the FFT and taking the position of the fundamental harmonic component, the period can be evaluated as the reciprocal of the peak position. The estimated period is then used to define the sliding window's dimension in order to extract subsequences of features from the original sequence. Each subsequence represents the feature vector which is then normalized (if required) and provided as input to the classifier, which returns a prediction label for the current vector. In order to construct a more robust human-computer interface, a further verification check has been introduced before the final decision is taken. This process has been implemented by using a max-voting scheme on 10 consecutive answers of the classifier, obtained by testing 10 consecutive subsequences. The final decision is that relative to the class label with the maximum number of votes.
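The following sketch illustrates the idea: the window size is estimated from the fundamental harmonic of one feature component, and the final decision is taken by max-voting over the last 10 window classifications. The minimum-period guard and the choice of which component to analyze are our assumptions; the actual period-estimation algorithm of Ref. [41] may differ in its details.

```python
import numpy as np
from collections import Counter, deque

def estimate_period(signal, min_period=20):
    """Estimate the gesture duration (in frames) from a 1-D feature signal
    (e.g., one quaternion component) by locating the fundamental harmonic."""
    x = np.asarray(signal, dtype=float) - np.mean(signal)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x))            # cycles per frame
    valid = freqs > 0                           # skip the DC component
    peak = freqs[valid][np.argmax(spectrum[valid])]
    return max(int(round(1.0 / peak)), min_period)

def classify_stream(frames, classify, window, history=10):
    """Slide a window one frame at a time, classify each subsequence and
    confirm a gesture by max-voting over the last `history` predictions."""
    recent = deque(maxlen=history)
    decisions = []
    for start in range(0, len(frames) - window + 1):
        recent.append(classify(frames[start:start + window]))
        if len(recent) == history:
            decisions.append(Counter(recent).most_common(1)[0][0])
    return decisions
```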

#### 6.3. Results and discussion


In Figures 2–7, the recognition rates obtained by testing the classifiers on a number of sequences performed by six different users are reported. For each user the plotted rates have been obtained by averaging the results over three testing sequences. As can be observed, the classifiers behave in a very different way due to the personalized execution of gestures by the users. Furthermore, there are cases where some classifiers fail in assigning the correct class. This is, for example, the case of gestures G2 and G4 performed by User 6 (see Figure 7). DTW has 0% detection rate for G2, whereas NN has 0% detection rate for G4. The same happens for gesture G9 performed by User 2 (see Figure 3), which is rarely recognized by all the classifiers, as well as G3 performed by User 5 (see Figure 6).

Figure 2. Recognition rates obtained by testing each method on sequences of gestures performed by User 1.

Figure 3. Recognition rates obtained by testing each method on sequences of gestures performed by User 2.

Figure 4. Recognition rates obtained by testing each method on sequences of gestures performed by User 3.


Figure 5. Recognition rates obtained by testing each method on sequences of gestures performed by User 4.


Figure 6. Recognition rates obtained by testing each method on sequences of gestures performed by User 5.

Figure 7. Recognition rates obtained by testing each method on sequences of gestures performed by User 6.

In order to analyze the performance of the classifiers when the same user is used in the learning and prediction phases, an additional experiment has been carried out: the Training User has been asked to perform the gestures again. Figure 8 shows the obtained recognition rates. These results confirm the variability of the classifiers' performance even if the same user is used for training and testing the classifiers.

Figure 8. Recognition rates obtained by testing each method on sequences of gestures performed by the Training User in a session different from the one used for the learning phase.

The obtained results confirm that it is difficult to determine the superiority of one classifier over the others because of the large number of variables involved, which do not guarantee a uniqueness of gesture execution. These are, for example: the different relative positions between users and camera, the different orientations of the arm, the different amplitudes of the movements, and so on. All these factors can greatly modify the resulting skeletons and joint positions, producing large variations in the extracted features.

Some important conclusions can be drawn from the experiments that have been carried out: the solution of using only one user to train the classifiers can be pursued, as the recognition rates are quite good even if the gestures are performed in a personalized way.

Another point concerns the complexity of the gestures used in our experiments. The results show that the failures are principally due either to the strict similarity between different gestures or to the fact that the gestures which involve a movement perpendicular with respect to the camera (not in the lateral plane) can produce false skeleton postures and consequently features affected by errors.

Moreover, some gestures have parts of the movement in common. Figures 9 and 10 have been pictured to better explain these problems.

Figure 9 shows the results obtained by testing the first 1000 frames of a sequence of gesture G3 executed by User 1. Each plot in the figure represents the output of one classifier: DTW, NN, SVM, HMM, and DNN, respectively. As can be seen, in the case of DTW, SVM, and DNN, gesture G3 is frequently misclassified as gesture G4. Both gestures are executed in a plane parallel to the camera: G3 involves the rotation of the whole arm, whereas G4 involves the rotation of the forearm only (as can be seen in Figure 11). Notice that the misclassification happens principally in the starting part of gesture G3, which is very similar to the starting part of G4; therefore, they can be easily mistaken.

Furthermore, in Figure 9, it is worth noticing the good generalization ability of NN and HMM. As can be seen, in these cases both classifiers are always able to recognize the gesture even when the sliding window covers the frames between two successive gesture executions.

An additional observation can be made considering G1 and G8 as an example. In Figure 12, notice that gesture G1 and gesture G8 involve the same rotations of the forearm, but performed in different planes with respect to the camera (the lateral one in the case of G1 and the frontal one in the case of G8). It is evident that a slightly different orientation of the user in front of the camera while performing gesture G1 (resp. G8) could generate skeletons quite similar to those obtained by performing gesture G8 (resp. G1). Figure 10 shows the results relative to this case. As can be seen, gesture G8 is sometimes misclassified as gesture G1 by DTW and SVM. A few misclassifications of gesture G8 as G6 are also present, since G8 and G6 have some parts of the movement in common.

Figure 9. Recognition results relative to the first 1000 frames of a test sequence relative to gesture G3 performed by User 1. The x-axis represents the frame number of the sequence and the y-axis represents the gesture classes ranging from 1 to 10 (the range 0–11 has been used only for displaying purposes). The red line denotes the ground truth label (G3 in this case), whereas the blue one represents the predicted labels obtained from the classifiers.

Figure 11. Gestures G3 and G4. Both gestures involve a rotation of the arm in a plane parallel to the camera.

#### 6.4. Statistical evaluation



The analysis of the performance of the different methodologies, presented above, allows us to draw some important conclusions that must be considered in order to build a robust human-robot interface. The recognition is highly influenced by the following elements: the subjectivity of the users, the complexity of the gestures, and the recognition performance of the applied methodology. In order to give an overall evaluation of the experimental results, a statistical analysis of the conducted tests has to be done. The F-score, also known as F-measure or F1-score, has been considered as the global performance metric [42]. It is defined by the following equation:

$$F = \frac{2TP}{2TP + FP + FN} \tag{5}$$

where TP, FP, and FN are the true positives, false positives, and false negatives, respectively. The best values for the F-score are those close to 1, whereas the worst are those close to 0. This measure captures information mainly on how well a model handles positive examples.

Figure 13 shows the F-score values obtained for each methodology and for each gesture, averaged over all users. As can be seen, each methodology behaves differently among the set of available gestures: SVM, for example, has an F-score close to 1 for G1 and G8, whereas DNN has its maximum F-score in the case of G2 or G4. Figure 13 highlights another important aspect: some gestures are better recognized than others. This is the case, for example, of G8 or G4, for which the F-score reaches high values whatever methodology is applied. On the contrary, gestures such as G5 or G7 are generally badly recognized by every methodology. These considerations are very useful as they allow us to select a subset of gestures and, for each of them, the best methodology in order to build a robust human-robot interface. To this aim, a threshold (= 0.85) can be fixed for the F-score values and the gestures that have at least one classifier with F-score above this threshold can be selected. Looking at Figure 13, these gestures are: G1, G2, G4, G8, G9, and G10. For each selected gesture the classifier with the maximum F-score can be chosen: SVM for G1, DNN for G2 and G4, SVM for G8, DTW for G9, and finally SVM for G10. This set of gestures with the relative best classifiers can be used to build the human-robot interface.
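The selection rule described above can be summarized by the following sketch; the F-score matrix is assumed to hold the averaged values plotted in Figure 13, and any numeric values used in a test would be hypothetical.

```python
import numpy as np

def f_score(tp, fp, fn):
    """F-score of Eq. (5)."""
    return 2 * tp / (2 * tp + fp + fn)

def select_interface_gestures(f_scores, methods, gestures, threshold=0.85):
    """f_scores: (n_gestures, n_methods) array of averaged F-scores (Figure 13).
    For every gesture whose best F-score reaches the threshold, return the
    name of the best classifier."""
    selection = {}
    for gesture, row in zip(gestures, np.asarray(f_scores, dtype=float)):
        best = int(np.argmax(row))
        if row[best] >= threshold:
            selection[gesture] = methods[best]
    return selection

# Usage sketch: methods = ["DTW", "NN", "SVM", "HMM", "DNN"] and
# gestures = ["G1", "G2", ..., "G10"], in the same order as the F-score matrix.
```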


Figure 10. Recognition results relative to the first 1000 frames of a test sequence relative to gesture G8 performed by User 6. The x-axis represents the frame number of the sequence and the y-axis represents the gesture classes ranging from 1 to 10 (the range 0–11 has been used only for displaying purposes). The red line denotes the ground truth label (G8 in this case), whereas the blue one represents the predicted labels obtained from the classifiers.


Figure 12. Gestures G1 and G8. Gesture G1 involves a movement in a plane parallel to the camera, whereas gesture G8 involves a movement in a plane perpendicular to the camera.

Figure 13. F-score values of all methodologies for each gesture averaged over all users.

#### 7. Conclusions


In this chapter, the problem of gesture recognition has been considered. Different methodologies have been tested in order to analyze the behavior of the resulting classifiers. In particular, neural network (NN), support vector machine (SVM), hidden Markov model (HMM), deep neural network (DNN), and dynamic time warping (DTW) approaches have been applied.

The results obtained during the experimental phase show the great heterogeneity of the tested classifiers. In this work, the majority of problems arise in part from the complexity of the gestures and in part from the variations introduced by the users. The classifiers perform differently, often exhibiting complementarity and redundancy. These peculiarities are very important for fusion. So, encouraged by these observations, we will concentrate our further investigations on the fusion of different classifiers in order to improve the overall performance and reduce the total error.

#### Author details

Grazia Cicirelli\* and Tiziana D'Orazio

\*Address all correspondence to: cicirelli@ba.issia.cnr.it

Institute of Intelligent Systems for Automation, National Research Council of Italy, Bari, Italy

#### References


[1] Habib Z, Bux A, Angelov P. Vision Based Human Activity Recognition: A Review. Vol. 513. Cham: Springer; 2017. pp. 341-371

[2] Hassan MH, Mishra PK. Hand gesture modeling and recognition using geometric features: A review. Canadian Journal on Image Processing and Computer Vision. 2012;3(1):12-26

[3] Jang F, Zhang S, Wu S, Gao Y, Zhao D. Multi-layered gesture recognition with Kinect. Journal of Machine Learning Research. 2015;16:227-254

[4] Traver VJ, Latorre-Carmona P, Salvador-Balaguer E, Pla F, Javidi B. Three-dimensional integral imaging for gesture recognition under occlusions. IEEE Signal Processing Letters. 2017;24(2):171-175

[5] D'Orazio T, Marani R, Renó V, Cicirelli G. Recent trends in gesture recognition: How depth data has improved classical approaches. Image and Vision Computing. 2016;52:56-72

[6] Presti LL, Cascia ML. 3D skeleton-based human action classification: A survey. Pattern Recognition. 2016;53:130-147

[7] Cheng H, Yang L, Liu Z. Survey on 3D hand gesture recognition. IEEE Transactions on Circuits and Systems for Video Technology. 2016;26(9):1659-1673

[8] Aggarwal JK, Xia L. Human activity recognition from 3D data: A review. Pattern Recognition Letters. 2014;48:70-80

[9] Cruz L, Lucio F, Velho L. Kinect and RGBD images: Challenges and applications. In: Proceedings of 25th SIBGRAPI IEEE Conference on Graphics, Patterns and Images Tutorials; IEEE Computer Society, Los Alamitos, USA; 2012. pp. 36-49

[10] Almetwally I, Mallem M. Real-time tele-operation and tele-walking of humanoid robot Nao using Kinect depth camera. In: Proceedings of 10th IEEE International Conference on Networking, Sensing and Control (ICNSC); IEEE Computer Society, Los Alamitos; 2013. pp. 463-466

[11] Jacob MG, Wachs JP. Context-based hand gesture recognition for the operating room. Pattern Recognition Letters. 2014;36:196-203

[23] Althloothi S, Mahoor MH, Zhang X, Voyles RM. Human activity recognition using multi-features and multiple kernel learning. Pattern Recognition. 2014;47:1800-1812

[24] Ibraheem NA, Khan RZ. Vision based gesture recognition using neural networks approaches: A review. International Journal of Human Computer Interaction. 2012;3(1):1-14

[25] Cicirelli G, Attolico C, Guaragnella C, D'Orazio T. A Kinect-based gesture recognition approach for a natural human robot interface. International Journal of Advanced Robotic Systems. 2015;12(3)

[26] Ruan X, Tian C. Dynamic gesture recognition based on improved DTW algorithm. In: Proceedings of IEEE International Conference on Mechatronics and Automation (ICMA); 2-5 August 2015; IEEE Computer Society, Los Alamitos, USA. pp. 2134-2138

[27] Ding IJ, Chang CW. An eigenspace-based method with a user adaptation scheme for human gesture recognition by using Kinect 3D data. Applied Mathematical Modelling. 2015;39(19):5769-5777

[28] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436-444

[29] Neverova N, Wolf C, Taylor G, Nebout F. Mod drop: Adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2016;38(8):1692-1706

[30] Wu D, Pigou L, Kindermans PJ, Le N, Shao L, Dambre J, Odobez JM. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2016;38(8):1583-1597

[31] D'Orazio T, Attolico C, Cicirelli G, Guaragnella C. A neural network approach for human gesture recognition with a Kinect sensor. In: International Conference on Pattern Recognition Applications and Methods (ICPRAM); March 2014; Angers, France. SCITEPRESS - Science and Technology Publications, Setubal, Portugal. pp. 741-746

[32] Haykin S. Neural Networks - A Comprehensive Foundation. 2nd ed. Upper Saddle River, NJ, USA: Prentice Hall PTR; 1998

[33] Vapnik V. The Nature of Statistical Learning Theory. Berlin: Springer; 1995

[34] Rabiner LR. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE. 1989;77(2):257-286

[35] Ghahramani Z. An introduction to Hidden Markov Models and Bayesian Networks. International Journal of Pattern Recognition and Artificial Intelligence. 2001;15(1):9-42

[36] Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313:504-507

[37] Møller MF. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks. 1993;6(4):525-533


### **Chapter 7**

## **Gait Recognition**

Jiande Sun, Yufei Wang and Jing Li

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/68119

#### **Abstract**

Gait recognition has received increasing attention as a remote biometric identification technology, i.e. it can achieve identification at long distances at which few other identification technologies can work. It shows enormous potential for application in the fields of criminal investigation, medical treatment, identity recognition, human-computer interaction and so on. In this chapter, we introduce the state-of-the-art gait recognition techniques, which include 3D-based and 2D-based methods, in the first part. Considering the advantages of 3D-based methods, their related datasets are introduced, as well as our gait database with both 2D silhouette images and 3D joint information, in the second part. Given our gait dataset, a human walking model and the corresponding static and dynamic feature extraction are presented, which are verified to be view-invariant, in the third part. Finally, some gait-based applications are introduced.

**Keywords:** gait recognition, gait dataset, 2D‐based, 3D‐based, view invariant

#### **1. Introduction**

Gait recognition has been paid lots of attention as one of the biometric identification technologies. There have been considerable theories supporting that a person's walking style is a unique behavioural characteristic, which can be used as a biometric. Differing from other biometric identification technologies such as face recognition, gait recognition is widely known as the most important non-contact, non-invasive biometric identification technology, which is hard to imitate. Given these advantages, gait recognition is expected to be applied in scenarios such as criminal investigation and access control. Usually gait recognition includes the following five steps, which are shown in **Figure 1**.


**Figure 1.** The steps in gait recognition.

#### Step 1: acquisition of gait data

The ways of acquiring the original gait data depend on how to recognize the gait. Usually the gait is acquired by single camera, multiple cameras, professional motion capture system (e.g. VICON) and camera with depth sensor (e.g. Kinect).

#### Step 2: pre‐processing

The pre-processing methods differ considerably depending on how the gait data are acquired. For instance, in some single camera-based methods, the pre-processing is usually background subtraction, which is used to get the body silhouette of the walking person. However, in Kinect-based methods, the pre-processing is to filter the noise out of the skeleton sequences.

#### Step 3: period extraction

Since human gait is a kind of periodic signal, a gait sequence may include several gait cycles. Gait period extraction is helpful to reduce the data redundancy because all the gait features can be included in one whole gait cycle.

#### Step 4: feature extraction

Various gait features are used in different kinds of gait recognition methods and they influence the performance of gait recognition. Gait features can be divided into hand-crafted and machine-learned features. The hand-crafted ones are easy to generalize to different datasets, while the machine-learned ones are usually better for the specific dataset.

#### Step 5: classification

Gait classification, i.e. gait recognition, is to use classifiers based on the gait features. The classifiers range from traditional ones, such as kNN (k-nearest neighbour), to modern ones, such as deep neural networks, which have achieved success in face recognition, handwriting recognition, speech recognition, etc.

Generally, gait recognition methods can be divided into 3D-based and 2D-based ones. The 2D-based gait recognition methods depend on the human silhouette captured by one 2D camera, which is the normal situation of video surveillance. The 2D-based gait recognition methods are dominant in the field of gait recognition and they are usually divided into model-based and model-free methods.

The model-based methods extract the information of the shape and dynamics of the human body from video sequences, establish a suitable skeleton or joint model by integrating the information and classify the individuals based on the variation of the parameters in such a model. Cunado et al. [1] modelled gait as an articulated pendulum and extracted the line via the dynamic Hough transform to represent the thigh in each frame, as shown in **Figure 2a**. Johnson et al. [2] identified people based on the static body parameters recovered from the walking action across multiple views, which can reduce the influence introduced by variation in view angle, as shown in **Figure 2b**. Guo et al. [3] modelled the human body structure from the silhouette by a stick figure model, which had 10 sticks articulated with six joints, as shown in **Figure 2c**. Using this model, the human motion can be recorded as a sequence of stick figure parameters, which can be the input of a BP neural network. Rohr [4] proposed a volumetric model for the analysis of human motion, using 14 elliptical cylinders to model the human body, as shown in **Figure 2d**. Tanawongsuwan et al. [5] projected the trajectories of lower body joint angles into the walking plane and made them time-normalized by dynamic time warping (DTW). Wang et al. [6] made a fusion between static and dynamic body features. Specifically, the static body feature is a compact representation obtained by Procrustes shape analysis. The dynamic body feature is extracted via a model-based approach, which can track the subject and recover joint-angle trajectories of the lower limbs, as shown in **Figure 2e**. Generally, the model-based gait recognition methods have better invariant properties and are better at handling occlusion, noise, scaling and view variation. However, model-based methods usually require a high resolution and a heavy computational cost.

On the other hand, model-free methods generate gait signatures directly based on the silhouettes, which are extracted from the video sequences, without fitting a model. Gait energy image (GEI) [7] is the most popular gait representation, which represents the spatial and temporal gait information in a grey image. GEI is generated by averaging silhouettes over a complete gait cycle and represents a human motion sequence in a single image while preserving

**Figure 2.** Examples of model‐based methods.


the temporal information, as shown in **Figure 3a**. Motion silhouette image (MSI) [8] is like GEI and is a grey image too. The intensity of an MSI is determined by a function of the temporal history of the motion of each pixel, as shown in **Figure 3b**. The intensity of an MSI represents motion information during one gait cycle. Because GEI and MSI represent both motion and appearance information, they are sensitive to changes in various covariate conditions such as carrying and clothing. Shape variation-based (SVB) frieze pattern is proposed in [9], as shown in **Figure 3c**, to improve their robustness against these changes. SVB frieze pattern projects the silhouettes horizontally and vertically to represent the gait information, and uses key frame subtraction to reduce the effects of appearance changes on the silhouettes. Although it has been shown that SVB frieze pattern can get better results when there are significant appearance changes, it does not outperform in the case of no changes, and it requires temporal alignment pre-processing for each gait cycle, which brings more computation load. Gait entropy image (GEnI) [10] is another gait representation, which is based on Shannon entropy. It encodes the randomness of pixel values in the silhouette images over a complete gait cycle, and it is more robust to appearance changes, such as carrying and clothing, as shown in **Figure 3d**. Wang et al. [11] propose the Chrono-Gait image (CGI), as shown in **Figure 3e**, to compress the silhouette images without losing too much temporal relationship between them. They utilize a colour mapping function to encode each gait contour image in the same gait sequence, and average over a quarter gait cycle to one CGI. It is helpful to preserve more temporal information of a gait cycle.
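To make the construction of a GEI concrete, the sketch below averages the binary silhouettes of one gait cycle into a single grey-level template. It is a minimal illustration assuming the silhouettes have already been segmented, aligned and cropped to a common size, which is not shown here; the random input in the usage line is only a stand-in for real silhouettes.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Average aligned binary silhouettes of one gait cycle into a GEI.

    silhouettes: array of shape (T, H, W) with values in {0, 1}.
    Returns a float image in [0, 1]; brighter pixels were covered by the body more often.
    """
    return np.asarray(silhouettes, dtype=np.float64).mean(axis=0)

# Stand-in usage; real silhouettes would come from background subtraction.
cycle = (np.random.rand(30, 128, 88) > 0.5).astype(np.uint8)
gei = gait_energy_image(cycle)
```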

The methods mentioned above all convert the gait sequence into a single image/template. There are other methods that keep the temporal information of gait sequences, which have good performance too. Sundaresan et al. [12] propose gait recognition methods based on hidden Markov models (HMMs) because the gait sequence is composed of a sequence of postures, which is suitable for HMM representation. In this method, the postures are regarded as the states of the HMM and are identical to individuals, which provides a means of discrimination. Wang et al. [13] apply principal component analysis (PCA) to extract statistical spatio-temporal features from the silhouette sequence and recognize gait in the low-dimensional eigenspace via supervised pattern classification techniques. Sudeep et al. [14] utilize the correlation of sequence pairs to preserve the spatio-temporal relationship between the gallery and probe sequences, and use it as the baseline for gait recognition.

The advantages of model‐free methods are computational efficiency and simplicity; however, the robustness against the variations of illumination, clothing, scaling and views still needs to be improved. Here, we focus on the view‐invariant gait recognition methods.

**Figure 3.** Examples of model‐free methods.

To date, 2D-based view-invariant gait recognition methods can be divided into pose-free and pose-based ones. The pose-free methods aim at extracting gait parameters independent of the view angle of the camera. Johnson et al. [2] present a gait recognition method to identify people based on static body parameters, which are extracted from the walking action across multiple views. Abdelkader et al. [15] propose to extract an image template corresponding to the person's motion blob from each frame. Subsequently, the self-similarity of the obtained template sequence is computed. On the other hand, the pose-based methods aim at synthesizing the lateral view of the human body from an arbitrary viewpoint. Kale et al. [16] show that if the person is far enough from the camera, it is possible to synthesize a side view from any of the other arbitrary views using a single camera. Goffredo et al. [17] use the human silhouette and human body anthropometric proportions to estimate the pose of the lower limbs in the image reference system with low computational cost. After a marker-less motion estimation, the trends of the obtained angles are corrected by a viewpoint-independent gait reconstruction algorithm, which can reconstruct the pose of the limbs in the sagittal plane for identification. Muramatsu et al. [18] propose an arbitrary view transformation model (AVTM) for cross-view gait matching. 3D gait volume sequences of training subjects are constructed, and then 2D gait silhouette sequences of the training subjects are generated by projecting the 3D gait volume sequences onto the same views as the target views. Finally, the AVTM is trained with gait features extracted from the 2D sequences. In the latest work [19], a deep convolutional neural network (CNN) is established and trained with a group of labelled multi-view human walking videos to carry out gait-based human identification via similarity learning. The method is evaluated on the CASIA-B, OU-ISIR and USF datasets and outperforms the previous state-of-the-art methods.

It can be seen from the above‐mentioned methods that the main idea of 2D view‐invariant methods is to find the identical gait parameters that are independent from the camera point of view or can be used to synthesize a lateral view with arbitrary viewpoint.

#### **2. 3D‐based gait recognition and dataset**

#### **2.1. 3D‐based gait recognition**


3D-based methods have an intrinsic superiority in robustness against view variation. Generally, multiple calibrated cameras or cameras with depth sensors are used in 3D-based methods, which is necessary to extract gait features with 3D information. Zhao et al. [20] propose to build a 3D skeleton model based on 10 joints and 24 degrees of freedom (DOF) captured by multiple cameras, and the 3D information provides robustness to changes of viewpoint, as shown in **Figure 4a**. Koichiro et al. [21] capture dense 3D range gait from a projector-camera system, which can be used to recognize individuals at different poses, as shown in **Figure 4b**. Krzeszowski et al. [22] build a system with four calibrated and synchronized cameras, estimate the 3D motion using the video sequences and recognize view-variant gaits based on marker-less 3D motion tracking, as shown in **Figure 4c**.

**Figure 4.** Examples of 3D‐based methods.

3D-based methods are usually better than 2D-based view-invariant approaches in not only the recognition accuracy but also the robustness against view changes. However, these methods have a high computational cost due to the calibration of multiple cameras and the fusion of multiple videos.

The Microsoft Kinect brought new strategies to the traditional 3D-based gait recognition methods because it is a consumer RGB-D (Depth) sensor, which can provide depth information easily. So far, there are two generations of Kinect, which are shown in **Figure 5a** and **b**. Sivapalan et al. [23] extend the concept of the GEI from 2D to 3D with the depth images captured by Kinect. They average the sequences of registered three-dimensional volumes over a complete gait cycle, which is called gait energy volume (GEV), as shown in **Figure 6**. In Ref. [24], the depth information, which is represented by 3D point clouds, is integrated into a silhouette-based gait recognition scheme.

Another characteristic of Kinect is that it can precisely estimate and track the 3D position of joints at each frame via machine learning technology. **Figure 7a** and **b** shows the differences of tracking points between the first and second generation of Kinect.

Araujo et al. [25] calculate the length of the body parts derived from joint points as the static anthropometric information, and use it for gait recognition. Milovanovic et al. [26] use the coordinates of all the joints captured by Kinect to generate an RGB image, combine such RGB images into a video to represent the walking sequence, and identify the gait based on the

**Figure 5.** The first and second generation kinects.

**Figure 6.** Gait energy volume (GEV).

spirit of content-based image retrieval (CBIR) technologies. Preis et al. [27] select 11 skeleton features captured by Kinect as the static feature, use the step length and speed as the dynamic feature and integrate both static and dynamic features for recognition. Yang et al. [28] propose a novel gait representation called relative distance-based gait features, which can preserve the periodic characteristic of gait compared with anthropometric features. Ahmed et al. [29] propose a gait signature using Kinect, in which a sequence of joint relative angles (JRAs) is calculated over a complete gait cycle. They also introduce a new dynamic time warping (DTW)-based kernel to compute the dissimilarity measure between the train and test samples with JRA sequences. Kastaniotis et al. [30] propose a framework for gait-based recognition using Kinect. The captured pose sequences are expressed as angular vectors (Euler angles) of eight selected limbs. Then the angular vectors are mapped into the dissimilarity space, resulting in a vector of dissimilarities. Finally, dissimilarity vectors of pose sequences are modelled via sparse representation.

#### **2.2. Dataset**


Gait datasets are important for gait recognition performance improvement and evaluation. There are lots of gait datasets in the current academic community and their purposes and characteristics

**Figure 7.** (a) The 20 joints tracked by first generation Kinect and (b) 25 joints tracked by second generation Kinect.

are different from each other. The differences among these datasets are mainly in the number of subjects, number of video sequences, covariate factors, viewpoints and environment (indoor or outdoor). Though the number of subjects in gait datasets is much smaller than that in the datasets of other biometrics (e.g. face, fingerprint, etc.), the current datasets can still satisfy the requirements of gait recognition method design and evaluation. Here, we give a brief introduction to several popular gait datasets. **Table 1** summarizes the information of these datasets.

SOTON Large Database [31] is a classical gait database containing 115 subjects, who are observed from the side view and an oblique view, and walk in several different environments, including indoor, treadmill and outdoor.

SOTON Temporal [32] contains the largest variations in elapsed time. The gait sequences are captured monthly during 1 year with controlled and uncontrolled clothing conditions. It is suitable for purely investigating the effect of elapsed time on gait recognition without regarding clothing conditions.

USF HumanID [14] is one of the most frequently used gait datasets. It contains 122 subjects, who walk along an ellipsoidal path outdoors, and covers a variety of covariates, including view, surface, shoes, bag and time elapse. This database is suitable for investigating the influence of each covariate on the gait recognition performance.

The CASIA gait database contains three sets, i.e. A, B and C. Set A, also known as NLPR, is composed of 20 subjects, and each subject contains 12 sequences, which include three walking directions, i.e. 0, 45 and 90°. Set B [33] contains large view variations from the front view to the rear view with an 18° interval. There are 10 sequences for each subject, which are six normal sequences, two sequences with a long coat and two sequences with a backpack. Set B is suitable for evaluating cross-view gait recognition. Set C contains the infrared gait data of 153 subjects captured by an infrared camera at night under 4 walking conditions, which are walking with normal speed, walking fast, walking slow and walking with a backpack.

OU-ISIR LP [34] contains the largest number of subjects, i.e. over 4000, with a wide age range from 1 year old to 94 years old and with an almost balanced gender ratio, although it does not contain any covariate. It is suitable for estimating a sort of upper bound accuracy of gait recognition with high statistical reliability. It is also suitable for evaluating gait-based age estimation.


| Name | Subjects | Sequences | Covariates | Viewpoints | In/Outdoor | Device |
|---|---|---|---|---|---|---|
| SOTON | 115 | 2128 | Yes | 2 | In/Outdoor | Camera (2D) |
| USF HumanID | 122 | 1870 | Yes | 2 | Outdoor | Camera (2D) |
| CASIA B | 124 | 1240 | Yes | 11 | Indoor | Camera (2D) |
| OU-ISIR LP | 4007 | 7842 | No | 2 | Indoor | Camera (2D) |
| TUM-GAID | 305 | 3370 | Yes | 1 | Outdoor | Multimedia |
| KinectREID | 71 | 483 | Yes | 3 | Indoor | Kinect |


**Table 1.** List of popular gait datasets.

TUM-GAID [35] is the first multi-modal gait database, which contains gait audio signals, RGB gait images and depth body images obtained by Kinect.

KinectREID [36] is a Kinect-based dataset that includes 483 video sequences of 71 individuals under different lighting conditions and 3 view directions (frontal, rear and lateral). Although the original motivation is person re-identification, all the video sequences are taken for each subject by using Kinect, so the dataset contains all the information Kinect provides and is convenient for other Kinect SDK-based applications.

According to the overview [37] of gait datasets, most datasets are based on 2D videos or on 3D motion data captured by a professional motion capture system, such as VICON. To the best of our knowledge, there are few gait datasets containing both 2D silhouette images and 3D joint position information. Such a dataset allows joint position-based methods, such as the method in Ref. [17], to directly use the joint positions captured by Kinect, which can exploit the advantages of both 2D- and 3D-based methods and improve recognition performance. Meanwhile, Kinect-based methods such as those in Refs. [25–28] will have a uniform platform on which to compare with each other. Therefore, a novel database based on Kinect is built, whose main characteristics are described below.


The reason we choose Kinect V2 is that Kinect V2 has comprehensive improvements over the first generation, such as a broader field of view, higher resolution of colour and depth images, and the ability to recognize more joints. The 3D data and 2D RGB images are recorded, as shown in **Figure 8**. The upper area in **Figure 8** shows the 3D position of 21 joints, which means each joint has a coordinate like (x, y, z) at each frame. We record all these original 3D position data at each frame during the whole walking cycle. The lower area shows the corresponding binary silhouette image sequence after subtracting the subject from the background.

**Figure 8.** Two kinds of data in our database: 3D position of 21 joints in the upper area and the corresponding binarized silhouette images in the lower area.

The experimental environment is shown in **Figure 9**. Two Kinects are located mutually perpendicular at a distance of 2.5 m to form the biggest visual field, i.e. walking area. Considering the angle of view, we put the two Kinects at 1 m height on tripods. The red dashed lines are the maximum and minimum depth that the Kinect can probe. The area enclosed by the black solid lines is the available walking area.

**Figure 9.** The top view of the experimental environment.

Before we record the data of each subject, we collect basic information, such as name, sex, age, height, wearing (e.g. high-heeled shoes, dress for female volunteers) and so on, for potential analysis and data mining. Each subject is asked to walk twice along the predefined directions shown as the arrows ①–⑤ in **Figure 9**; in particular, ⑤ means the subject walks in a straight line in an arbitrary direction. We can treat all the data as recorded by one Kinect since the two Kinects are the same, so each subject has 20 walking sequences, and the walking duration on each predefined direction is shown in **Figure 10**. The dataset can be accessed at https://sites.google.com/site/sdugait/ and can be downloaded upon application.

**Figure 10.** Walking directions and the corresponding walking duration.

#### **3. Kinect‐based gait recognition**


#### **3.1. The Kinect‐based gait recognition**

The gait features extracted from the Kinect-captured data contain static and dynamic features. In this part, we will first introduce how to extract the static and dynamic features and demonstrate the properties of these two kinds of features. Then we will show how to extract a walking period from the sequence. Finally, we fuse these two kinds of features for gait recognition.

A static feature is a kind of feature that barely changes during the whole walking process, such as the height, the lengths of skeleton segments and so on. Given the knowledge of anthropometry, a person can be recognized based on static body parameters to some extent. Here, we choose the lengths of some skeleton segments as the static features, including the lengths of the legs and arms. Considering the symmetry of the human body, the lengths of the limbs on both sides are usually treated as equal. The static feature is defined as an eight-dimensional vector, i.e. (d1, d2, d3, d4, d5, d6, d7, d8), where di is the space distance between Joint_1 and Joint_2 listed in **Table 2**. Here, the Euclidean distance is chosen to measure the space distance, referring to the research experiences in Refs. [37, 38].

We can acquire the 3D coordinates of the joints listed in **Table 2** in each frame and calculate each component of the static feature vector as

$$d_i = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2} \tag{1}$$

where (x1, y1, z1) and (x2, y2, z2) represent the 3D positions of the corresponding joints listed as Joint_1 and Joint_2, respectively.
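A minimal sketch of the static feature computation for a single frame is given below. It assumes the joint coordinates are available as a dictionary keyed by the joint names of Table 2; the joint pairing follows the table, while the data structure itself is only an assumption for illustration.

```python
import numpy as np

# Joint pairs (Joint_1, Joint_2) defining d1..d8, following Table 2.
JOINT_PAIRS = [
    ("HIP_RIGHT", "KNEE_RIGHT"),         # d1
    ("KNEE_RIGHT", "ANKLE_RIGHT"),       # d2
    ("SHOULDER_RIGHT", "ELBOW_RIGHT"),   # d3
    ("ELBOW_RIGHT", "WRIST_RIGHT"),      # d4
    ("SPINE_SHOULDER", "SPINE_BASE"),    # d5
    ("SHOULDER_RIGHT", "SHOULDER_LEFT"), # d6
    ("SPINE_SHOULDER", "NECK"),          # d7
    ("NECK", "HEAD"),                    # d8
]

def static_feature(joints):
    """Eight-dimensional static feature: Euclidean distances (Eq. 1) between joint pairs.

    joints: dict mapping joint name -> (x, y, z) in metres for one frame.
    """
    return np.array([np.linalg.norm(np.subtract(joints[j1], joints[j2]))
                     for j1, j2 in JOINT_PAIRS])
```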

When we evaluate the estimation of the joint positions obtained by Kinect, we find that the accuracy changes along with the depth range. Given the empirical results, we discover that more stable data can be acquired when the depth is between 1.8 and 3.0 m. Hence, we propose a strategy to automatically choose the frames in that range. We choose the depth of the HEAD joint to represent the depth of the whole body, because it can be detected stably and keeps monotonicity in the depth direction during walking. Then we set two depth thresholds, i.e. distances in the Z-direction, as the upper and lower boundaries, respectively. The frames between the two boundaries are regarded as the reliable frames:

$$\{f_a\} = \{H_f \mid H_{f,z>1.8} \cap H_{f,z<3.0}\} \tag{2}$$

where Hf denotes the frames of the HEAD joint, fa denotes the reliable frames and Hf,z represents the frame(s) obtained when the z-coordinate of the HEAD joint is z. We reserve the 3D coordinates of all the joints during the period when the reliable frames can be obtained. Finally, we calculate the length of each skeleton segment we need at each reliable frame, and take the average to obtain the components of the static feature vector.
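The reliable-frame selection of Eq. (2) reduces to a simple depth filter on the HEAD joint. The sketch below assumes the same per-frame joint dictionaries and the static_feature function of the previous sketch, and uses the empirical 1.8–3.0 m range; the final static vector is the per-component average over the frames that pass the filter.

```python
import numpy as np

def reliable_frames(frames, z_min=1.8, z_max=3.0):
    """Keep only frames whose HEAD joint depth (z) lies in the stable range (Eq. 2).

    frames: sequence of dicts mapping joint name -> (x, y, z), z being the depth in metres.
    """
    return [f for f in frames if z_min < f["HEAD"][2] < z_max]

def average_static_feature(frames):
    """Average the per-frame static features (previous sketch) over the reliable frames."""
    return np.mean([static_feature(f) for f in reliable_frames(frames)], axis=0)
```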

The subjects are required to walk along the same path seven times, which makes the subjects walk more naturally in the later trials. For each subject, the Kinect is rotated in steps of 5° from −15° to +15°, and the static feature vector on each direction is recorded.


| Component | Joint_1 | Joint_2 |
|---|---|---|
| d1 | HIP_RIGHT | KNEE_RIGHT |
| d2 | KNEE_RIGHT | ANKLE_RIGHT |
| d3 | SHOULDER_RIGHT | ELBOW_RIGHT |
| d4 | ELBOW_RIGHT | WRIST_RIGHT |
| d5 | SPINE_SHOULDER | SPINE_BASE |
| d6 | SHOULDER_RIGHT | SHOULDER_LEFT |
| d7 | SPINE_SHOULDER | NECK |
| d8 | NECK | HEAD |

**Table 2.** Components of the static feature vector.

These directions are denoted by n15, n10, n5, 0, p5, p10 and p15, where '0' denotes the frontal direction, and 'n' and 'p' denote anticlockwise and clockwise, respectively. In total, 10 volunteers are randomly selected to repeat this experiment, and all the results prove that the static feature we choose is robust to the view variation. We show an example in **Figure 13a**, in which each component of the static vectors on the seven directions and the average values of these vectors are plotted.

A dynamic feature is a kind of feature that changes along with time during walking, such as the speed, the stride, the variation of the barycentre, etc. According to many studies [5, 39, 40], the swing angles of the limbs during walking are remarkable dynamic gait features. For this reason, four groups of swing angles of the upper limbs, i.e. arm and forearm, and the lower limbs, i.e. thigh and crus, are defined as shown in **Figure 11**, and denoted as a1, …, a8. Here, a2 is taken as an example for illustration. The coordinate of KNEE_RIGHT is denoted as (x, y, z), and the coordinate of ANKLE_RIGHT is denoted as (x′, y′, z′), so a2 can be calculated as

$$\tan(a_2) = \frac{x - x'}{y - y'}, \qquad a_2 = \tan^{-1}\left(\frac{x - x'}{y - y'}\right) \tag{3}$$

Each dynamic angle can be regarded as an independent dynamic feature for recognition. Given the research results in Ref. [41] and our comparison experiments on these dynamic angles, the angle a2 on the right side or a4 on the left side is selected as the dynamic angle, according to the side nearer to the Kinect.
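Following Eq. (3), the swing angle of the right crus can be obtained from the KNEE_RIGHT and ANKLE_RIGHT positions, and the same computation applies to a4 on the left side. The sketch below assumes the per-frame joint dictionary used above; the left-side joint names are assumed analogues written in the style of Table 2.

```python
import numpy as np

def swing_angle(joints, upper_joint, lower_joint):
    """Angle (degrees) between the segment upper->lower and the vertical axis, as in Eq. (3)."""
    x, y, _ = joints[upper_joint]
    xp, yp, _ = joints[lower_joint]
    # arctan2 matches Eq. (3) when the upper joint lies above the lower one (y - yp > 0).
    return np.degrees(np.arctan2(x - xp, y - yp))

# a2: right crus (knee to ankle); a4: left crus, used when the left side faces the Kinect.
# a2 = swing_angle(frame, "KNEE_RIGHT", "ANKLE_RIGHT")
# a4 = swing_angle(frame, "KNEE_LEFT", "ANKLE_LEFT")
```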

The values of a2 and a4 at each frame can be calculated, and the whole walking process can be described, as shown in **Figure 12**. We carried out verification experiments similar to those for the static feature to prove its robustness against view variation; the result shown in **Figure 13b** indicates that the proposed dynamic feature is also robust to the view variation.

Gait period extraction is an important step in gait analysis, because gait is a periodic feature and the majority of features can be captured within one period. Silhouette-based methods usually analyse the variation of the silhouette width along with time to obtain the period information. Some methods apply signal processing to analyse the dynamic feature for period extraction, such as peak detection and Fourier transform. Different from them, we propose to extract the periodicity by combining the data of the left limb and the right limb together, as shown in **Figure 12**, where the a2 and a4 sequences represent the right and left signals, respectively.

It can be concluded that the crossing points between left and right signals can segment the gait period appropriately. We use the crossing point between the left and right signals to extract

**Figure 11.** Side view of the walking model.


**Figure 12.** Period extracted based on dynamic features.

**Figure 13.** (a) Static feature and (b) dynamic feature of one subject on seven directions.

the gait period. After cutting off the noisy part at the beginning of the signals, we subtract the left signal from the right one, take the crossing points as zero points and extract the period between two adjacent zero points. The black dashed lines show the detected period.
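Assuming the a2 and a4 angle sequences have already been computed frame by frame as above, a minimal sketch of this period segmentation is given below: the difference of the two signals changes sign at each crossing point, and the sequence is cut at those zero crossings. Smoothing and the removal of the noisy initial part are omitted.

```python
import numpy as np

def gait_periods(a2, a4):
    """Segment the sequence at the crossing points of the right (a2) and left (a4) signals.

    a2, a4: 1-D arrays of swing angles per frame.
    Returns (start, end) frame-index pairs delimited by consecutive zero crossings.
    """
    diff = np.asarray(a2, dtype=float) - np.asarray(a4, dtype=float)
    # A crossing lies where the sign of the difference changes between adjacent frames.
    crossings = np.where(np.sign(diff[:-1]) != np.sign(diff[1:]))[0]
    return list(zip(crossings[:-1], crossings[1:]))
```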

The static and dynamic features have their own advantages and disadvantages. These two kinds of features are fused at the score level. The two different kinds of matching scores are normalized onto the closed interval [0, 1] by linear normalization:


$$\hat{\mathbf{s}} = \frac{\mathbf{s} - \min(\mathbf{S})}{\max(\mathbf{S}) - \min(\mathbf{S})} \tag{4}$$

where **S** is the matrix before normalization, whose component s represents the score, and **Ŝ** is the normalized matrix, whose component is ŝ. The two kinds of features are weighted and fused as

$$F = \sum_{i=1}^{R} \omega_i \hat{s}_i, \qquad \omega_i = \frac{C_i}{\sum_{j=1}^{R} C_j} \tag{5}$$

where F is the score after fusion, R is the number of features used for fusion, ωi is the weight of the ith classifier, ŝi is the score of the ith classifier (here, our distance), and Ci is the CCR (correct classification rate) of the ith feature used to recognize separately, so the weight can be set according to the level of CCR.
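The score-level fusion of Eqs. (4) and (5) can be sketched as follows: each feature (static and dynamic) produces one matching distance per gallery subject, the distances are min-max normalized, and the weights are proportional to each feature's stand-alone CCR. All numbers below are placeholders; since the scores are distances, the identity with the smallest fused score would be selected.

```python
import numpy as np

def min_max_normalize(scores):
    """Eq. (4): map raw matching scores onto the interval [0, 1]."""
    s = np.asarray(scores, dtype=np.float64)
    return (s - s.min()) / (s.max() - s.min())

def fuse(score_lists, ccrs):
    """Eq. (5): weighted sum of normalized scores, weights proportional to each feature's CCR."""
    w = np.asarray(ccrs, dtype=np.float64)
    w /= w.sum()
    normalized = np.array([min_max_normalize(s) for s in score_lists])
    return w @ normalized

# Placeholder distances of one probe to four gallery subjects for each feature.
static_scores = [0.12, 0.45, 0.30, 0.80]
dynamic_scores = [5.1, 9.7, 4.2, 12.3]
fused = fuse([static_scores, dynamic_scores], ccrs=[0.70, 0.60])  # placeholder CCRs
best_match = int(np.argmin(fused))  # distances: the smallest fused score wins
```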

#### **3.2. Comparisons**

The cross-view recognition abilities of the static feature, the dynamic feature and their fusion are analysed. Four sequences at 180° are used as the training data since both body sides of the subjects can be recorded. The sequences in the other directions are used as the testing data. Because the sequences acquired on the side nearer to the Kinect are more accurate, the data of the nearer body side are selected automatically for the calculation at each direction.

The static feature is extracted from the right body side at 0, 225 and 270°, and from the left body side at 90 and 135°. Due to the symmetry, the skeleton lengths on the two sides of the body are regarded as equal. The static feature is calculated as in Eq. (1), and an NN classifier is used for recognition. The results are shown in the first row of **Table 3**.
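For completeness, the nearest-neighbour (NN) classification used here can be sketched as a 1-NN search over one averaged static-feature template per training subject; the Euclidean metric and the dictionary-based gallery are assumptions for illustration.

```python
import numpy as np

def nn_classify(probe, gallery):
    """1-NN classification: return the label of the gallery template closest to the probe.

    probe: feature vector. gallery: dict mapping subject label -> template vector.
    """
    probe = np.asarray(probe, dtype=float)
    return min(gallery, key=lambda label: np.linalg.norm(probe - np.asarray(gallery[label])))
```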

The dynamic feature a2 is calculated from the right side of the limb at 0, 225 and 270°, and the dynamic feature a4 is calculated from the left side of the limb at 90 and 135°. As we can extract both a2 and a4 in the direction of 180°, either of them can be used as the dynamic feature on the training set. The results are shown in the second row of **Table 3**.

The static feature and the dynamic feature are fused at the score level as discussed before, and the results after feature fusion are tested under view variation. Given the CCRs of the dynamic feature and the static feature obtained from the different directions, we redistribute the weights, get the final score for the different subjects and use the NN classifier to get the final recognition results, as shown in the third row of **Table 3**. The comparison in **Table 3** shows that the feature fusion improves the recognition rate on each direction.

Preis et al. [27] proposed a Kinect-based gait recognition method, in which 11 limb lengths are extracted as the static feature, and the step length and speed are taken as the dynamic feature. Their method was tested on their own dataset of nine persons and the highest CCR reaches 91%. The gait feature they proposed is also based on 3D joint positions, so it is possible to rebuild their method on our database. In this chapter, we rebuilt their method, tested it on our database with 52 persons and made a comparison with our proposed method. As their dataset only includes frontal walking sequences, we compare the two methods on our database only in the 180° (frontal) direction. We randomly choose three sequences in the 180° direction as training data and the rest are treated as testing data. The CCR results of both methods are shown in **Table 4**. Our proposed method has about 10% accuracy improvement.


The proposed method is also evaluated on another Kinect-based gait dataset, i.e. the KinectREID dataset in Ref. [36]. Four recognition rate curves are shown in **Figure 14**, which are front_VS_front, rear_VS_rear, front_VS_rear and front_VS_lateral, because there are only three directions in the


**Table 3.** CCR (%) results of the static feature, dynamic feature, and feature fusion on each walking direction.


| Method | CCR |
|---|---|
| Method in [27] | 82.7% |
| Our method | 92.3% |

**Table 4.** Comparison on CCR between the proposed method and the method in [27].

KinectREID dataset, i.e. front, rear and lateral. It can be seen from **Figure 14** that the cross-view recognition rate of the proposed method is only slightly worse than that on the same directions, which demonstrates the robustness of the proposed method against view variation, though the recognition rate decreases as the number of test subjects increases.

Given the experimental results discussed above, we can say that the static relations and the dynamic moving relations among joints are very important features that can represent the characteristics of gait. In many 2D-based methods, researchers also tried to obtain the relations among joints, but the positions of the joints have to be estimated from the 2D video with all kinds of strategies in advance. Goffredo et al. proposed a view-invariant gait recognition method in Ref. [17]. They only make use of 2D videos obtained by one single camera. After extracting the walking silhouette from the background, they estimate the positions of the joints according to the geometrical characteristics of the silhouette and calculate the angle between the shins and the vertical axis and the angle between the thighs and the vertical axis as the dynamic feature, and finally make a projection transformation to project these features into the sagittal plane using their viewpoint rectification algorithm. Actually, Goffredo's

**Figure 14.** Gait recognition performance on KinectREID dataset.

method has a lot of similar gait features comparing with our method in logically. As we mentioned before, our database not only have the 3D position data but also the 2D silhou‐ ette images at each frame. Take advantage of our database, we can rebuild their method using the 2D silhouette image sequences; meanwhile, we use the 3D joint position data of the same person. We compare this method with our method with the varying views on three directions. The comparison results in **Table 5** show that our proposed method has 14–19% accuracy improvement.

#### **3.3. Applications**


Gait research is still at an exploratory stage rather than a commercial application stage. However, we are confident that gait analysis is promising given its recent development. The unique characteristics of gait, such as being unobtrusive, contactless and non-invasive, give it great potential in scenarios including criminal investigation, access security and surveillance. For example, face recognition becomes unreliable when the distance between the subject and the camera is large. Fingerprint and iris recognition have proved to be more robust, but they can only be captured with contact or near-contact equipment.

For instance, gait biometrics has already been used as evidence in forensics [42]. In 2004, a perpetrator robbed a bank in Denmark. The Institute of Forensic Medicine in Copenhagen (IFMC) was asked to confirm the identity of the perpetrator via gait analysis, as the perpetrator was thought to have a unique gait. The IFMC instructed the police to make a covert recording of the suspect from the same angles as the surveillance recordings for comparison. The gait analysis revealed several characteristic matches between the perpetrator and the suspect, as shown in **Figure 15**: both the perpetrator (left) and the suspect (right) show an inverted left ankle (angle b) during the left leg's stance phase, as well as markedly outward-rotated feet. The suspect was convicted of robbery, and the court found gait analysis to be a very valuable tool.

Another similar example is the intelligent airport, where Kinect-based gait recognition is used during the security check. Chattopadhyay et al. [43] established a frontal gait recognition system using an RGB-D camera (Kinect) for a typical airport security checkpoint scenario, as shown in **Figure 16a**. In their further work [44], they addressed the occlusion problem in frontal gait recognition by combining two Kinects, as demonstrated in **Figure 16b**.

In addition, gait analysis plays an important role in medical diagnosis and rehabilitation. For example, the assessment of gait abnormalities in individuals affected by Parkinson's disease (PD) is essential to determine the disease progression and the effectiveness of pharmacologic and rehabilitative treatments. Corona et al. [45] investigated the spatio-temporal and kinematic parameters of gait in a large group of elderly individuals affected by PD and in healthy subjects, which can help clinicians to detect and diagnose Parkinson's disease.

**Figure 15.** Bank robbery identification.

**Figure 16.** Gait-based airport security check system with (a) single and (b) double Kinects.

#### **Author details**

Jiande Sun<sup>1,2</sup>\*, Yufei Wang<sup>1</sup> and Jing Li<sup>3</sup>

\*Address all correspondence to: jiandesun@hotmail.com

1 School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong Province, China

2 Institute of Data Science and Technology, Shandong Normal University, Jinan, Shandong Province, China

3 School of Mechanical and Electrical Engineering, Shandong Management University, Jinan, Shandong Province, China

#### **References**



[13] Wang L, Tan T, Ning H, Hu W. Silhouette analysis-based gait recognition for human identification. IEEE Transactions on Pattern Analysis & Machine Intelligence. 2003;**25**(12):1505-1518

[14] Sarkar S, Phillips PJ, Liu Z, Robledo Vega I, Grother P, Bowyer KW. The human ID gait challenge problem: Data sets, performance, and analysis. IEEE Transactions on Pattern Analysis & Machine Intelligence. 2005;**27**(2):162-177

[15] Abdelkader CB, Davis L, Cutler R. Motion-based recognition of people in eigengait space. In: IEEE International Conference on Automatic Face and Gesture Recognition (FG 2002); 20-21 May 2002; Washington, DC, USA. New York: IEEE; 2002. pp. 267-272

[16] Kale A, Chowdhury AKR, Chellappa R. Towards a view invariant gait recognition algorithm. In: IEEE Conference on Advanced Video & Signal Based Surveillance (AVSS 2003); 21-22 July 2003; Miami, FL, USA. New York: IEEE; 2003. pp. 143-150

[17] Goffredo M, Bouchrika I, Carter JN, Nixon MS. Self-calibrating view-invariant gait biometrics. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2010;**40**(4):997-1008

[18] Muramatsu D, Shiraishi A, Makihara Y, Uddin MZ, Yagi Y. Gait-based person recognition using arbitrary view transformation model. IEEE Transactions on Image Processing. 2015;**24**(1):140-154

[19] Wu Z, Huang Y, Wang L, Wang X, Tan T. A comprehensive study on cross-view gait based human identification with deep CNNs. IEEE Transactions on Pattern Analysis & Machine Intelligence. 2017;**39**(2):209-226

[20] Zhao G, Liu G, Li H, Pietikainen M. 3D gait recognition using multiple cameras. In: International Conference on Automatic Face and Gesture Recognition (FG 2006); 10-12 April 2006; Southampton, UK. New York: IEEE; 2006. pp. 529-534

[21] Yamauchi K, Bhanu B, Saito H. Recognition of walking humans in 3D: Initial results. In: IEEE Computer Society Conference on Computer Vision & Pattern Recognition Workshops (CVPR 2009); 20-25 June 2009; Miami, Florida, USA. New York: IEEE; 2009. pp. 45-52

[22] Krzeszowski T, Michalczuk A, Kwolek B, Switonski A, Josinski H. Gait recognition based on marker-less 3D motion capture. In: IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS 2013); 27-30 August 2013; Krakow, Poland. New York: IEEE; 2013. pp. 232-237

[23] Sivapalan S, Chen D, Denman S, Sridharan S, Fookes C. Gait energy volumes and frontal gait recognition using depth images. In: International Joint Conference on Biometrics (IJCB 2011); 11-13 October 2011; Washington, DC, USA. New York: IEEE; 2011. pp. 1-6

[24] Nambiar AM, Correia PL, Soares LD. Frontal gait recognition combining 2D and 3D data. In: Proceedings of the ACM Workshop on Multimedia and Security; 6-7 September 2012; Coventry, United Kingdom. New York: ACM; 2012. pp. 145-150

[38] Andersson VO, Araujo RM. Person identification using anthropometric and gait data from Kinect sensor. In: AAAI Conference on Artificial Intelligence; 25-30 January 2015; Austin, Texas, USA. Menlo Park, CA: AAAI Press; 2015. pp. 425-431

[39] Dan IS, Toth-Tascau M. Influence of treadmill velocity on joint angles of lower limbs during human gait. In: E-Health and Bioengineering Conference (EHB 2011); 24-26 November 2011; Iasi, Romania. New York: IEEE; 2011. pp. 1-4

[40] Yang SXM, Larsen PK, Alkjær T, Simonsen EB, Lynnerup N. Variability and similarity of gait as evaluated by joint angles: Implications for forensic gait analysis. Journal of Forensic Sciences. 2014;**59**(2):494-504

[41] Pfister A, West AM, Bronner S, Noah JA. Comparative abilities of Microsoft Kinect and Vicon 3D motion capture for gait analysis. Journal of Medical Engineering & Technology. 2014;**38**(5):274-280

[42] Bouchrika I, Goffredo M, Carter J, Nixon MS. On using gait in forensic biometrics. Journal of Forensic Sciences. 2011;**56**(4):882-889

[43] Chattopadhyay P, Sural S, Mukherjee J. Frontal gait recognition from incomplete sequences using RGB-D camera. IEEE Transactions on Information Forensics & Security. 2014;**9**(11):1843-1856

[44] Chattopadhyay P, Sural S, Mukherjee J. Frontal gait recognition from occluded scenes. Pattern Recognition Letters. 2015;**63**:9-15

[45] Corona F, Pau M, Guicciardi M, Murgia M, Pili R, Casula C. Quantitative assessment of gait in elderly people affected by Parkinson's disease. In: IEEE International Symposium on Medical Measurements and Applications (MeMeA 2016); 15-18 May 2016; Benevento, Italy. New York: IEEE; 2016. pp. 1-6
