We propose in this chapter a motion analysis and classification approach to learn and recognize human actions in video, taking advantage of the robustness of STIPs and of unsupervised learning approaches. Experimental results are validated on the KTH human action database (Schuldt et al., 2004) and on the ATSI human action database (see Figure 1), and compared to recent work on human motion analysis and recognition.

Fig. 1. Samples from the KTH human action database

**2. Spatio-temporal interest points**

**2.1 Presentation**

Interest points in a bitmap image are defined as pixels with maximum variations of intensity in their local neighbourhood. Such pixels correspond to corners, intersections, isolated points and specific points of the image texture. The definition carries over to video: spatio-temporal interest points (STIPs) are pixels with significant changes in both space and time. STIPs typically capture irregular movements of the human body, such as bending elbows or knees and moving limbs, whereas a uniform movement, such as the translation of a rigid object, does not generate any STIP. A video sequence is represented as a 3D function over two spatial dimensions (x, y) and one temporal dimension t. Several detectors can be used, such as the Laptev et al. detector (Laptev & Lindeberg, 2004), the Dollár et al. detector (Dollár et al., 2005), the FAST-3D detector (Koelstra et al., 2009) and the Oikonomopoulos et al. detector (Oikonomopoulos et al., 2006).

**2.2 Laptev et al. detector**

The Laptev et al. approach (Laptev & Lindeberg, 2004) is based on the Harris operator (Harris & Stephens, 1988), which had shown good performance for interest point detection in static images. Extending the operator to the spatio-temporal domain makes the detection of spatio-temporal interest points possible. The extension consists in searching for points that maximize the local variation of image values simultaneously over the spatial dimensions and the temporal dimension. According to Laptev et al., a video sequence can be represented as a function f : R² × R → R over two spatial dimensions (x, y) and one temporal dimension t. Local space-time features are then defined as 3D blocks of the sequence containing variations in both space and time.

The scale-space representation L : R² × R × R₊² → R is generated by the convolution of f with a separable Gaussian kernel g(p; Σ) (1), where p = (x, y, t)ᵀ is the spatio-temporal position vector and the parameters σ² and τ² of the covariance matrix Σ are the spatial and temporal scale parameters, respectively; they define the spatio-temporal extent of the neighbourhoods.

$$g(\mathbf{p};\Sigma) = \frac{1}{\sqrt{(2\pi)^3 \det(\Sigma)}}\, e^{-\frac{\mathbf{p}^T \Sigma^{-1} \mathbf{p}}{2}} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \tau^2 \end{pmatrix} \tag{1}$$
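In practice, the separable kernel of (1) can be applied with independent per-axis standard deviations. A minimal Python/SciPy sketch (the array layout and the scale values are our assumptions, not prescribed by the chapter):

```python
# Sketch of the scale-space representation L = g(.; sigma^2, tau^2) * f of Eq. (1).
# Assumes the video is stored as a NumPy array of shape (T, H, W).
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(f, sigma=2.0, tau=1.5):
    """Convolve the video f with a separable Gaussian: standard deviation
    sigma on the spatial axes (y, x) and tau on the temporal axis t."""
    # axis order is (t, y, x), so the per-axis standard deviations are (tau, sigma, sigma)
    return gaussian_filter(f.astype(np.float64), sigma=(tau, sigma, sigma))

video = np.random.rand(20, 64, 64)   # synthetic 20-frame sequence
L = scale_space(video)
print(L.shape)                       # (20, 64, 64)
```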

A spatiotemporal second-moment matrix (2) is defined in terms of spatiotemporal gradients and weighted with a Gaussian window function.

$$\mu(\cdot;\Sigma) = g(\cdot;\Sigma) * \left( \nabla L(\cdot;\Sigma)\,(\nabla L(\cdot;\Sigma))^T \right) = g(\cdot;\Sigma) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix} \tag{2}$$

The spatiotemporal second-moment matrix μ, which can also be seen as a structure tensor, is interpreted in terms of its eigenvalues. This makes it possible to distinguish image structures that vary over one, two or three dimensions. A three-dimensional variation of f corresponds to image points with non-constant motion. Such points can be detected by maximizing the three eigenvalues λ1, λ2, λ3 of μ over space and time.

STIP detection is realized by extending the Harris operator H to the spatiotemporal domain (3); it selects points with high eigenvalues.

$$H = \det(\mu) - k \cdot \operatorname{trace}^3(\mu) = \lambda_1 \lambda_2 \lambda_3 - k\,(\lambda_1 + \lambda_2 + \lambda_3)^3 \tag{3}$$

Local maxima of H correspond to points with high values of λ1, λ2, λ3 (λ1 ≤ λ2 ≤ λ3). H can be rewritten as (4), where α = λ2/λ1 and β = λ3/λ1.

$$H = \lambda_1^3 \left( \alpha\beta - k\,(1 + \alpha + \beta)^3 \right) \tag{4}$$

From the requirement H ≥ 0, we get the condition represented by (5).

$$k \le \frac{\alpha\beta}{(1 + \alpha + \beta)^3} \tag{5}$$

Non-Rigid Objects Recognition: Automatic Human Action Recognition in Video Sequences 159


It follows that for perfectly isotropic image structures (α = β = 1), k assumes its maximum possible value kmax = 1/27. For sufficiently large values of k ≤ kmax, positive local maxima of H correspond to space-time points with similar eigenvalues λ1, λ2, λ3. Such points indicate locations of image structures with high spatiotemporal variation and can be taken as positions of local spatiotemporal features. Since k in (3) only controls the local shape of the image structures and not their amplitude, the detection method is invariant with respect to affine variations of image brightness.
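The detection criterion of Eqs. (1)–(3) can be sketched in Python with NumPy/SciPy as follows; the scale values, the integration-scale factor s and the value of k are illustrative assumptions (condition (5) only requires k ≤ 1/27):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d(f, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Sketch of the spatio-temporal Harris criterion H of Eqs. (2)-(3).
    f is a video array of shape (T, H, W); parameter values are illustrative."""
    f = f.astype(np.float64)
    # scale-space representation L (Eq. 1), axes ordered (t, y, x)
    L = gaussian_filter(f, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # second-moment matrix mu (Eq. 2): Gaussian-weighted gradient products,
    # integration scale s times the differentiation scale
    w = dict(sigma=(s * tau, s * sigma, s * sigma))
    A = gaussian_filter(Lx * Lx, **w)
    B = gaussian_filter(Ly * Ly, **w)
    C = gaussian_filter(Lt * Lt, **w)
    D = gaussian_filter(Lx * Ly, **w)
    E = gaussian_filter(Lx * Lt, **w)
    F = gaussian_filter(Ly * Lt, **w)
    # det and trace of the symmetric 3x3 matrix [[A, D, E], [D, B, F], [E, F, C]]
    det = A * (B * C - F * F) - D * (D * C - F * E) + E * (D * F - B * E)
    trace = A + B + C
    # extended Harris measure (Eq. 3); condition (5) requires k <= 1/27
    return det - k * trace ** 3

video = np.random.rand(20, 32, 32)
H = harris3d(video)
print(H.shape)  # (20, 32, 32)
```

Candidate STIPs would then be taken as the positive local maxima of H over space and time.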

**2.3 Dollár et al. detector**

Compared to the Laptev detector, the Dollár et al. detector (Dollár et al., 2005) produces denser features, which can significantly improve recognition performance in most cases. It uses two separate filters in the spatial and temporal directions: a 2-D Gaussian filter for the spatial components and a 1-D Gabor filter for the temporal component.

A response function of the form (6) is obtained, where g is the 2D Gaussian kernel applied along the spatial dimensions of the video and hev (7) and hod (8) are a quadrature pair of 1D Gabor filters applied in the temporal dimension.

$$R = (I * g * h_{ev})^2 + (I * g * h_{od})^2 \tag{6}$$

$$h_{ev}(t;\tau,\omega) = -\cos(2\pi t\omega)\, e^{-t^2/\tau^2} \tag{7}$$

$$h_{od}(t;\tau,\omega) = -\sin(2\pi t\omega)\, e^{-t^2/\tau^2} \tag{8}$$

The detector responds best to complex motions made by regions that are distinguishable spatially, including spatio-temporal corners, but not to pure translational motion or motions involving areas that are not distinct in space. Local maxima of the response function R are selected as interest points, and cuboids are extracted, which are the windowed pixel values around the interest point in the spatial and temporal dimensions.
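Under these definitions, the response function of Eqs. (6)–(8) can be sketched as follows; the coupling ω = 4/τ follows Dollár et al., while the other parameter values and the filter support are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def dollar_response(I, sigma=2.0, tau=3.0):
    """Response R of Eq. (6): per-frame 2-D Gaussian smoothing followed by a
    quadrature pair of 1-D temporal Gabor filters (Eqs. 7-8).
    I is a video array of shape (T, H, W)."""
    I = I.astype(np.float64)
    # 2-D Gaussian applied along the spatial axes only (axis 0 is time)
    S = gaussian_filter(I, sigma=(0.0, sigma, sigma))
    omega = 4.0 / tau                     # coupling used by Dollar et al. (2005)
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    # quadrature pair applied along the temporal dimension (Eq. 6)
    return (convolve1d(S, h_ev, axis=0) ** 2 +
            convolve1d(S, h_od, axis=0) ** 2)

video = np.random.rand(24, 32, 32)
R = dollar_response(video)
```

Interest points are then the local maxima of R, around which cuboids are extracted.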

**2.4 The FAST-3D detector**

The FAST-3D spatio-temporal detector, developed by Koelstra et al. (2009), is inspired by the FAST (Features from Accelerated Segment Test) detector. Instead of using a circle around each pixel (x, y, t), Koelstra et al. consider the set C of the 26 pixels directly neighbouring (x, y, t) in a 3D space-time neighbourhood. STIP detection remains correct even when videos are transformed by zoom, rotation or MPEG compression.
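The neighbourhood test can be sketched as follows; since the chapter only specifies the 26-neighbour set C, the threshold and minimum count used here are purely illustrative assumptions and do not reproduce the exact segment test of Koelstra et al.:

```python
import numpy as np

def fast3d_candidates(V, thresh=25, n_min=18):
    """Hedged sketch of a FAST-3D-style test: a pixel (x, y, t) is a candidate
    STIP when at least n_min of its 26 direct space-time neighbours differ
    from it by more than thresh. thresh and n_min are assumptions; the real
    detector of Koelstra et al. (2009) uses a more elaborate segment test."""
    V = V.astype(np.int32)
    count = np.zeros_like(V)
    for dt in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dt == dy == dx == 0:
                    continue
                # bring each of the 26 neighbours into alignment with V
                shifted = np.roll(V, shift=(dt, dy, dx), axis=(0, 1, 2))
                count += (np.abs(shifted - V) > thresh)
    return count >= n_min

V = np.zeros((5, 7, 7))
V[2, 3, 3] = 255          # a single bright space-time voxel
mask = fast3d_candidates(V)
```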

**2.5 Laptev detector implementation**

The algorithm was applied to different types of video sequences to detect STIPs. It is run through two executable files, "stipdet.exe" and "stipshow.exe": the first implements the STIP detection algorithm and the second displays the detected STIPs on the sequences.

158 Advances in Object Recognition Systems


The first program generates a text file with the space-time coordinates (x, y, t) of the detected points. The second program displays the detected STIPs on the images of the video sequence. Video sequences are processed in Matlab as a single variable: a three-dimensional tensor properly represents a video sequence. Figure 2 shows the STIPs detected in several video frames sampled from the KTH human action database. The three components are x (height), y (width) and t (time axis). This representation makes the search for STIP neighbourhoods in the space-time domain possible.
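As an illustration of this representation, the following sketch stores a grey-level video as a 3-D NumPy tensor indexed by (t, y, x) and extracts a space-time neighbourhood (cuboid) around a detected STIP; the radii are illustrative assumptions:

```python
import numpy as np

def extract_cuboid(V, t, y, x, rt=2, rs=4):
    """Return the (2*rt+1, 2*rs+1, 2*rs+1) space-time block centred on the
    STIP (t, y, x); the temporal radius rt and spatial radius rs are
    illustrative choices (assumes the point is far enough from the borders)."""
    return V[t - rt:t + rt + 1, y - rs:y + rs + 1, x - rs:x + rs + 1]

# synthetic grey-level video: 30 frames of 120x160 pixels
video = np.random.randint(0, 256, size=(30, 120, 160), dtype=np.uint8)
cuboid = extract_cuboid(video, t=10, y=60, x=80)
print(cuboid.shape)  # (5, 9, 9)
```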

Fig. 2. Detected STIPs in different samples from the KTH human action database

Among the STIPs detected in a video sequence there is usually motion noise, coming from the non-uniform background, that does not contribute to the action. Such points make the modelling computation much harder and in some cases may completely obscure the core parts of the action. In order to filter out these irrelevant elements, we consider only the STIPs that coincide with the dilated shape of the human body.
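This filtering step can be sketched as follows, assuming a boolean foreground mask of the body is available (how the mask is obtained, and the amount of dilation, are assumptions of this sketch):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def filter_stips(stips, body_mask, dilation_iter=3):
    """Keep only the STIPs that fall inside a dilated human-body silhouette.
    stips is an (N, 3) integer array of (t, y, x) coordinates; body_mask is a
    boolean (T, H, W) foreground mask. The mask source and the dilation size
    are assumptions."""
    dilated = binary_dilation(body_mask, iterations=dilation_iter)
    keep = dilated[stips[:, 0], stips[:, 1], stips[:, 2]]
    return stips[keep]

mask = np.zeros((5, 10, 10), dtype=bool)
mask[2, 4:6, 4:6] = True                      # toy body silhouette
stips = np.array([[2, 4, 4],                  # on the body: kept
                  [0, 0, 0]])                 # background noise: discarded
kept = filter_stips(stips, mask)
```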

The tensor elements contain the grey-level values of the pixels in each frame of the video sequence. The criterion developed by Laptev et al. is applied to these tensors: the detected STIPs are the pixels with maximum values in their local neighbourhoods, obtained by maximizing the criterion H. Figure 3 shows the structure of the tensor with its three axes.

Fig. 3. Reference axes (x, y, t) representation on a video sequence from the KTH human action database

The STIPs detected by the Laptev algorithm have interesting properties, including their stability under geometric transformations. Other robustness properties of the STIPs can be determined; they concern degradations of the video sequences such as impulse noise, contrast changes, quick camera movements and MPEG compression effects. Several studies have been conducted in this area. Lejeune-Simac et al. (Lejeune-Simac et al., 2010) present a comprehensive study of the robustness of the STIP detector to various noise effects in video sequences.

These statistics show that the number of STIPs depends directly on the movement performed. Indeed, running and jumping movements yield a high number of STIPs, whereas boxing and hand waving yield a low number. We therefore conclude that the number of STIPs in a sequence is an important parameter for the recognition of human movements. To deepen this study, we present in the following section the evolution of the STIPs over time through the "Activity" function.

**3.2 Activity function**

The evolution of the number of STIPs in a sequence is an important factor in human motion recognition. To synthesize this criterion we use the "Activity" function. This function was defined by Laganière et al. (Laganière et al., 2008) as the number of pixels modified between two consecutive frames of a video sequence; frames corresponding to local maxima of the "Activity" function are thus the scenes of major movements. We adapted the "Activity" function to our needs and redefined it as the number of STIPs in each frame of the sequence. The evolution of this number can help recognize the type of movement performed: its local maxima are the locations of large amounts of movement, and their distribution indicates the positions of these quantities on the time scale. In Figure 4, we present the Activity function applied to sample sequences from the KTH human action database.

Fig. 4. Application of the Activity function on samples from the KTH human action database.

The curves in Figure 4 show repetitive peaks. These peaks are local maxima of the Activity function and can be regarded as major movement events in each class. From this analysis we can extract important information about the class of the movement performed. The curves obtained are, however, quite noisy; this is caused by non-significant STIPs detected between the local maxima. To resolve this problem, we applied a smoothing algorithm to the curves to accentuate the peaks and eliminate the STIP values between the local maxima. The smoothing was done on segments of frames by adding the STIPs detected
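The per-frame STIP count and its segment-wise smoothing can be sketched as follows; the segment length is an illustrative assumption:

```python
import numpy as np

def activity(stip_frames, n_frames, seg=5):
    """Activity function as redefined in the text: the number of detected
    STIPs in each frame of the sequence, then smoothed over fixed segments of
    frames by summing the counts (the segment length seg is an assumption)."""
    act = np.bincount(stip_frames, minlength=n_frames).astype(float)
    # segment-wise smoothing: add the STIPs detected over blocks of seg frames
    pad = (-len(act)) % seg
    act = np.pad(act, (0, pad))
    return act.reshape(-1, seg).sum(axis=1)

frames = np.array([0, 0, 1, 5, 5, 5, 6, 12])  # frame index of each detected STIP
print(activity(frames, n_frames=15, seg=5))    # [3. 4. 1.]
```

The local maxima of the smoothed curve then mark the frames with the largest amounts of movement.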
