**1. Introduction**


Non-rigid object recognition is an important problem in video analysis and understanding. It is nevertheless a challenging task, owing to the deformable nature of non-rigid objects, and it is further complicated by camera motion and background variation. Human body recognition in video sequences is a prime application of non-rigid object recognition, given the wide range of actions and poses the human body can perform. These difficulties prohibit practical attempts to build a robust global model for each action class. Human body recognition is highly interesting for a variety of applications, such as detecting relevant activities in surveillance video and summarizing and indexing video sequences. It relies, however, on interpreting body movements and classifying them into different event categories.

A considerable amount of previous work has addressed the question of human action categorization and motion analysis. One line of work is based on computing correlations between volumes of video data (Efros et al., 2003). Another popular approach first tracks body parts and then uses the resulting motion trajectories to perform action recognition (Ramanan & Forsyth, 2004); the robustness of this approach is highly dependent on the tracking system. Alternatively, researchers have analysed human actions by treating video sequences as space-time intensity volumes (Bobick & Davis, 2001). Others have explored unsupervised methods for motion analysis, such as hierarchical dynamic Bayesian network models (Hoey, 2001; Zhong et al., 2004). Yet another approach uses a video representation based on spatiotemporal interest points (STIPs). Despite the fairly large variety of methods for extracting interest points (IPs) from static images, such as the Harris corner detector (Harris & Stephens, 1988), the scale-invariant feature transform (Lowe, 1999) and salient regions (Kadir & Brady, 2003), less work has been done on STIP detection in videos. Laptev (2005) presented a STIP detector based on the idea of the Harris IP operator: it detects local structures in space-time where the image values have significant local variations in both the space and time dimensions. IPs extracted with such methods have been used as features for human action classification. These points are particularly interesting because they concentrate the information contained in an image into a few specific points. Integrating the time component makes it possible to filter the IPs and keep only those that also exhibit a temporal discontinuity.

Non-Rigid Objects Recognition: Automatic Human Action Recognition in Video Sequences 157

Local space-time features are defined as 3D blocks of the sequence containing variations in space and time. The image sequence is modelled as a function $f: \mathbb{R}^2 \times \mathbb{R} \rightarrow \mathbb{R}$ over two spatial dimensions $(x, y)$ and one temporal dimension $t$.

The scale-space representation $L: \mathbb{R}^2 \times \mathbb{R} \times \mathbb{R}_+^2 \rightarrow \mathbb{R}$ is generated by the convolution of $f$ with a separable Gaussian kernel $g(\mathbf{p}; \Sigma)$ (1), where $\mathbf{p} = (x, y, t)^T$ is the spatiotemporal position vector, and the parameters $\sigma^2$ and $\tau^2$ of the covariance matrix $\Sigma$ correspond to the spatial and temporal scale parameters respectively and define the spatiotemporal extension of the neighbourhoods.

$$g(\mathbf{p}; \Sigma) = \frac{1}{\sqrt{(2\pi)^3 \det \Sigma}} \; e^{-\frac{1}{2}\,\mathbf{p}^T \Sigma^{-1} \mathbf{p}}, \qquad \Sigma = \begin{pmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \tau^2 \end{pmatrix} \qquad (1)$$

A spatiotemporal second-moment matrix (2) is defined in terms of the spatiotemporal gradients $\nabla L = (L_x, L_y, L_t)^T$ and weighted with a Gaussian window function:

$$\mu = g(\cdot\,; \Sigma) * \big(\nabla L \,(\nabla L)^T\big) = g(\cdot\,; \Sigma) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix} \qquad (2)$$

The spatiotemporal second-moment matrix $\mu$, also considered a structure tensor, is interpreted in terms of its eigenvalues. This makes it possible to distinguish image structures with variations over one, two and three dimensions. Three-dimensional variation of $f$ corresponds to image points with non-constant motion. Such points can be detected by maximizing the three eigenvalues $\lambda_1$, $\lambda_2$, $\lambda_3$ of $\mu$ over space and time.

STIP detection is realized by extending the Harris operator $H$ to the spatiotemporal domain (3); detection is based on points with high eigenvalues. Local maxima of $H$ correspond to points with high values of $\lambda_1$, $\lambda_2$, $\lambda_3$ ($\lambda_1 \leq \lambda_2 \leq \lambda_3$):

$$H = \det(\mu) - k\, \mathrm{trace}^3(\mu) = \lambda_1 \lambda_2 \lambda_3 - k\,(\lambda_1 + \lambda_2 + \lambda_3)^3 \qquad (3)$$

$H$ can be written as equation (4), where $\alpha = \lambda_2 / \lambda_1$ and $\beta = \lambda_3 / \lambda_1$:

$$H = \lambda_1^3 \big(\alpha\beta - k\,(1 + \alpha + \beta)^3\big) \qquad (4)$$

From the requirement $H \geq 0$, we get the condition represented by (5):

$$k \leq \frac{\alpha\beta}{(1 + \alpha + \beta)^3} \qquad (5)$$
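As a rough illustration of the detector described above, the following Python sketch computes the response $H = \det(\mu) - k\,\mathrm{trace}^3(\mu)$ over a video volume and extracts its positive local maxima. The parameter values ($\sigma$, $\tau$, $k$, the integration-scale factor and the threshold) are illustrative assumptions, not values taken from the chapter.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def stip_response(f, sigma=2.0, tau=2.0, k=0.005):
    """Spatiotemporal Harris response H = det(mu) - k*trace(mu)^3 for a
    video volume f with axes (y, x, t). Parameter values are illustrative."""
    f = np.asarray(f, dtype=np.float64)
    # Scale-space representation L (Eq. 1): separable Gaussian smoothing
    # with spatial scale sigma and temporal scale tau.
    L = gaussian_filter(f, sigma=(sigma, sigma, tau))
    # Spatiotemporal gradients (L_x, L_y, L_t).
    Ly, Lx, Lt = np.gradient(L)

    # Gaussian window integrating the gradient products (Eq. 2); the
    # integration scale is taken as twice the differentiation scale
    # (an assumption).
    def w(a):
        return gaussian_filter(a, sigma=(2 * sigma, 2 * sigma, 2 * tau))

    mxx, mxy, mxt = w(Lx * Lx), w(Lx * Ly), w(Lx * Lt)
    myy, myt, mtt = w(Ly * Ly), w(Ly * Lt), w(Lt * Lt)
    # Pointwise H = det(mu) - k * trace(mu)^3 (Eq. 3).
    det = (mxx * (myy * mtt - myt * myt)
           - mxy * (mxy * mtt - myt * mxt)
           + mxt * (mxy * myt - myy * mxt))
    trace = mxx + myy + mtt
    return det - k * trace ** 3

def detect_stips(f, threshold=1e-9, **kwargs):
    """Return (y, x, t) coordinates of positive local maxima of H."""
    H = stip_response(f, **kwargs)
    peaks = (H == maximum_filter(H, size=5)) & (H > threshold)
    return np.argwhere(peaks)
```

On a sequence in which a pattern appears or changes its motion, the positive maxima of $H$ cluster around the space-time corners of the event; a purely static structure yields $H \leq 0$ everywhere, because $L_t$ vanishes and $\det(\mu)$ collapses to zero while the trace term stays non-negative.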

We propose in this chapter a motion analysis and classification approach to learn and recognize human actions in video, taking advantage of the robustness of STIPs and of unsupervised learning approaches. Experimental results are validated on the KTH human action database (Schuldt et al., 2004) and the ATSI Human Action Database (see Figure 1), and are compared with recent work on human motion analysis and recognition.
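One common way to combine STIP features with unsupervised learning is a bag-of-features pipeline: local descriptors collected at the detected STIPs are clustered into a codebook of visual words, each video is summarized by a word histogram, and a new video is labelled by its nearest training histogram. The sketch below illustrates that general idea only; the k-means codebook and nearest-neighbour classifier are assumptions for illustration, not necessarily the exact method developed in this chapter.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal k-means clustering of local descriptors into k visual
    words. Naive deterministic init: centroids spread over the data."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = X[labels == j].mean(0)
    return centers

def bow_histogram(descriptors, codebook):
    """Quantize each descriptor to its nearest visual word and return
    the normalized word-count histogram describing one video."""
    d = ((descriptors[:, None, :] - codebook[None]) ** 2).sum(-1)
    h = np.bincount(d.argmin(1), minlength=len(codebook)).astype(float)
    return h / h.sum()

def classify(hist, train_hists, train_labels):
    """Nearest-neighbour action label by squared histogram distance."""
    return train_labels[((train_hists - hist) ** 2).sum(1).argmin()]
```

The codebook construction is unsupervised: no action labels are used when clustering the descriptors, which is what makes this family of pipelines attractive when labelled video is scarce.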

Fig. 1. Samples from the KTH human action database
