in an interval [n-2, n+2], where n is the time of the local maxima of the STIPs. Figure 5 shows the application of the smoothing algorithm on the activity function curves for samples from the KTH human action database.

Fig. 5. Application of the smoothing algorithm on the activity function curves for samples from the KTH human action database.
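
The chapter does not spell out the smoothing operator, only the 5-frame interval [n-2, n+2]. The sketch below is one possible reading in Python: the activity function is aggregated over that sliding window with a plain sum, which is consistent with the observation that smoothing raises the local maxima (a mean filter would smooth as well). The function name `smooth_activity` and the edge-padding choice are ours.

```python
import numpy as np

def smooth_activity(activity, radius=2):
    """Aggregate a 1-D activity function over the window [n - radius, n + radius].

    activity[n] is assumed to be the activity value (e.g. the number of STIPs
    detected) at frame n; radius=2 gives the 5-frame interval [n-2, n+2] used
    in the text.  A sliding-window sum both smooths the curve and raises its
    local maxima.
    """
    activity = np.asarray(activity, dtype=float)
    window = 2 * radius + 1
    padded = np.pad(activity, radius, mode="edge")   # keep the original length
    kernel = np.ones(window)
    return np.convolve(padded, kernel, mode="valid")

# Example with a toy activity curve (one value per frame).
activity = [0, 3, 9, 4, 1, 0, 2, 8, 10, 7, 2, 0]
smoothed = smooth_activity(activity, radius=2)
```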

We note that smoothing reduces the noise of the activity function and increases the values of its local maxima. To detect the locations of the local maxima, a Gaussian model is fitted to the activity function. This model gives the number of local maxima and their times in a sequence. In addition, it contributes to motion recognition when the parameters of the fitted Gaussian model are used in the classification algorithm. The value of the global maximum is also deduced, in order to detect movements with only one global maximum. Figure 6 shows the application of the Gaussian model to the activity function on sequences taken from the KTH human action database (from left to right and row-wise in the figure, the actions are boxing, walking, hand waving, jogging, hand clapping and running).
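
How the Gaussian model is fitted is not detailed in the chapter; the sketch below assumes a sum of Gaussians fitted to the smoothed activity function by non-linear least squares (SciPy's `curve_fit`), with initial guesses obtained by simple peak picking. The function names, the initial standard deviation of 2 frames and the fallback for curves with a single peak are our choices; the returned quantities are the ones discussed in the text (number of local maxima, their mean value and the global maximum).

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.signal import find_peaks

def sum_of_gaussians(t, *params):
    """Sum of K Gaussians; params = (a1, mu1, sigma1, a2, mu2, sigma2, ...)."""
    y = np.zeros_like(t, dtype=float)
    for a, mu, sigma in zip(params[0::3], params[1::3], params[2::3]):
        y += a * np.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2))
    return y

def fit_gaussian_model(smoothed):
    """Fit a sum of Gaussians to the smoothed activity function."""
    t = np.arange(len(smoothed), dtype=float)

    # Rough initial guesses from simple peak picking.
    peaks, _ = find_peaks(smoothed)
    if len(peaks) == 0:                       # monotone curve: one peak at the maximum
        peaks = np.array([int(np.argmax(smoothed))])
    p0 = []
    for p in peaks:
        p0 += [smoothed[p], float(p), 2.0]    # amplitude, mean, std (in frames)

    params, _ = curve_fit(sum_of_gaussians, t, smoothed, p0=p0, maxfev=10000)

    amplitudes = params[0::3]
    times = params[1::3]
    model = sum_of_gaussians(t, *params)
    return {
        "n_local_maxima": len(times),         # number of repetitions of the motion
        "maxima_times": times,
        "mean_local_maximum": float(np.mean(amplitudes)),
        "global_maximum": float(np.max(model)),
    }
```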

Table 2 reports the number of local maxima, their mean value and the global maximum value for different action classes taken from the KTH human action database. We note that the number of local maxima corresponds to the number of repetitions in a human movement such as walking or hand clapping. For fast movements such as running, the smoothing algorithm reduces the number of local maxima to one and extracts a single global maximum. The average value of the local maxima is a significant parameter for the classification of human movements: movements performed only with the arms, such as boxing, hand waving and hand clapping, have lower values than those performed with the whole body, such as running, jogging and walking. The global maximum can also contribute to the classification, since its values are different from one motion class to another.

**3.3 Spatiotemporal boxes**

STIPs are the most significant motion locations in video sequences. Most of the STIPs are located on the most informative human body parts, such as the knees, the elbow joints and the moving limbs. Boxes containing these STIPs, called "spatiotemporal boxes", can be considered as important information to describe the actions and to differentiate between them. Spatiotemporal boxes containing the detected STIPs are the most salient regions for describing human motion. The size of the boxes can be effective information to differentiate between motion performed only with the hands and motion performed with the full body (see Figure 7).

For all STIPs belonging to the same image, we determine their spatial coordinates $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ in the image reference frame. The spatiotemporal box can then be described by the rectangle between the points $(x_{Left}, y_{Top})$ and $(x_{Right}, y_{Bottom})$, whose coordinates are determined by the following equations:

$$\begin{aligned}
x_{Left} &= \min(x_1, x_2, \dots, x_n) - r \\
y_{Top} &= \min(y_1, y_2, \dots, y_n) - r \\
x_{Right} &= \max(x_1, x_2, \dots, x_n) + r \\
y_{Bottom} &= \max(y_1, y_2, \dots, y_n) + r
\end{aligned} \tag{9}$$

where $r$ is the extension radius of the spatiotemporal boxes. Figure 7 shows spatiotemporal boxes detected on images taken from the KTH human action database.

Fig. 7. Spatiotemporal boxes detected on images taken from the KTH human action database
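
As a concrete illustration, Eq. (9) translates directly into a few lines of code. This is only a sketch: the function name, the example coordinates and the radius value are illustrative, and the width/height computation at the end reflects the remark above that the box size helps separate hand-only motions from full-body motions.

```python
import numpy as np

def spatiotemporal_box(points, r):
    """Bounding box of the STIPs detected in one frame (Eq. 9).

    points is an (n, 2) array of (x, y) STIP coordinates and r is the
    extension radius.  Returns (x_left, y_top, x_right, y_bottom).
    """
    points = np.asarray(points, dtype=float)
    x, y = points[:, 0], points[:, 1]
    x_left, y_top = x.min() - r, y.min() - r
    x_right, y_bottom = x.max() + r, y.max() + r
    return x_left, y_top, x_right, y_bottom

# Illustrative STIPs of a single frame; the box size (width, height) can
# help separate hand-only motions from full-body motions.
stips = [(120, 80), (135, 95), (128, 210), (140, 230)]
x_left, y_top, x_right, y_bottom = spatiotemporal_box(stips, r=5)
width, height = x_right - x_left, y_bottom - y_top
```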

Considering motions performed with the full body, we classify the STIPs into two sets: high body part STIPs (H-STIP) and low body part STIPs (L-STIP). To achieve this classification, we detect the centroid of the body silhouette in every frame of the sequence. Points located above the centroid are classified as H-STIP and points located below the centroid are classified as L-STIP, as shown in Figure 8.

Fig. 8. Classification of H-STIP and L-STIP for action samples from KTH human action database
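
A minimal sketch of this split, assuming a binary silhouette mask is already available (the chapter does not detail how the silhouette is extracted) and that the image y axis grows downwards. The function names are illustrative.

```python
import numpy as np

def silhouette_centroid(mask):
    """Centroid (x, y) of a binary silhouette mask (rows = y, cols = x)."""
    ys, xs = np.nonzero(mask)
    return float(xs.mean()), float(ys.mean())

def split_stips(points, centroid_y):
    """Split the STIPs of one frame into H-STIP (above the silhouette
    centroid) and L-STIP (below it); the image y axis grows downwards."""
    points = np.asarray(points, dtype=float)
    above = points[:, 1] < centroid_y
    return points[above], points[~above]   # (H-STIP, L-STIP)
```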

The evolution of H-STIP and L-STIP over time, compared to the centroid (see Figure 9), can provide discriminative information to classify actions. In fact, the actions containing both H-STIP and L-STIP are running, jogging and walking, while boxing, hand waving and hand clapping contain only points of the H-STIP type.
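
One simple way to turn this observation into a feature is to test whether an action produces L-STIPs at all; this is our own coarse reading of Figure 9, and the tolerance `min_frames` is an assumption, not a value from the chapter.

```python
def uses_lower_body(l_stips_per_frame, min_frames=5):
    """True when L-STIPs appear in at least `min_frames` frames.

    Running, jogging and walking produce L-STIPs throughout the sequence,
    whereas boxing, hand waving and hand clapping produce (almost) none.
    `l_stips_per_frame` is a list with one array of L-STIPs per frame.
    """
    frames_with_l_stip = sum(1 for pts in l_stips_per_frame if len(pts) > 0)
    return frames_with_l_stip >= min_frames
```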

Fig. 9. An illustration of the evolution of H-STIP, L-STIP and the centroid in time for the running action

**4. Motion classification**

To obtain a fair judgement of the performance of the proposed approach, we compare our results with other human action recognition approaches using the same database.
