**3.1 Multichannel Gradient Model**

The Multichannel Gradient Model (McGM), developed by Johnston and co-workers (Johnston *et al*., 1995, 1996), has been implemented and selected here due to its robustness and bio-inspiration. The model addresses many goals, such as invariance to illumination, static patterns and contrast, and operation in noisy environments. Additionally, it is robust against failures, accounts for some optical illusions (Anderson *et al*., 2003), and detects second-order motion (Johnston, 1994), which is particularly useful in camouflage tasks, among others. At the same time, it avoids operations such as matrix inversion or iterative methods that are not biologically justified (Baker & Matthews, 2004; Lucas & Kanade, 1981). The main drawback of this system is its huge computational complexity. It is able to handle complex situations in real environments better than other algorithms (Johnston *et al*., 1994), with its physical architecture and design principles being based on the biological neural systems of mammals (Bruce *et al*., 1996). Experimental results are provided using a Celoxica RC1000 platform (Alphadata, 2007).


This approach is based on the gradient scheme commented on previously; the starting point is the motion constraint equation (MCE) shown in expression (2), where the luminance variation is assumed to be negligible over time. Velocity is calculated by dividing the temporal derivative by the spatial derivative of the image brightness, so a gradient model can be obtained by applying pairs of filters, one being a spatial derivative and the other a temporal derivative. If, for the sake of clarity, we consider only the x variable in expression (2), the velocity is obtained by taking the quotient of the filter outputs:

$$\mathbf{v} = \frac{d\mathbf{x}}{dt} = -\frac{\partial I}{\partial t} \Big/ \frac{\partial I}{\partial \mathbf{x}}\tag{5}$$
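As an illustration of (5), the following sketch (a minimal NumPy version, not the hardware implementation described later) estimates a per-pixel horizontal velocity from finite-difference derivatives of a small synthetic sequence; pixels whose spatial derivative is nearly null are masked, anticipating the problem discussed next.

```python
import numpy as np

def gradient_velocity_x(seq, eps=1e-3):
    """Per-pixel horizontal velocity v = -It/Ix for a (frames, rows, cols) sequence.

    Illustrative sketch of expression (5): derivatives are plain finite
    differences, and pixels with |Ix| <= eps are left undefined (NaN).
    """
    I_t = np.gradient(seq, axis=0)          # temporal derivative
    I_x = np.gradient(seq, axis=2)          # spatial derivative along x
    v = np.full(seq.shape, np.nan)
    ok = np.abs(I_x) > eps                  # avoid dividing by a (near-)null spatial derivative
    v[ok] = -I_t[ok] / I_x[ok]
    return v

# A sinusoidal pattern translating one pixel per frame to the right.
row = np.sin(np.linspace(0, 8 * np.pi, 64))
frames = np.stack([np.roll(np.tile(row, (64, 1)), k, axis=1) for k in range(5)])
print(np.nanmedian(gradient_velocity_x(frames)))   # close to +1 pixel/frame
```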

Since velocity is given directly by the ratio of luminance derivatives, a potential problem appears when the output of the spatial filter is null, leaving the velocity undefined. This can be solved by applying a threshold to the calculation or restricting the evaluation value (Baker & Matthews, 2004). Our approach is based on the fact that the human visual system measures at least three orders of spatial derivative (Johnston *et al*., 1999; Koenderick & Van Doorn, 1988) and three orders of temporal differentiation (Baker & Matthews, 2004; Hess & Snowden, 1992). Therefore, it is possible to build low-level filters that calculate the speed from additional derivatives, although the estimate may still be ill-conditioned:

$$v = -\frac{\partial^{n} I}{\partial x^{n-1}\,\partial t}\bigg/\frac{\partial^{n} I}{\partial x^{n}}\tag{6}$$

Two vectors X and T, containing the results of applying the derivative operators to the image brightness, can be built:

$$\mathbf{X}=\left(\frac{\partial I}{\partial x},\frac{\partial^{2} I}{\partial x^{2}},\dots,\frac{\partial^{n} I}{\partial x^{n}}\right)\qquad \mathbf{T}=\left(\frac{\partial I}{\partial t},\frac{\partial^{2} I}{\partial x\,\partial t},\dots,\frac{\partial^{n} I}{\partial x^{n-1}\,\partial t}\right)\tag{7}$$

To extract the best approximation to the speed from each of these measurements, a least-squares formulation is applied, recovering a value v'. The denominator is a sum of squares and therefore is never null provided there is some spatial structure:

$$v'=-\frac{\sum_{n}\dfrac{\partial^{n} I}{\partial x^{n-1}\,\partial t}\,\dfrac{\partial^{n} I}{\partial x^{n}}}{\sum_{n}\left(\dfrac{\partial^{n} I}{\partial x^{n}}\right)^{2}}\tag{8}$$
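A one-dimensional sketch of (6)-(8), assuming the derivative "vectors" are built by repeated numerical differentiation of a 1-D signal and its frame difference; the function and variable names are illustrative and not part of the original filter bank.

```python
import numpy as np

def ls_velocity_1d(frame0, frame1, order=3):
    """Least-squares velocity (8) built from the derivative vectors of (7).

    X_n = d^n I / dx^n and T_n = d^n I / dx^(n-1) dt (temporal derivative taken
    as a simple frame difference).  The denominator is a sum of squares, so it
    is non-null wherever there is some spatial structure.
    """
    I_t = frame1.astype(float) - frame0.astype(float)   # dI/dt by frame differencing
    dI, dIt = frame0.astype(float), I_t
    X, T = [], []
    for _ in range(order):
        dI = np.gradient(dI)        # one more spatial derivative of I
        X.append(dI)
        T.append(dIt)               # spatial derivatives of dI/dt, one order behind
        dIt = np.gradient(dIt)
    X, T = np.array(X), np.array(T)
    return -np.sum(X * T, axis=0) / (np.sum(X * X, axis=0) + 1e-12)

x = np.linspace(0, 2 * np.pi, 128)
f0, f1 = np.sin(4 * x), np.sin(4 * (x - 0.05))          # pattern shifted by ~1 pixel
print(np.median(ls_velocity_1d(f0, f1)))                 # close to +1 pixel/frame
```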

In this framework, we represent the local image structure in the primary visual cortex as a spatial-temporal truncated Taylor expansion (Johnston *et al*., 1996, 1999; Koenderick & Van Doorn, 1988), so that a local region is represented by the weighted outputs of a set of filters applied at a location in the image, as shown in expression (9). The weights attached depend on the direction and length of the vector joining the point where the measurements are taken and the point where we wish to estimate image brightness:


$$\begin{aligned} I(x+p,\,y+q,\,t+r) = &\; [I(x,y,t)] + [I_{x}(x,y,t)\,p + I_{y}(x,y,t)\,q + I_{t}(x,y,t)\,r] \\ &+ \tfrac{1}{2}\,[I_{2x}(x,y,t)\,p^{2} + I_{xy}(x,y,t)\,2pq + I_{xt}(x,y,t)\,2pr + I_{yt}(x,y,t)\,2qr + \dots] + \dots \end{aligned} \tag{9}$$

These filters are generated by progressively increasing the order of the spatial and temporal differential operators applied to the following kernel filter:

$$K(r,t)=\frac{1}{4\pi\sigma}\,e^{-\frac{r^{2}}{4\sigma}}\;\frac{1}{\alpha\tau\sqrt{\pi}\,e^{\tau^{2}/4}}\;e^{-\left(\frac{\ln\left(t/\alpha\right)}{\tau}\right)^{2}}\tag{10}$$

with σ=1.5, α=10 and τ=0.2. This expression is scaled so that its integral over the spatial-temporal support equals 1.0. It is also tuned assuming a spatial frequency limit of 60 cycles/deg and a critical flicker fusion limit of 60 Hz, following evidence from the human visual system (Johnston *et al*., 1994).
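To make (10) concrete, the sketch below evaluates a kernel of this form (a spatial Gaussian times a log-Gaussian in time) with the quoted parameters and checks numerically that it integrates to roughly one; the exact normalisation constant used here is an assumption chosen to satisfy the unit-integral property stated above.

```python
import numpy as np

sigma, alpha, tau = 1.5, 10.0, 0.2          # parameters quoted in the text

def spatial_factor(x, y):
    """Spatial part of (10): an isotropic Gaussian normalised to unit integral."""
    return np.exp(-(x**2 + y**2) / (4.0 * sigma)) / (4.0 * np.pi * sigma)

def temporal_factor(t):
    """Temporal part of (10): a log-Gaussian in t > 0 (normalisation assumed)."""
    norm = alpha * tau * np.sqrt(np.pi) * np.exp(tau**2 / 4.0)
    return np.exp(-(np.log(t / alpha) / tau) ** 2) / norm

# Numerical check that the kernel integrates to ~1 (the two factors are separable).
x = np.linspace(-15, 15, 601)
t = np.linspace(1e-3, 60, 6001)
Xg, Yg = np.meshgrid(x, x)
s_int = spatial_factor(Xg, Yg).sum() * (x[1] - x[0]) ** 2
t_int = temporal_factor(t).sum() * (t[1] - t[0])
print(s_int * t_int)                        # approximately 1.0
```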

The Taylor representation requires a bank of linear filters taking derivatives in time, t, and in the two spatial directions, x and y, with the derivative lobes of their receptive fields tuned to different frequencies. There is neurophysiological and psychophysical evidence supporting this (Johnston *et al*., 1994; Bruce *et al*., 1996). The algorithm could be implemented by neural systems in the visual cortex, since all operations involved in the model can be achieved by combining the outputs of linear spatial-temporal oriented filters through addition, multiplication and division.

A truncated Taylor expansion is built by eliminating terms above first order in time and in the orthogonal direction, ensuring that there are no more than three temporal filters and no greater spatial complexity in the filters (Hess & Snowden, 1992).

The reference frame is rotated through a number of orientations with respect to the input image (24 orientations spaced over 360º in the original model). For each orientation, three vectors of filters are created by differentiating the vector of filter kernels with respect to *x*, *y* and *t*. From these measurements, speed, orthogonal speed, inverse speed and orthogonal inverse speed are calculated, the local speed being recovered by rotating the reference frame:

$$\hat{s}_{\parallel}=\text{speed}=\frac{\mathbf{X}\cdot\mathbf{T}}{\mathbf{X}\cdot\mathbf{X}}\cos^{2}\theta=\frac{\mathbf{X}\cdot\mathbf{T}}{\mathbf{X}\cdot\mathbf{X}}\left(1+\left(\frac{\mathbf{X}\cdot\mathbf{Y}}{\mathbf{X}\cdot\mathbf{X}}\right)^{2}\right)^{-1}\tag{11}$$

$$\hat{s}_{\perp}=\text{orthogonal speed}=\frac{\mathbf{Y}\cdot\mathbf{T}}{\mathbf{Y}\cdot\mathbf{Y}}\sin^{2}\theta=\frac{\mathbf{Y}\cdot\mathbf{T}}{\mathbf{Y}\cdot\mathbf{Y}}\left(1+\left(\frac{\mathbf{X}\cdot\mathbf{Y}}{\mathbf{Y}\cdot\mathbf{Y}}\right)^{2}\right)^{-1}\tag{12}$$


$$\bar{s}_{\parallel}=\text{inverse speed}=\frac{\mathbf{X}\cdot\mathbf{T}}{\mathbf{T}\cdot\mathbf{T}}\tag{13}$$

$$\bar{s}_{\perp}=\text{inverse orthogonal speed}=\frac{\mathbf{Y}\cdot\mathbf{T}}{\mathbf{T}\cdot\mathbf{T}}\tag{14}$$

where *X*, *Y* and *T* are the vectors of outputs from the x-, y- and t-differentiated filters. Raw speed (*X·T* / *X·X*) and orthogonal speed (*Y·T* / *Y·Y*) measurements are ill-conditioned if there is no change over x and y, respectively.
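A minimal transcription of the four primitives (11)-(14), assuming *X*, *Y* and *T* are already available as 1-D NumPy arrays of filter outputs at a single location and orientation (an illustrative data layout, not the FPGA word format):

```python
import numpy as np

def speed_primitives(X, Y, T):
    """Direct and inverse speed primitives (11)-(14), transcribed literally."""
    XX, YY, TT = X @ X, Y @ Y, T @ T
    XT, YT, XY = X @ T, Y @ T, X @ Y
    s_par   = (XT / XX) / (1.0 + (XY / XX) ** 2)   # (11) speed            ~ (X.T/X.X) cos^2(theta)
    s_orth  = (YT / YY) / (1.0 + (XY / YY) ** 2)   # (12) orthogonal speed ~ (Y.T/Y.Y) sin^2(theta)
    si_par  = XT / TT                              # (13) inverse speed
    si_orth = YT / TT                              # (14) inverse orthogonal speed
    return s_par, s_orth, si_par, si_orth
```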

To avoid degrading the final velocity estimation, they are conditioned by measurements of the angle of the image structure relative to the reference frame (θ). If the speed is large (and the inverse speed is small), the direction is led by the speed measurements; if the speed is small (and the inverse speed is large), the measurement is led by the inverse speed.

The use of these antagonistic and complementary measurements provides advantages in any system susceptible to small signals affected by noise (Anderson *et al*., 2003). There is evidence of neurons that compute inverse speed (Lagae *et al*., 1983), and it also provides an explanation for the sensitivity to static noise observed in motion-blind patients.

The calculation of additional image measurements in order to increase robustness is one of the core design criteria that enhance the model. Finally, the motion modulus is calculated through a quotient of determinants:

$$\text{Modulus}^{2}=\frac{\begin{vmatrix}\hat{s}_{\parallel}\cos\theta & \hat{s}_{\parallel}\sin\theta\\ \hat{s}_{\perp}\cos\theta & \hat{s}_{\perp}\sin\theta\end{vmatrix}}{\begin{vmatrix}\hat{s}_{\parallel}\bar{s}_{\parallel} & \hat{s}_{\parallel}\bar{s}_{\parallel}\\ \hat{s}_{\perp}\bar{s}_{\perp} & \hat{s}_{\perp}\bar{s}_{\perp}\end{vmatrix}}\tag{15}$$

Speed and orthogonal speed vary with the angle of the reference frame. The numerator of (15) takes a measure of the amplitude of the distribution of speed measurements, combined across both speed and orthogonal speed, while the denominator is included to stabilize the final speed estimation. The direction of motion is extracted by calculating a phase measurement that is combined across all speed-related measures, since they are in phase:

$$\text{Phase}=\tan^{-1}\left(\frac{(\bar{s}_{\parallel}+\hat{s}_{\parallel})\sin\theta+(\bar{s}_{\perp}+\hat{s}_{\perp})\cos\theta}{(\bar{s}_{\parallel}+\hat{s}_{\parallel})\cos\theta-(\bar{s}_{\perp}+\hat{s}_{\perp})\sin\theta}\right)\tag{16}$$
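For a single orientation θ of the reference frame, (16) can be transcribed directly; atan2 is used instead of a plain arctangent so that the quadrant of the direction is preserved (an implementation choice, not something specified in the text):

```python
import numpy as np

def motion_phase(s_par, s_orth, si_par, si_orth, theta):
    """Direction of motion, as a literal transcription of (16)."""
    num = (si_par + s_par) * np.sin(theta) + (si_orth + s_orth) * np.cos(theta)
    den = (si_par + s_par) * np.cos(theta) - (si_orth + s_orth) * np.sin(theta)
    return np.arctan2(num, den)   # quadrant-aware arctangent of num/den
```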

The model can be degraded to an ordinary gradient model (Baker & Matthews, 2004; Lucas & Kanade, 1981) by reducing the number of orientations of the reference frame, taking only one temporal derivative and two spatial derivatives, ignoring the inverse velocity measures, and so on. The computations of speed and direction are based directly on the outputs of filters applied to the input image, providing a dense final map of information.

The general structure of the implementation is summarized in Figure 8, where the operations are divided into conceptual stages, with several variations introduced to improve the viability of the hardware implementation.


Fig. 8. General structure of the model implemented.

Stage I performs temporal differentiation through IIR filtering; the output of this stage is the first three temporal derivatives of the input. Stage II performs the spatial differentiation, building a pyramidal structure from each temporal derivative. Figure 9 shows what the authors (Botella *et al*., 2010) call the "Convolutive Unit Cell", which implements the separable convolution organized in rows and columns. Each part of this cell is replicated as many times as needed to build the pyramidal structure.
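The behaviour of such a cell can be sketched in software as a separable 2-D convolution, one 1-D pass along rows followed by one along columns, which is the row/column organisation the hardware exploits; the function name and the 5-tap kernel below are illustrative only.

```python
import numpy as np

def separable_conv2d(image, kernel_1d):
    """Separable 2-D convolution: filter the rows, then the columns.

    For a kernel k(x)k(y) this equals the full 2-D convolution but costs
    2N instead of N^2 multiplications per pixel.
    """
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel_1d, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel_1d, mode="same"), 0, rows)

# One level of a pyramidal structure: smooth, then subsample by two.
img = np.random.rand(64, 64)
smooth = separable_conv2d(img, np.array([1, 4, 6, 4, 1]) / 16.0)
coarse = smooth[::2, ::2]
```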

Stage III steers each of the space-time functions calculated previously. Stage IV performs the Taylor expansion and its derivatives over x, y and t, delivering at the output a sextet containing their products. Stage V forms the quotients of this sextet. Stage VI forms the four measurements corresponding to the direct and inverse speeds (11)-(14), which act as primitives for the final velocity estimation. Finally, Stage VII computes the modulus and phase values (15)-(16) in software.

Stage VI does not calculate the final velocity estimation, owing to the bio-inspired nature of the model (multiple speed measurements are combined, so that if the direct speed does not provide an accurate value the inverse speed will, and *vice versa*). Nevertheless, each speed measurement is itself a velocity estimate, so it could be used as the final velocity estimation, even though this would degrade the robustness of the model.


Fig. 9. Unit cell to perform the convolution operation.

**3.2 Low and mid-level vision platform. Orthogonal variant moments**

One of the most well-established approaches in computer vision and image analysis is the use of moment invariants. Moment invariants, surveyed extensively by Prokop and Reeves (Prokop & Reeves, 1992) and more recently by Flusser (Flusser, 2006), were first introduced to the pattern recognition community by Hu (Hu, 1961), who employed results from the theory of algebraic invariants and derived a set of seven moment invariants (the well-known Hu invariant set). The Hu invariant set is now a classical reference in any work that makes use of moments. Since its introduction, numerous works have been devoted to various improvements, generalizations and applications in different areas; e.g., various types of moments such as Zernike moments, pseudo-Zernike moments, rotational moments and complex moments have been used to recognize image patterns in a number of applications (The & Chin, 1986).

The problem of the influence of discretization and noise on the accuracy of moments as object descriptors has been addressed previously, and several techniques have been proposed to increase the accuracy and efficiency of moment descriptors (Zhang, 2000; Papakostas *et al*., 2007, 2009; Sookhanaphibarn & Lursinsap, 2006).

In short, moment invariants are measures of an image or signal that remain constant under some transformations, e.g., rotation, scaling, translation or illumination changes. Moments are applicable to different aspects of image processing, ranging from invariant pattern recognition and image encoding to pose estimation. Such moments can produce image descriptors invariant under rotation, scale, translation, orientation, etc.
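As a reference point for the moment-based part of the platform, the sketch below computes raw and central geometric moments of a grey-level image, the standard building blocks from which descriptors such as area and centroid are derived; it is a generic illustration, not the chapter's (A, LX, LY, PX, PY) moment set.

```python
import numpy as np

def raw_moment(img, p, q):
    """Raw geometric moment m_pq = sum over pixels of x^p * y^q * I(x, y)."""
    y, x = np.indices(img.shape)
    return np.sum((x ** p) * (y ** q) * img)

def central_moment(img, p, q):
    """Central moment mu_pq, translation invariant by construction."""
    m00 = raw_moment(img, 0, 0)
    xc, yc = raw_moment(img, 1, 0) / m00, raw_moment(img, 0, 1) / m00
    y, x = np.indices(img.shape)
    return np.sum(((x - xc) ** p) * ((y - yc) ** q) * img)

img = np.zeros((64, 64)); img[20:40, 10:50] = 1.0     # a bright rectangle
print(raw_moment(img, 0, 0),                          # area-like descriptor (800)
      raw_moment(img, 1, 0) / raw_moment(img, 0, 0))  # x centroid (~29.5)
```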


The implementation of these systems combines the low-level vision optical flow model (Botella *et al*., 2010) with the orthogonal variant moments (Martín.H *et al*., 2010), namely Area, Length and Phase for the two Cartesian components (A, LX, LY, PX, PY), to build a real-time mid-level vision platform able to deliver output tasks such as tracking, segmentation and so on (Botella *et al*., 2011).

The architecture of the system is shown in Figure 10. Several external memory banks have been used for the different implementations, accessed from both the FPGA and the PCI bus. The low-level optical flow core is designed and built as an asynchronous pipeline (micropipeline), where a token is passed to the next core each time a core finishes its processing. The high-level description tool Handel-C, within the DK environment, is used to implement this core. The board is the well-known AlphaData RC1000 (Alphadata, 2007), which includes a Virtex 2000E-BG560 chip and four 2-MByte SRAM banks, as shown in Figures 11a and 12a. The low-level moment vision platform, in contrast, is implemented in parallel, each single moment being computed independently. Each orthogonal variant moment and the optical flow scheme feed the final mid-level vision estimation. The multimodal sensor core integrates information from different abstraction layers (six modules for the optical flow, five modules for the orthogonal moments and one module for the mid-level vision tasks). In this work, the mid-level vision core is arranged for segmentation and tracking estimation, together with an efficient implementation of the clustering algorithm, although additional functionality can be added to this last module using this general architecture.
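The token-passing behaviour of the micropipeline can be mimicked in software with bounded queues, each stage blocking until the previous core hands over its result; this is only a toy analogue of the asynchronous handshake, with made-up stage functions.

```python
import threading, queue

def core(fn, q_in, q_out):
    """One pipeline core: wait for a token, process it, pass the result on."""
    while True:
        item = q_in.get()           # blocks until the previous core is done
        q_out.put(None if item is None else fn(item))
        if item is None:            # poison pill: shut the pipeline down
            return

q0, q1, q2 = queue.Queue(1), queue.Queue(1), queue.Queue(1)
threading.Thread(target=core, args=(lambda f: f + 1, q0, q1), daemon=True).start()
threading.Thread(target=core, args=(lambda f: f * 2, q1, q2), daemon=True).start()

for frame in [1, 2, 3, None]:
    q0.put(frame)
while (out := q2.get()) is not None:
    print(out)                      # 4, 6, 8
```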

Fig. 10. Scheme of the real-time architecture of low-level and mid-level vision.

Fig. 11. (11a, left) Xilinx Virtex 2000E. (11b, right) Altera Cyclone II.


Fig. 12. (12a, left) ALPHADATA RC1000 branded by Celoxica. (12b, right) DE2 branded by Altera.

**3.3 Block Matching models**

The full search technique (FST) is the most straightforward block matching method (BMM) and also the most accurate one. FST (Figure 13a) matches all possible blocks within a search window in the reference frame to find the block with the minimum sum of absolute differences (SAD), which is defined as:

$$\text{SAD}(x,y;u,v)=\sum_{x=0}^{31}\sum_{y=0}^{31}\left|I_{t}(x,y)-I_{t-1}(x+u,\,y+v)\right|\tag{17}$$

where I_t(x, y) represents the pixel value at coordinate (x, y) in frame t and (u, v) is the displacement of the candidate macroblock (MB). For example, for a block of size 32×32, the FST algorithm requires 1024 subtractions and 1023 additions to calculate one SAD. The required number of candidate blocks is (1+2d)² when the search window is limited to ±d pixels, with d usually chosen as a power of two.
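A compact NumPy sketch of (17) and of the exhaustive search, assuming 32×32 macroblocks and a ±d-pixel search window; variable names are illustrative.

```python
import numpy as np

def sad(block, candidate):
    """Sum of absolute differences (17) between two equally sized blocks."""
    return np.abs(block.astype(int) - candidate.astype(int)).sum()

def full_search(cur, ref, x, y, block=32, d=16):
    """Full Search Technique: test every (u, v) displacement within +/-d pixels."""
    target = cur[y:y + block, x:x + block]
    best = (0, 0, np.inf)
    for v in range(-d, d + 1):
        for u in range(-d, d + 1):
            yy, xx = y + v, x + u
            if 0 <= yy and 0 <= xx and yy + block <= ref.shape[0] and xx + block <= ref.shape[1]:
                cost = sad(target, ref[yy:yy + block, xx:xx + block])
                if cost < best[2]:
                    best = (u, v, cost)
    return best            # motion vector (u, v) and its SAD
```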

The Three Steps Search Technique (TSST), in contrast, is not an exhaustive search. The step size is initially chosen to be half of the search area (Figure 13b). Nine candidate points, namely the centre point and eight checking points on the boundary of the search area, are selected in each step. In the second step, the search centre moves to the matching point with the minimum SAD of the previous step, and the step size is reduced by half. The last step stops the search with a step size of one pixel, and the optimal motion vector (MV) with the minimum SAD is obtained.
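Under the same assumptions, and reusing the sad() helper from the previous sketch, the three-step search can be written as follows; note how the number of SAD evaluations drops from (1+2d)² to roughly nine per round over about log₂(d) rounds.

```python
import numpy as np

def three_step_search(cur, ref, x, y, block=32, d=16):
    """Three Steps Search: nine candidates per round, halving the step each time."""
    target = cur[y:y + block, x:x + block]
    cx = cy = 0                              # current search centre, as an offset
    step = max(d // 2, 1)
    best_cost = np.inf
    while step >= 1:
        best_u, best_v = 0, 0
        for v in (-step, 0, step):           # centre plus eight boundary points
            for u in (-step, 0, step):
                yy, xx = y + cy + v, x + cx + u
                if 0 <= yy and 0 <= xx and yy + block <= ref.shape[0] and xx + block <= ref.shape[1]:
                    cost = sad(target, ref[yy:yy + block, xx:xx + block])
                    if cost < best_cost:
                        best_cost, best_u, best_v = cost, u, v
        cx, cy = cx + best_u, cy + best_v    # move the centre to the best match
        step //= 2                           # last round uses a one-pixel step
    return cx, cy, best_cost
```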

Hierarchical methods such as the Multi-scale Search Technique are based on building a pyramidal representation of each frame. This representation computes an image where each pixel's value is the mean of itself and its neighbourhood; after that, the image is subsampled to half its resolution, as shown in Figure 13c. This model is implemented using the Altera DE2 board and the Cyclone II FPGA, as shown in Figures 11b and 12b (González *et al*., 2011), balancing the code between the embedded microprocessor and the acceleration system, which uses the Avalon bus, thanks to the so-called "C to hardware compiler" from Altera (Altera, 2011).
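The pyramid construction step can be sketched as follows: each level averages every pixel with its 3×3 neighbourhood and then subsamples by two (the neighbourhood size is an assumption; the text only says "its neighbourhood").

```python
import numpy as np

def build_pyramid(frame, levels=3):
    """Pyramidal representation: 3x3 mean filter, then subsample by 2, per level."""
    pyramid = [frame.astype(float)]
    for _ in range(levels - 1):
        f = pyramid[-1]
        padded = np.pad(f, 1, mode="edge")
        # mean of each pixel with its 8 neighbours (3x3 box filter)
        smoothed = sum(padded[dy:dy + f.shape[0], dx:dx + f.shape[1]]
                       for dy in range(3) for dx in range(3)) / 9.0
        pyramid.append(smoothed[::2, ::2])   # half resolution for the next level
    return pyramid
```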


Fig. 13. (13a, upper row) Full Search Technique. (13b, middle row) Three Steps Search Technique. (13c, lower row) Multiscale Search Technique.

**4. Computational resources and throughput of the real-time systems**

In this section, we analyse each of these real-time motion estimation systems with regard to the computational resources needed and the throughput obtained. The resources used in the implementations described in 3.1 and 3.2 are shown in Tables 1-3 as Slice and Block RAM percentages. MC is the maximum delay, in number of clock cycles, needed by each module implemented. Finally, the Throughput column provides the kilopixels per second (Kpps) of each implementation.

Table 1 shows this set of parameters for the McGM optical flow implementation alone (case 3.1), whereas Table 2 deals with the multimodal sensor treated in 3.2. For this last case, the implementation of each of the orthogonal variant moments is reported separately (Table 2), as well as the whole implementation with all modules working together (Table 3).

Table 4 concerns the implementation explained in 3.3. The FPGA used is an Altera Cyclone II EP2C35F672C6 with an embedded NIOS II core processor on a DE2 platform.
