**1.1.10 True motion estimation**

For video compression applications it is enough to find the motion vector corresponding to the best match, since this results in lower residual energy and better compression. However, for video processing applications, especially scan-rate conversion, true motion estimation is desired: the motion vectors should represent the true motion of objects in the video sequence rather than merely providing the best block match. Hence, it is important to achieve a consistent motion vector field rather than the best possible match. True motion estimation can be achieved both by post-processing the motion vectors to obtain a smooth motion vector field and by building consistency measures into the motion estimation algorithm itself. Three-Dimensional Recursive Search (3DRS), shown in Fig. 5, is one such algorithm, where the consistency assumption is built into the motion estimation.

The algorithm works on two important assumptions: objects are larger than the block size, and objects have inertia. The first assumption suggests that the motion vectors of neighboring blocks can be used as candidates for the current block. However, for neighboring blocks ahead in the raster scan, no motion vectors have been calculated yet. Here the second assumption is applied, and motion vectors from the previous frame are used for these blocks. The 3DRS motion estimator's candidate set consists only of spatial and temporal neighboring motion vectors, which results in a very consistent motion vector field representing true motion. To kick-start the algorithm, a random motion vector is also used as a candidate, as illustrated in Fig. 5.
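As a concrete illustration, the candidate-selection idea can be sketched as follows. This is a minimal sketch only: the block size, the use of SAD as the match cost, and the small random update range are illustrative assumptions, not the exact candidate set of any particular 3DRS implementation.

```python
import random

def sad(cur, ref, bx, by, mv, bs):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by candidate motion vector mv."""
    dx, dy = mv
    total = 0
    for y in range(by, by + bs):
        for x in range(bx, bx + bs):
            ry = min(max(y + dy, 0), len(ref) - 1)       # clamp at frame border
            rx = min(max(x + dx, 0), len(ref[0]) - 1)
            total += abs(cur[y][x] - ref[ry][rx])
    return total

def threedrs_block(cur, ref, bx, by, bs, spatial_mvs, temporal_mvs, rng):
    """Pick the best candidate for one block from spatial neighbours
    (already estimated in this frame), temporal neighbours (taken from
    the previous frame's vector field) and one small random update."""
    update = (rng.choice([-1, 0, 1]), rng.choice([-1, 0, 1]))
    candidates = spatial_mvs + temporal_mvs + [update]
    return min(candidates, key=lambda mv: sad(cur, ref, bx, by, mv, bs))
```

Because every candidate comes from a neighbouring block (plus one small perturbation), the resulting vector field stays spatially and temporally consistent, which is exactly the property true motion estimation needs.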

$$G\_y = p\_1 x + q\_1 \tag{7}$$

However, a combination of all the parameters is usually present. Global motion estimation involves calculating the four parameters of the model (p0, p1, q0, q1). The parameters can be found by treating them as four unknowns; hence, ideally, sampled motion vectors at four different locations can be used to calculate the four unknown parameters. In practice, though, more processing is needed to obtain a good estimate of the parameters. Note also that local motion estimation, at least at four locations, is still essential to calculate the global motion parameters, although there are algorithms for global motion estimation that do not rely on local motion estimation. The above parametric model with four parameters cannot fit rotational global motion; for rotational motion a six-parameter model is needed. However, the same four-parameter model concepts can be extended to the six-parameter model.

### **1.2 Distortion metrics role of estimation criteria**

The algorithms/techniques discussed above need to be incorporated into an estimation criterion that will subsequently be optimized in order to obtain the prediction error (Young et al. 2009) or the residual energy of the video frames. There is no unique criterion as such for motion estimation, because the choice depends on the task/application at hand. For example, in compression the average performance (prediction error) of a motion estimator is important, whereas in motion-compensated interpolation (Philip 2009) the worst-case performance (maximum interpolation error) may be of concern. Moreover, the selection of a criterion may be guided by the capabilities of the processor on which the motion estimation will be implemented. The difficulty in establishing a good criterion is primarily caused by the fact that motion in images is not directly observable, and that a particular dynamics of intensity in an image sequence may be induced by more than one motion.
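As an illustration of how the four parameters might be recovered from sampled local motion vectors, the following sketch assumes the model gx = p0·x + q0, gy = p1·y + q1. The pairing of parameters with coordinates is an assumption here, since only equation (7) survives in the text; the sketch simply solves the two independent two-unknown linear systems directly.

```python
def fit_global_motion(s0, s1):
    """Fit the assumed four-parameter model
        gx = p0 * x + q0,   gy = p1 * y + q1
    from two local motion-vector samples (x, y, gx, gy) taken at
    distinct x positions and distinct y positions."""
    x0, y0, gx0, gy0 = s0
    x1, y1, gx1, gy1 = s1
    p0 = (gx1 - gx0) / (x1 - x0)   # slope of gx against x
    q0 = gx0 - p0 * x0             # intercept
    p1 = (gy1 - gy0) / (y1 - y0)   # slope of gy against y
    q1 = gy0 - p1 * y0             # intercept
    return p0, p1, q0, q1
```

In practice one would sample many more than the minimum number of vectors and fit the parameters by least squares, since individual local motion vectors are noisy.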


Motion estimation therefore aims to find a 'match' to the current block or region that minimizes the energy in the motion-compensated residual (the difference between the current block and the reference area). An area in the reference frame centered on the current macroblock position (the search area) (Iain Richardson 2010) is searched, and the 16 × 16 region within the search area that minimizes a matching criterion is chosen as the 'best match'. The choice of matching criterion is important, since the distortion measure used for the residual energy affects both the computational complexity and the accuracy of the motion estimation process. Therefore, all attempts to establish suitable criteria for motion estimation require further implicit or explicit modeling of the image sequence. If all matching criteria resulted in compressed video of the same quality then, of course, the least complex of these would always be used for block matching.

However, matching criteria (IEG Richardson 2003) often differ on the choice of substitute for the target block, with consequent variation in the quality of the coded frame. The MSD, for example, requires many multiplications, whereas the MAD primarily uses additions. While multiplication might not have too great an impact on a software coder (Romuald 2006), a hardware coder using MSE could be significantly more expensive than a hardware implementation of the SAD/MAD function. Equations 8, 9 and 10 describe three energy measures: MSD, MAD and SAD. The motion compensation block size is *N* × *N* samples; Cur i,j and Ref i,j are the current and reference area samples respectively. Fig. 6 shows the current video frame in macroblock form.

Fig. 6. Macroblock view of the Frame
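The search procedure described above can be sketched as an exhaustive (full) search with SAD as the matching criterion. The frame size, block size and search range below are illustrative assumptions; a real encoder would typically use a fast search pattern rather than evaluating every offset.

```python
def best_match(cur, ref, bx, by, bs, search):
    """Full search: evaluate SAD for every offset within +/-search of the
    co-located position and return the offset with lowest residual energy."""
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # candidate block must lie entirely inside the reference frame
            if not (0 <= bx + dx and bx + dx + bs <= w and
                    0 <= by + dy and by + dy + bs <= h):
                continue
            cost = sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
                       for y in range(bs) for x in range(bs))
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost
```

Note that the matching criterion is evaluated once per candidate offset, which is why its cost dominates the overall complexity of block matching.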

#### **1.2.1 Mean squared difference**

$$MSD = \frac{1}{N^2} \sum\_{i=0}^{N-1} \sum\_{j=0}^{N-1} \left( \mathbf{Cur}\_{i,j} - \mathbf{Ref}\_{i,j} \right)^2 \tag{8}$$

H.264 Motion Estimation and Applications 67


MSD is also called the Mean Squared Error (MSE). It indicates the amount of difference between two macroblocks: in practice, the lower the MSD value, the better the match.

Fig. 7. MSD Map

#### **1.2.2 Mean absolute difference**

The lower the MAD, the better the match, so the candidate block with the minimum MAD should be chosen. The function is also called the Mean Absolute Error (MAE).

$$MAD = \frac{1}{N^2} \sum\_{i=0}^{N-1} \sum\_{j=0}^{N-1} \left| \mathbf{Cur}\_{i,j} - \mathbf{Ref}\_{i,j} \right| \tag{9}$$

Fig. 8. MAD Map

#### **1.2.3 Sum of absolute difference**

$$SAD = \sum\_{i=0}^{N-1} \sum\_{j=0}^{N-1} \left| \mathbf{Cur}\_{i,j} - \mathbf{Ref}\_{i,j} \right| \tag{10}$$

SAD is commonly used as the error estimate to identify the most similar block when obtaining the block motion vector in the process of motion estimation; it requires only simple calculations, as in Fig. 9, without the need for multiplication.

Fig. 9. SAD Map & its PSNR sketch
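The three measures in Equations 8-10 can be written directly from their definitions. This is a pure-Python sketch for a square block; a real coder would use SIMD instructions or dedicated hardware.

```python
def msd(cur, ref):
    """Mean squared difference (also called MSE) over an N x N block."""
    n = len(cur)
    return sum((cur[i][j] - ref[i][j]) ** 2
               for i in range(n) for j in range(n)) / n ** 2

def mad(cur, ref):
    """Mean absolute difference (also called MAE)."""
    n = len(cur)
    return sum(abs(cur[i][j] - ref[i][j])
               for i in range(n) for j in range(n)) / n ** 2

def sad(cur, ref):
    """Sum of absolute differences: MAD without the 1/N^2 scaling, so it
    ranks candidate blocks identically while avoiding the division."""
    n = len(cur)
    return sum(abs(cur[i][j] - ref[i][j])
               for i in range(n) for j in range(n))
```

Since SAD and MAD differ only by a constant factor for a fixed block size, minimizing one minimizes the other; SAD is preferred in practice because it needs neither multiplication nor division.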



SAD (Young et al. 2009) is an extremely fast metric due to its simplicity; it is effectively the simplest possible metric that takes into account every pixel in a block. It is therefore very effective for a wide motion search over many different blocks.

SAD is also easily parallelizable, since it analyzes each pixel separately, making it straightforward to implement in both hardware and software coders. Once candidate blocks are found, the final refinement of the motion estimation process is often done with other, slower but more accurate metrics that better take into account human perception. These include the sum of absolute transformed differences (SATD), the sum of squared differences (SSD), and rate-distortion optimization (RDO).

The usual coding techniques lower compression efficiency for moving objects within a video scene, as they only consider pixels at the same position in the video frames. Motion estimation with SAD as the distortion metric is used to capture such movements more accurately for better compression efficiency. For example, in video surveillance with moving cameras, a popular way to handle translation on images using template matching is to compare pixel intensities with the SAD measure. Motion estimation on a video sequence using SAD compares the current video frame with a previous (target) frame pixel by pixel, summing the absolute values of the differences of each pair of corresponding pixels. The result is a positive number that is used as the score. SAD reacts very sensitively to even minor changes within a scene.

SAD is probably the most widely used measure of residual energy for reasons of computational simplicity. The H.264 reference model software [5] uses SA(T)D, the sum of absolute differences of the *transformed* residual data, as its prediction energy measure (for both Intra and Inter prediction). Transforming the residual at each search location increases computation but improves the accuracy of the energy measure; since a simple multiply-free transform is used, the extra computational cost is not excessive. The results of the example in Fig. 6 indicate that the best choice of motion vector is (+2, 0). The minimum of the MSD or SAD map indicates the offset that produces the minimal residual energy, and this is likely to produce the smallest energy of quantized transform coefficients.
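A sketch of the SA(T)D idea for a single 4×4 block, using the order-4 Hadamard matrix as the multiply-free transform. Normalisation conventions (e.g. a final division by 2) vary between implementations and are omitted here; plain matrix multiplication is used for clarity, whereas a real implementation would use only additions and subtractions (a butterfly structure).

```python
# Order-4 Hadamard matrix (entries +/-1 only, hence multiply-free in HW).
H4 = [[1,  1,  1,  1],
      [1,  1, -1, -1],
      [1, -1, -1,  1],
      [1, -1,  1, -1]]

def _matmul(a, b):
    """4x4 matrix product (H4 is symmetric, so H4 == H4 transposed)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd4x4(cur, ref):
    """SATD for one 4x4 block: Hadamard-transform the residual
    (H * D * H^T) and sum the magnitudes of the coefficients."""
    d = [[cur[i][j] - ref[i][j] for j in range(4)] for i in range(4)]
    t = _matmul(_matmul(H4, d), H4)
    return sum(abs(t[i][j]) for i in range(4) for j in range(4))
```

Because the transform concentrates structured residual energy into few coefficients, SATD correlates better with the post-transform coding cost than plain SAD does, at a modest extra computational price.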

#### **1.2.4 Rate distortion optimization**

These distortion metrics often play a pivotal role in deciding the quality of the viewed video when choosing the method of Rate-Distortion Optimization (RDO) (Iain Richardson 2010), a technique for choosing the coding mode of a macroblock based on the rate and the distortion cost. Formulating this, the bitrate R and distortion cost D are combined into a single cost J, given by

$$\mathbf{J} = \mathbf{D} + \lambda \,\, \mathbf{R} \tag{11}$$

The bit cost is weighted by the Lagrangian λ, a value representing the relationship between bit cost and quality at a particular quality level. The deviation from the source is usually measured in terms of distortion metrics, in order to maximize the PSNR video quality metric.


The RDO mode selection algorithm attempts to find a mode that minimizes the joint cost J. The trade-off between Rate and Distortion is controlled by the Lagrange multiplier λ (Alan Bovik 2009). A smaller λ will give more emphasis to minimizing D, allowing a higher rate, whereas a larger λ will tend to minimize R at the expense of a higher distortion. Selecting the best λ for a particular sequence is a highly complex problem. Fortunately, empirical approximations have been developed that provide an effective choice of λ in a practical mode selection scenario.

Good results can be obtained by calculating λ as a function of QP.

$$
\lambda = 0.85 \times 2^{(QP-12)/3} \tag{12}
$$

Distortion (D) is calculated as the Sum of Squared Differences (SSD):

$$D\_{SSD} = \sum\_{i=0}^{N-1} \sum\_{j=0}^{N-1} \left( \mathbf{Cur}\_{i,j} - \mathbf{Ref}\_{i,j} \right)^2$$

where i, j are the sample positions in a block, Cur(i,j) are the original sample values and Ref(i,j) are the decoded sample values at each sample position. Other distortion metrics, such as the Sum of Absolute Differences (SAD), Mean Absolute Difference (MAD) or Mean Squared Error (MSE), may be used in processes such as selecting the best motion vector for a block [iv]. A different distortion metric typically requires a different λ calculation, and indeed will have an impact on the computation process taken as a whole.

A typical mode selection algorithm might proceed as follows:

- For every macroblock
  - For every available coding mode m
    - Code the macroblock using mode m and calculate R, the number of bits required to code the macroblock
    - Reconstruct the macroblock and calculate D, the distortion between the original and decoded macroblock
    - Calculate the mode cost Jm using (11), with an appropriate choice of λ
  - Choose the mode that gives the minimum Jm
This is clearly a computationally intensive process, since there are hundreds of possible mode combinations, and it is therefore necessary to code the macroblock hundreds of times to find the 'best' mode in a rate-distortion sense.
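The mode-decision loop can be sketched as follows. The candidate mode names and their (D, R) values are illustrative; the λ formula is the empirical QP mapping of equation (12).

```python
def rdo_lambda(qp):
    """Empirical Lagrange multiplier as a function of QP,
    lambda = 0.85 * 2^((QP - 12) / 3)."""
    return 0.85 * 2 ** ((qp - 12) / 3)

def choose_mode(candidates, qp):
    """candidates: list of (mode_name, D, R) triples, where D is the SSD
    between the original and reconstructed macroblock and R is the bit
    cost of coding the macroblock in that mode.
    Returns the mode name minimising J = D + lambda * R."""
    lam = rdo_lambda(qp)
    return min(candidates, key=lambda m: m[1] + lam * m[2])[0]
```

Note how the choice shifts with QP: at low QP (small λ) an expensive-but-accurate mode wins, while at high QP (large λ) a cheap mode such as skip is preferred, exactly the trade-off described above.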

#### **1.2.5 Conclusions and results**

Thus a *matching criterion*, or *distortion function*, is used to quantify the similarity between the target block and candidate blocks. If, due to a large search area, many candidate blocks are considered, then the matching criterion will be evaluated many times, so its choice has an impact on the success of the compression. If the matching criterion is slow, for example, then the block matching will be slow. If the matching criterion results in bad matches, then the quality of the compression will be adversely affected. Fortunately, a number of matching criteria are suitable for use in video compression. Although the number of matching criteria evaluated by block matching algorithms is largely independent of the sequence coded, the success of the algorithms is heavily dependent on the sequence coded.


It can be observed that motion estimation (integer-pel motion estimation, fractional-pel motion estimation and fractional-pel interpolation in the table) takes up more than 95 percent of the computation in the whole encoder, which is a common characteristic of all video encoders. The total required computing power for an H.264 encoder is more than 300 giga-instructions per second (GIPS), which cannot be achieved by existing processors. To address this problem, several approaches have been adopted, such as new low-complexity ME algorithms (Yu Wen 2006), dedicated hardware (HW) structures and, more recently, multi-processor solutions.

Nevertheless, the numerous data dependencies imposed by this video standard make it very difficult to take efficient advantage of the several possible parallelization strategies. Until recently, most parallelization efforts around the H.264 standard (Florian et al. 2010) have focused on the decoder implementation [2]. For the more challenging and rewarding goal of parallelizing the encoder, a significant part of the effort has been devoted to the design of specialized and dedicated systems [7, 6]. Most of these approaches are based on parallel or pipeline topologies, using dedicated HW structures to implement several parts of the encoder. When only pure software (SW) approaches are considered, fewer parallel solutions have been proposed; most of them exploit the data independence between Groups-of-Pictures (GOPs) or slices. For such a video encoder, it is probably necessary to use some kind of parallel programming approach to share the encoding workload and balance it among the concurrent processors.

#### **1.3.1 Parallelism in H.264**

The primary aim of this section is to provide a deeper understanding of the scalability of parallelism in H.264. Several analyses and parallel optimizations of H.264/AVC encoders have been presented [3, 4, 8]. Due to the encoder's nature, many of these parallelization approaches exploit concurrent execution at frame level, slice level and macroblock level. The H.264 codec can be parallelized by either task-level or data-level decomposition; the two approaches are sketched in Fig. 10. In task-level decomposition, individual tasks of the H.264 codec are assigned to processors, while in data-level decomposition different portions of data are assigned to processors running the same program.

#### **1.3.2 Task-level decomposition**

In task-level decomposition, the functional partitions of the algorithm are assigned to different processors. As shown in Fig. 10, decoding H.264 consists of performing a series of operations on the coded input bitstream, some of which can be done in parallel. For example, Inverse Quantization (IQ) and the Inverse Transform (IDCT) can be done in parallel with the Motion Compensation (MC) stage. In Fig. 10a the tasks are mapped to a 4-processor system: a control processor is in charge of synchronization and parsing the bitstream, one processor handles Entropy Decoding, IQ and IDCT, another the prediction stage (MC or IntraP), and a third is responsible for the deblocking filter.

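A minimal software sketch of such a task-level pipeline, with one thread per stage connected by FIFO queues. The stage bodies are illustrative stand-ins (simple string tags), not real decoding operations, and the three stages mirror the mapping described above: entropy decoding plus IQ/IDCT, prediction (MC or IntraP), and the deblocking filter.

```python
import threading
import queue

def stage(fn, inbox, outbox):
    """Run one pipeline stage: pull a work item, process it, pass it on.
    A None item is the shutdown signal and is forwarded downstream."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            return
        outbox.put(fn(item))

# Stand-ins for entropy decoding + IQ/IDCT, prediction, and deblocking.
def entropy_iq_idct(mb): return mb + "|idct"
def predict(mb):         return mb + "|pred"
def deblock(mb):         return mb + "|deblk"

def run_pipeline(macroblocks):
    """Push macroblocks through the 3-stage pipeline and collect output."""
    q1, q2, q3, q4 = (queue.Queue() for _ in range(4))
    stages = [(entropy_iq_idct, q1, q2), (predict, q2, q3), (deblock, q3, q4)]
    threads = [threading.Thread(target=stage, args=s) for s in stages]
    for t in threads:
        t.start()
    for mb in macroblocks:
        q1.put(mb)
    q1.put(None)                       # signal end of stream
    out = []
    while (item := q4.get()) is not None:
        out.append(item)
    for t in threads:
        t.join()
    return out
```

Once the pipeline is full, the three stages work on different macroblocks concurrently; throughput is then limited by the slowest stage, which is why balancing the functional partitions matters in task-level decomposition.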

