**1.3 Scalability of parallelism in H.264 video compression**

The H.264/AVC standard provides several profiles that define the applied encoding techniques, each targeting a specific class of applications. For each profile, several levels are also defined, specifying upper bounds for the bit stream or lower bounds for the decoder capabilities, such as the processing rate, the memory size for multipicture buffers, the video rate, and the motion vector range (Alois 2009). As a result, H.264/AVC significantly improves the compression performance relative to all previous video coding standards [1]. To achieve this encoding performance, the standard incorporates a set of new and powerful techniques: a 4×4 integer transform, inter-prediction with variable block sizes, quarter-pixel motion estimation (ME), an in-loop deblocking filter, improved entropy coding based on Context-Adaptive Variable-Length Coding (CAVLC) or Context-Adaptive Binary Arithmetic Coding (CABAC), new intra-prediction modes, etc. Moreover, the adoption of bi-predictive frames (B-frames), along with the previous features, provides a considerable bit-rate reduction with negligible quality losses.
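To illustrate one of these tools, the sketch below computes the 4×4 forward core transform, Y = Cf · X · Cf^T, with the integer matrix Cf defined by the standard; the scaling that H.264 folds into the quantization stage is omitted, and the function name and sample data are our illustrative choices:

```cpp
#include <array>
#include <cstdio>

// 4x4 forward core transform of H.264 (the scaling normally folded into
// quantization is omitted). Computes Y = Cf * X * Cf^T with the integer
// matrix Cf; all arithmetic is exact in integers.
using Block4x4 = std::array<std::array<int, 4>, 4>;

Block4x4 forwardTransform4x4(const Block4x4& x) {
    static const int Cf[4][4] = {
        { 1,  1,  1,  1},
        { 2,  1, -1, -2},
        { 1, -1, -1,  1},
        { 1, -2,  2, -1}
    };
    Block4x4 t{}, y{};
    // t = Cf * x
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                t[i][j] += Cf[i][k] * x[k][j];
    // y = t * Cf^T
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                y[i][j] += t[i][k] * Cf[j][k];
    return y;
}

int main() {
    Block4x4 residual{{{5, 11, 8, 10},
                       {9, 8, 4, 12},
                       {1, 10, 11, 4},
                       {19, 6, 15, 7}}};
    Block4x4 coeffs = forwardTransform4x4(residual);
    for (const auto& row : coeffs) {
        for (int v : row) std::printf("%5d ", v);
        std::printf("\n");
    }
}
```

Because Cf contains only the values ±1 and ±2, the transform can be computed with additions and shifts alone, and its exact integer arithmetic avoids the encoder/decoder mismatch of the floating-point DCT used in earlier standards.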

For instance, an instruction-level profile obtained with the Intel VTune software on a Pentium IV 3 GHz CPU, with arithmetic, control, and data-transfer instructions counted separately, shows that a main-profile H.264/AVC encoding solution for SD video would require about 1,600 billion operations per second. Table 1 illustrates a typical profile of the H.264/AVC encoder complexity on the Pentium IV general-purpose processor architecture. Notice that the motion estimation, macroblock/block processing (including mode decision), and motion compensation modules together take up most of the cycles (78% of the operations) and account for the highest resource usage.


Table 1. Instruction profiling in Baseline Profile H.264

If the matching criterion results in bad matches, then the quality of the compression will be adversely affected. Fortunately, a number of matching criteria are suitable for use in video compression. Although the number of matching-criterion evaluations performed by block matching algorithms is largely independent of the sequence being coded, the success of the algorithms is heavily dependent on the sequence coded.
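As a concrete example, the sketch below computes the sum of absolute differences (SAD), the matching criterion most commonly used for integer-pel block matching; the function name and the fixed 16×16 block size are our illustrative choices:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Sum of Absolute Differences (SAD) over a 16x16 block: a lower SAD
// means a better match. 'stride' is the frame width in samples.
int sad16x16(const uint8_t* cur, const uint8_t* ref, int stride) {
    int sad = 0;
    for (int y = 0; y < 16; ++y, cur += stride, ref += stride)
        for (int x = 0; x < 16; ++x)
            sad += std::abs(cur[x] - ref[x]);
    return sad;
}

int main() {
    uint8_t cur[16 * 16], ref[16 * 16];
    for (int i = 0; i < 16 * 16; ++i) {
        cur[i] = 100;
        ref[i] = 103;   // every sample differs by 3
    }
    std::printf("SAD = %d\n", sad16x16(cur, ref, 16));  // prints 768
}
```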


It can be observed that motion estimation, which in the table includes integer-pel motion estimation, fractional-pel motion estimation, and fractional-pel interpolation, takes up more than 95 percent of the computation of the whole encoder, a characteristic common to all video encoders. The total computing power required for an H.264 encoder is more than 300 giga instructions per second (GIPS), which cannot be achieved by existing processors. To address this problem, several approaches have been adopted: new low-complexity ME algorithms have been studied and developed (Yu Wen 2006), dedicated hardware (HW) structures have been designed and, more recently, multi-processor solutions have been employed.
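To convey how such low-complexity ME algorithms cut this cost, the following sketch implements a simplified diamond-style integer-pel search; it is not the specific algorithm of (Yu Wen 2006), only an illustration of replacing the exhaustive window scan with a short, adaptive walk of candidate positions:

```cpp
#include <cstdint>
#include <cstdlib>
#include <climits>

struct MotionVector { int x, y; };

// SAD over a 16x16 block; 'stride' is the frame width in samples.
static int sad16x16(const uint8_t* cur, const uint8_t* ref, int stride) {
    int sad = 0;
    for (int y = 0; y < 16; ++y, cur += stride, ref += stride)
        for (int x = 0; x < 16; ++x)
            sad += std::abs(cur[x] - ref[x]);
    return sad;
}

// Simplified diamond-style integer-pel search: instead of testing every
// position in the search window, walk a small diamond of candidates and
// stop when the centre is the best, cutting the number of SAD
// evaluations from O(window^2) to a handful per macroblock.
MotionVector diamondSearch(const uint8_t* cur, const uint8_t* ref,
                           int stride, int width, int height,
                           int mbX, int mbY, int range) {
    MotionVector best{0, 0};
    const uint8_t* curBlk = cur + mbY * stride + mbX;
    auto cost = [&](int dx, int dy) {
        int rx = mbX + dx, ry = mbY + dy;
        if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height)
            return INT_MAX;                    // candidate leaves the frame
        return sad16x16(curBlk, ref + ry * stride + rx, stride);
    };
    int bestCost = cost(0, 0);
    for (;;) {
        static const int step[4][2] = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        MotionVector next = best;
        for (const auto& s : step) {
            int dx = best.x + s[0], dy = best.y + s[1];
            if (std::abs(dx) > range || std::abs(dy) > range) continue;
            int c = cost(dx, dy);
            if (c < bestCost) { bestCost = c; next = {dx, dy}; }
        }
        if (next.x == best.x && next.y == best.y) break;  // centre is best
        best = next;
    }
    return best;
}
```

Whereas a full search over a ±16 window evaluates 33 × 33 = 1,089 candidates per macroblock, a search of this kind typically needs only a few dozen SAD evaluations, at the cost of possibly settling on a local minimum.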

Nevertheless, the numerous data dependencies imposed by this video standard make it very difficult to efficiently exploit the several possible parallelization strategies that may be applied. Until recently, most parallelization efforts around the H.264 standard (Florian et al. 2010) were focused mainly on the decoder implementation [2]. Concerning the more challenging and rewarding goal of parallelizing the encoder, a significant part of the effort has been devoted to the design of specialized and dedicated systems [6, 7]. Most of these approaches are based on parallel or pipelined topologies, using dedicated HW structures to implement several parts of the encoder. When only pure software (SW) approaches are considered, fewer parallel solutions have been proposed. Most of them are based on the exploitation of the data independence between Groups of Pictures (GOPs) or slices. For such a video encoder, it is probably necessary to use some kind of parallel programming approach to reduce the encoding application execution time and to balance the workload among the concurrent processors, as sketched below.
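A minimal sketch of this GOP-level software parallelization follows; encodeGop is a hypothetical placeholder for a real per-GOP encoding routine, and the static round-robin assignment is only illustrative, since proper load balancing requires a dynamic work queue, as noted above:

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for encoding one Group of Pictures (GOP).
// Each GOP starts with an intra picture, so no reference data crosses
// GOP boundaries and every GOP can be encoded independently.
void encodeGop(int gopIndex) {
    std::printf("worker encoding GOP %d\n", gopIndex);
}

int main() {
    const int numGops = 8;
    const int numWorkers = static_cast<int>(
        std::max(1u, std::thread::hardware_concurrency()));

    // Static round-robin assignment of GOPs to workers; a real encoder
    // would use a dynamic work queue, because GOPs with more motion take
    // longer to encode and would otherwise unbalance the load.
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w)
        workers.emplace_back([w, numWorkers, numGops] {
            for (int g = w; g < numGops; g += numWorkers)
                encodeGop(g);
        });
    for (auto& t : workers) t.join();
}
```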
