**1.3.8 Macroblock-level parallelism in the spatial domain**

Usually, MBs in a slice are processed in scan order, which means starting from the top left corner of the frame and moving to the right, row after row. To exploit parallelism between MBs inside a frame, it is necessary to take into account the dependencies between them. In H.264, motion vector prediction, intra prediction, and the deblocking filter use data from neighboring MBs, defining a structured set of dependencies. These dependencies are shown in Fig. 14.


Fig. 14. 2D-Wave approach for exploiting MB parallelism in the spatial domain. The *arrows* indicate dependencies.

MBs can be processed out of scan order provided these dependencies are satisfied. Processing MBs in a diagonal wavefront manner satisfies all the dependencies and at the same time allows parallelism between MBs to be exploited. We refer to this parallelization technique as 2D-Wave.
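
The wavefront order can be sketched in a few lines of Python. The sketch assumes that each MB depends on its left, upper-left, upper and upper-right neighbors (the usual H.264 neighbor set, which the text above does not spell out) and that every MB takes exactly one time slot. The names `wave_slot` and `wavefront` are illustrative only and do not belong to any codec API.

```python
# Assumed neighbor set: left, upper-left, upper, upper-right (dx, dy).
NEIGHBORS = [(-1, 0), (-1, -1), (0, -1), (1, -1)]


def wave_slot(x, y):
    """Earliest 1-based time slot for MB (x, y), assuming unit decode time."""
    # The upper-right neighbor forces each row to lag two slots behind
    # the row above it, hence the factor 2 on y.
    return x + 2 * y + 1


def wavefront(width_mbs, height_mbs):
    """Group MB coordinates (x, y) by the time slot in which they can run."""
    slots = {}
    for y in range(height_mbs):
        for x in range(width_mbs):
            slots.setdefault(wave_slot(x, y), []).append((x, y))
    return slots


slots = wavefront(5, 5)
print(slots[7])  # [(4, 1), (2, 2), (0, 3)]
```

For a 5×5 MB image, slot T7 contains MB (4,1), MB (2,2) and MB (0,3), and every neighbor an MB depends on falls into an earlier slot.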

Fig. 14 depicts an example for an image of 5×5 MBs (80×80 pixels). At time slot T7 three independent MBs can be processed: MB (4,1), MB (2,2) and MB (0,3). The figure also shows the dependencies that need to be satisfied in order to process each of these MBs. The number of independent MBs in each frame depends on the resolution. For a low resolution like QCIF there are only 6 independent MBs during 4 time slots. For Full High Definition (FHD, 1920×1080) there are 60 independent MBs during 9 time slots. Fig. 15 depicts the available MB parallelism over time for an FHD resolution frame, assuming that the time to decode an MB is constant.
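
How the available parallelism grows with resolution can be checked with a small calculation. The sketch below is a minimal model rather than the method used to produce Fig. 15: it assumes 16×16 MBs, a frame height rounded up to whole MBs, constant decode time, and the wavefront rule that MB (x, y) runs in slot x + 2y. The function name `parallelism_profile` is hypothetical.

```python
def parallelism_profile(width_px, height_px, mb_size=16):
    """Number of independent MBs in each wavefront time slot of one frame."""
    w = -(-width_px // mb_size)   # frame width in MBs, rounded up
    h = -(-height_px // mb_size)  # frame height in MBs, rounded up
    profile = [0] * (w + 2 * (h - 1))
    for y in range(h):
        for x in range(w):
            profile[x + 2 * y] += 1  # MB (x, y) runs in 0-based slot x + 2y
    return profile


qcif = parallelism_profile(176, 144)
print(max(qcif), qcif.count(max(qcif)))  # 6 4

fhd = parallelism_profile(1920, 1080)
print(max(fhd))  # 60
```

Under these assumptions the peak is 6 MBs for QCIF, reached in 4 slots, and 60 MBs for FHD. The profile ramps up at the start of the frame and down at the end.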

Fig. 15. MB parallelism for a single FHD frame using the 2D-Wave approach.

MB-level parallelism in the spatial domain has several advantages over other schemes for parallelizing H.264. First, the scheme scales well: as shown above, the number of independent MBs increases with the resolution of the image. Second, good load balancing is achievable if a dynamic scheduling system is used. Because the time to decode an MB is not constant and depends on the data being processed, a dynamic scheduler can balance the load by assigning an MB to a processor as soon as all of its dependencies have been satisfied. Additionally, because in MB-level parallelization all processors/threads run the same program, the same set of software optimizations (for exploiting ILP and SIMD) can be applied to all processing elements.

However, this kind of MB-level parallelism also has disadvantages. The first is that the number of independent MBs fluctuates, which causes underutilization of cores and a lower total processing rate. The second is that entropy decoding cannot be parallelized at the MB level: MBs of the same slice have to be entropy decoded sequentially. If entropy decoding is accelerated with specialized hardware, MB-level parallelism could still provide benefits.
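
The dynamic scheduling idea can be illustrated with dependency counters: each MB records how many of its neighbors are still undecoded and enters a ready queue when that count reaches zero. The following is a simplified, single-threaded simulation and not an actual decoder. It assumes the left, upper-left, upper and upper-right neighbor set and processes MBs in rounds of equal length. `simulate` and `n_workers` are hypothetical names.

```python
from collections import deque

# Assumed dependencies of an MB: left, upper-left, upper, upper-right (dx, dy).
NEIGHBOR_OFFSETS = [(-1, 0), (-1, -1), (0, -1), (1, -1)]


def simulate(width_mbs, height_mbs, n_workers):
    """Return the number of rounds needed to process one frame."""
    def in_frame(x, y):
        return 0 <= x < width_mbs and 0 <= y < height_mbs

    # pending[mb] = number of dependencies that are not yet decoded.
    pending = {}
    for y in range(height_mbs):
        for x in range(width_mbs):
            pending[(x, y)] = sum(
                in_frame(x + dx, y + dy) for dx, dy in NEIGHBOR_OFFSETS
            )

    ready = deque(mb for mb, count in pending.items() if count == 0)
    rounds = 0
    while ready:
        # Each worker takes one ready MB per round.
        batch = [ready.popleft() for _ in range(min(n_workers, len(ready)))]
        for x, y in batch:
            for dx, dy in NEIGHBOR_OFFSETS:
                dependent = (x - dx, y - dy)  # an MB that waits for (x, y)
                if dependent in pending:
                    pending[dependent] -= 1
                    if pending[dependent] == 0:
                        ready.append(dependent)
        rounds += 1
    return rounds


print(simulate(5, 5, n_workers=1))   # 25
print(simulate(5, 5, n_workers=25))  # 13
```

With one worker the 5×5 example takes 25 rounds, and with enough workers it takes 13, the length of the wavefront. In a real decoder the rounds would be replaced by worker threads that pull from the ready queue whenever they finish an MB, which is what lets the scheduler absorb varying decode times.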

**1.3.9 Macroblock-level parallelism in the temporal domain**

In the decoding process, the dependency between frames arises in the motion compensation (MC) module only. MC can be regarded as copying an area, called the reference area, from the reference frame and then adding this predicted area to the residual MB to reconstruct the MB in the current frame. The reference area is pointed to by a motion vector (MV). Although the standard limits the MV length to 512 pixels vertically and 2048 pixels horizontally, in practice MVs are within the range of dozens of pixels.
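
The copy-and-add view of MC can be written down directly. The sketch below is deliberately simplified: it assumes integer-pixel motion vectors (no sub-sample interpolation), a reference area that lies entirely inside the reference frame, and 8-bit samples stored as nested lists. `reconstruct_block` is an illustrative name, not a function of any real decoder.

```python
def reconstruct_block(ref_frame, residual, mb_x, mb_y, mv, size=16):
    """Reconstruct one MB as reference area + residual, clipped to 0..255."""
    mv_x, mv_y = mv
    x0 = mb_x * size + mv_x  # left edge of the reference area
    y0 = mb_y * size + mv_y  # top edge of the reference area
    if x0 < 0 or y0 < 0 or y0 + size > len(ref_frame) or x0 + size > len(ref_frame[0]):
        raise ValueError("reference area outside the frame (not handled here)")

    block = []
    for r in range(size):
        row = []
        for c in range(size):
            predicted = ref_frame[y0 + r][x0 + c]  # copy from the reference frame
            row.append(max(0, min(255, predicted + residual[r][c])))
        block.append(row)
    return block


# Tiny example with 2x2 "MBs" so the numbers stay readable.
ref = [[10 * r + c for c in range(4)] for r in range(4)]
res = [[1, -1], [300, -300]]
print(reconstruct_block(ref, res, 0, 0, mv=(1, 2), size=2))  # [[22, 21], [255, 0]]
```

The only data taken from the reference frame is the area addressed by the MV, which is why the inter-frame dependency is confined to MC.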

When the reference area has been decoded, it can be used by the referencing frame. Thus it is not necessary to wait until a frame is completely decoded before decoding the next frame. The decoding process of the next frame can start after the reference areas of the reference frames are decoded. Fig. 16 shows an example of two frames where the second depends on the first. MBs are decoded in scan order, one at a time. The figure shows that MB (2,0) of frame *i* + 1 depends on MB (2,1) of frame *i*, which has already been decoded. Thus this MB can be decoded even though frame *i* is not completely decoded.
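
The readiness test behind this example can be sketched as follows. The code assumes that the reference frame is decoded in scan order, that MVs are in integer pixels, and that reference areas reaching outside the frame are clamped to the frame border. The optional `margin` parameter stands in for extra pixels needed by interpolation. All names and the MV value (0, 16) are assumptions chosen to reproduce the MB (2,0) and MB (2,1) pair. They are not taken from Fig. 16.

```python
def last_reference_mb(mb_x, mb_y, mv, width_mbs, height_mbs, mb_size=16, margin=0):
    """Last MB (in scan order) of the reference frame covered by the reference area."""
    mv_x, mv_y = mv
    right = mb_x * mb_size + mv_x + mb_size - 1 + margin
    bottom = mb_y * mb_size + mv_y + mb_size - 1 + margin
    # Clamp to the frame so that out-of-frame references map to border MBs.
    ref_x = min(max(right // mb_size, 0), width_mbs - 1)
    ref_y = min(max(bottom // mb_size, 0), height_mbs - 1)
    return ref_x, ref_y


def can_decode(mb_x, mb_y, mv, decoded_count, width_mbs, height_mbs):
    """True if the reference frame has decoded far enough in scan order."""
    ref_x, ref_y = last_reference_mb(mb_x, mb_y, mv, width_mbs, height_mbs)
    return decoded_count > ref_y * width_mbs + ref_x


print(last_reference_mb(2, 0, (0, 16), 5, 5))            # (2, 1)
print(can_decode(2, 0, (0, 16), decoded_count=8, width_mbs=5, height_mbs=5))  # True
```

In a 5×5 MB frame, MB (2,1) is the eighth MB in scan order, so MB (2,0) of frame *i* + 1 becomes decodable once eight MBs of frame *i* are done. A larger vertical MV component pushes the required MB further down the frame, which delays the start of the dependent frame.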

Fig. 16. MB-level parallelism in the temporal domain in H.264.

The main disadvantage of this scheme is its limited scalability. The number of MBs that can be decoded in parallel is inversely proportional to the length of the vertical motion vector component. Thus, for this scheme to be beneficial, the encoder would have to heavily restrict the motion search area, which in most cases is not possible. Assuming it were possible, the minimum search area is around 3 MB rows: 16 pixels for the co-located MB, 3 pixels at the top and at the bottom of the MB for sub-sample interpolation, and some pixels for motion vectors (at least 10). As a result, the maximum parallelism is 14, 17 and 27 MBs for STD, HD and FHD frame resolutions, respectively.

The second limitation of this type of MB-level parallelism is poor load balancing (Lai Ming che 2006), because the decoding time differs from frame to frame. A fast frame that is predicted from a slow frame cannot be decoded faster than the slow frame and therefore remains idle for some time. Finally, this approach works well for the encoder, which has the freedom to restrict the range of the motion search area. In the decoder, the motion vectors can have large values, which reduces the number of frames that can be processed in parallel.

In summary, parallelizing the H.264 encoding process, and motion estimation in particular, is expected to yield optimized performance (Kun et al. 2009), provided the hardware and software requirements of the design are met. The resulting higher computational throughput comes at the cost of maintaining an appreciable load balance among the processor cores.

**1.4 Applications**

The H.264/AVC (T.Wiegand 2003) standard video format has a very broad application range that covers all forms of digital compressed video from low bit-rate Internet streaming
