**4.2 Parallelization of WZ decoding**

WZ video coding concentrates most of its complexity on the decoder side. Examining each module of the decoder scheme (Figure 1) shows that most of this complexity lies in the Channel Decoder module (Brites et al., 2008). This module receives successive chunks of parity bits, and the quantized symbol stream associated with each bitplane is obtained through an iterative process based on the residual statistics calculated by the CNM. The procedure stops when a probability-based condition is satisfied. Consequently, the complexity of the decoder increases as more bitplanes (in the pixel domain) or coefficient bands (in the transform domain) are decoded.

As the first stage of the transcoding process, a WZ decoding architecture is proposed that distributes the decoding complexity across several processing units. The proposed architecture is shown in Figure 3. It is flexible and scalable, and it distributes the parallel decoding across two levels of parallelism: GOPs and frames. First, the input bitstream of K frames is stored in a K-frame buffer. Then, at the first parallelism level, the WZ frames between two K frames delimit a GOP structure, and therefore each GOP decoding procedure is carried out independently by a different core. Additionally, for each WZ frame inside a GOP, the SI is calculated and then split into several parts. Each part of the frame is assigned to a core, which executes the iterative turbo decoding procedure to decode the corresponding part of the reconstructed WZ frame. Each spatial division of the frame is thus decoded independently, using the feedback channel to request parity bits from the encoder. Once all parts of a given frame are decoded, they are joined in spatial order and the frame is reconstructed. Finally, a sequence joiner receives the decoded frames and the key frames and reorganizes the sequence in temporal order.

Mobile Video Communications Based on Fast DVC to H.264 Transcoding 21

Fig. 3. Proposed WZ-to-H.264/AVC transcoding architecture.
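The frame-level split and join in the architecture of Figure 3 can be pictured with a minimal Python sketch. It is illustrative only: the text says the SI is split into "several parts" without fixing their shape, so the horizontal strips used here are an assumption, and `split_rows`, `decode_part` and `decode_frame` are hypothetical names. `decode_part` is a placeholder that copies its input; it stands in for the iterative turbo decoding and does not model the feedback channel.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(frame, parts):
    """Split a frame (a list of pixel rows) into `parts` contiguous strips.

    Assumption: parts are horizontal strips; the source only says
    that the side information is split into several parts.
    """
    base, extra = divmod(len(frame), parts)
    strips, start = [], 0
    for k in range(parts):
        end = start + base + (1 if k < extra else 0)
        strips.append(frame[start:end])
        start = end
    return strips


def decode_part(si_strip):
    """Placeholder for the iterative turbo decoding of one strip.

    It only copies the side information; a real decoder would correct
    it with parity bits requested over the feedback channel.
    """
    return [row[:] for row in si_strip]


def decode_frame(side_information, parts, workers):
    """Decode every strip on a pool of workers, then join in spatial order."""
    strips = split_rows(side_information, parts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so the strips
        # come back already sorted top to bottom.
        decoded = list(pool.map(decode_part, strips))
    return [row for strip in decoded for row in strip]


# Hypothetical 6x4 "frame" used only to exercise the sketch.
si = [[r * 10 + c for c in range(4)] for r in range(6)]
frame = decode_frame(si, parts=4, workers=2)
```

Because the placeholder decoder copies its input, `frame` equals `si` here; the point of the sketch is only that the parts are processed independently and reassembled in their original spatial order.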

Concerning the scheduler, a dynamic scheduler is implemented. This means that whenever a core is free and a task is pending, the task is assigned to the idle core. The number of tasks is always equal to, or greater than, the number of cores, so there are always tasks in the scheduler queue until the end of the decoding stage is reached. However, the partial decoding of each frame requires a synchronization barrier. To illustrate this, Figure 4 shows the decoding timeline for a sequence composed of 5 GOPs (with length = 2) on a four-core processor. As can be seen, decoder initialization takes some time at the beginning of the decoding process. After that, each core receives a task (defined by a thread) from the scheduler. When a thread finishes decoding one part of a frame, it can continue decoding other parts of the same frame. If no more parts of that frame remain to be decoded, the core has to wait until the remaining parts of the frame are decoded. This is a consequence of the synchronization barrier implicit in the reconstruction of each frame. In Figure 4, a waiting thread is labeled as idle. In addition, as the decoding of the sequence nears its end, there are not enough tasks for the available cores, so several cores become idle until the decoding process finishes. Nevertheless, real sequences are composed of many GOPs, and the decoder initialization and ending times are much shorter than the whole decoding time.
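The dynamic scheduling and the per-frame synchronization barrier can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `run_dynamic_scheduler` and the `decode_part` callback are hypothetical names, the barrier is modeled as a completion counter (a frame is reconstructed only once all its parts have arrived), and dependencies between frames of the same GOP are not modeled.

```python
import queue
import threading


def run_dynamic_scheduler(frames, parts_per_frame, cores, decode_part):
    """Decode (frame, part) tasks on `cores` worker threads.

    Returns a dict mapping each frame id to its decoded parts in
    spatial order. `decode_part(frame_id, part_index)` is a
    caller-supplied stand-in for the partial WZ decoder.
    """
    tasks = queue.Queue()
    for f in frames:
        for p in range(parts_per_frame):
            tasks.put((f, p))

    lock = threading.Lock()
    done = {f: {} for f in frames}   # parts finished so far, per frame
    reconstructed = {}               # frames whose barrier has been passed

    def worker():
        while True:
            try:
                # A free core takes the next pending task, if any.
                f, p = tasks.get_nowait()
            except queue.Empty:
                return               # no tasks left: the core goes idle
            result = decode_part(f, p)
            with lock:
                done[f][p] = result
                # Barrier: reconstruct only when every part is decoded.
                if len(done[f]) == parts_per_frame:
                    reconstructed[f] = [done[f][k] for k in range(parts_per_frame)]

    threads = [threading.Thread(target=worker) for _ in range(cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return reconstructed


# Hypothetical usage: three WZ frames, four parts each, four cores.
result = run_dynamic_scheduler(
    ["WZ1", "WZ2", "WZ3"], 4, 4, lambda f, p: f"{f}.{p}"
)
```

In this sketch a worker that finds the queue empty simply exits, which corresponds to the idle periods at the end of the sequence described in the text; the waiting of a core on the remaining parts of its own frame is not modeled explicitly.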

Fig. 4. Timeline for the proposed parallel WZ decoding with a sequence with 5 GOPs (GOP length = 2) and 4 cores.

The size *S* of the K-frame buffer is defined by Equation 1, where *i* is the number of GOPs that can be executed in parallel. For example, in the execution shown in Figure 4, a 4-core processor can execute two GOPs at the same time, so three stored K frames provide enough tasks for four cores. In addition, the buffer does not need to be completely filled; it can be filled progressively during the decoding process. The buffer size is the same for different GOP lengths, since a WZ GOP of any length needs only two K frames to start decoding its first WZ frame.

$$S = i + 1 \tag{1}$$
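As a quick check of Equation 1 against the Figure 4 example, where two GOPs are decoded at the same time ($i = 2$):

$$S = i + 1 = 2 + 1 = 3$$

This matches the three stored K frames mentioned in the text. A plausible reading, assuming that consecutive GOPs share their boundary K frame, is that $i$ GOPs in flight are delimited by $i + 1$ K frames.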

Finally, because the parity bits may be requested from the encoder out of sequential order, the Parity Position (*PP*) is calculated; it determines the position of the first parity bit to be sent. *PP* is calculated by Equation 2, where *I* is the Intra period, *P* is the position of the current GOP, *Q* is the quantization parameter, *W* is the width of the image, and *H* is its height.

$$PP = (I - 1) \cdot P \cdot Q \cdot \left( \left( \frac{W \cdot H \cdot 2}{8} \right) + 1 \right) \tag{2}$$
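Equation 2 can be evaluated with a few lines of Python. The sketch below is illustrative: the function name `parity_position` and the sample values are hypothetical, the use of integer division assumes that $W \cdot H \cdot 2$ is a multiple of 8, and the text does not state whether the GOP position *P* is counted from 0 or from 1.

```python
def parity_position(intra_period, gop_position, q, width, height):
    """Evaluate Equation 2: PP = (I - 1) * P * Q * ((W * H * 2) / 8 + 1).

    Assumption: (W * H * 2) is divisible by 8, so integer division
    is exact; the source does not specify a rounding rule.
    """
    frame_term = (width * height * 2) // 8
    return (intra_period - 1) * gop_position * q * (frame_term + 1)


# Hypothetical values: QCIF frame (176x144), Intra period 2,
# third GOP, Q = 4. These numbers are not taken from the source.
pp = parity_position(intra_period=2, gop_position=3, q=4, width=176, height=144)
print(pp)  # 76044
```

With these inputs the frame term is $176 \cdot 144 \cdot 2 / 8 = 6336$, so $PP = 1 \cdot 3 \cdot 4 \cdot 6337 = 76044$.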


One desired feature of every transcoder is flexibility. To achieve it, an important process that must be performed with care is GOP mapping. In the second part of the transcoder, a DVC-to-H.264/AVC conversion is proposed that supports every mapping combination and uses techniques to reduce the time spent in the transcoding process. To extract the MVs, the distance used to calculate the SI is considered first. For example, Figure 7 shows the transcoding process from a DVC GOP of length 4 to an H.264/AVC IPPP pattern (baseline profile). In step 1, DVC starts to decode the frame labeled WZ2, and the MVs generated during its SI generation are discarded because they are not closely correlated with the true motion (low accuracy). Once the WZ2 frame has been reconstructed (through the entire WZ decoding algorithm, giving WZ'2), in step 2 the WZ decoding algorithm starts to decode frames WZ1 and WZ3 using the reconstructed frame WZ'2. At this point, the MVs V0-2 and V2-4 generated in this second iteration of the DVC decoding algorithm are stored. These MVs are then used to reduce the H.264/AVC ME process. The procedure is the same for larger GOP sizes. In other words, MVs are stored and reused when the distance between the SI and each of its two reference frames is 1. Finally, V0-2 and V2-4 are halved, because each P frame references the frame at distance one, whereas the MVs were calculated over a distance of two during SI generation.
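The halving of the stored MVs can be illustrated with a short Python sketch. It is a simplification under stated assumptions: MVs are represented as hypothetical `(dx, dy)` tuples in pixels, `halve_motion_vectors` is an illustrative name, and fractional results are kept as they are because the text does not say how the transcoder rounds them.

```python
def halve_motion_vectors(mvs):
    """Scale MVs estimated over a two-frame distance to a one-frame distance.

    `mvs` is a list of (dx, dy) displacements in pixels. No rounding
    is applied; the rounding policy is not given in the source.
    """
    return [(dx / 2, dy / 2) for dx, dy in mvs]


# Hypothetical MVs standing in for V0-2 (frame 0 to frame 2).
v0_2 = [(4, -2), (0, 6), (-3, 1)]
halved = halve_motion_vectors(v0_2)
print(halved)  # [(2.0, -1.0), (0.0, 3.0), (-1.5, 0.5)]
```

Each halved vector approximates the displacement between adjacent frames, which is the distance a P frame in the IPPP pattern uses for its reference.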

Fig. 7. Mapping from DVC GOP of length 4 to H.264 GOP IPPP.

Fig. 6. Search area reduction for H.264 encoding stage.

**4.3.2 Mapping GOPs from DVC to H.264** 
