5. Proposed parallel decoding scheme

The SISO decoding unit is similar to the one depicted in Figure 8. The ALPHA and BETA modules compute the unnormalized forward metrics and the unnormalized backward metrics, respectively. The GAMMA module computes the transition metrics and also performs the normalization (the metrics for state S0 are subtracted from the metrics corresponding to states S1, …, S7). The output LLRs are computed inside the L module and normalized by the NORM module. The MUX-MAX module selects the inputs for the forward and backward runs on the trellis and also implements the maximum operator. Finally, the MEM BETA module stores the backward metrics.

Since the same approach is used for both the WiMAX and the LTE proposed serial decoding architectures, the same remarks apply. In particular, the normalization function brings the same benefits for the LTE turbo decoder.

40 Field - Programmable Gate Array

Figure 11. Time diagram for a serial turbo decoder.

The serial architecture described in Figure 10 for LTE systems can be reorganized into a parallel setup by instantiating the RSC SISO module N times in the structure. We propose a configuration that concatenates the N values associated with the N RSCs and uses a single physical memory for each of the memories in the scheme. The K locations of 10 bits each (corresponding to the serial architecture) are replaced by K/N locations of 10N bits each (suitable for the parallel format).
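As a sketch of this memory reorganization, the following minimal Python model packs K serial 10-bit words into K/N wide words; the column-major mapping between serial indices and word positions is our assumption for illustration, not taken from the chapter:

```python
# Minimal model (not from the chapter) of the memory reorganization:
# K words of 10 bits become K/N words of 10*N bits, one wide word per
# row holding the N values processed in parallel.
K, N, W = 40, 8, 10               # block size, parallelism, bits per value

def pack_parallel(serial_mem):
    """Concatenate N serial values into each wide parallel word."""
    rows = K // N
    parallel_mem = []
    for r in range(rows):
        word = 0
        for c in range(N):
            # assumed mapping: RSC c works on sub-block c, i.e. serial
            # index c*rows + r lands at position c of row r
            word |= (serial_mem[c * rows + r] & ((1 << W) - 1)) << (c * W)
        parallel_mem.append(word)
    return parallel_mem

def read_position(word, c):
    """Extract the 10-bit value at position c of a wide word."""
    return (word >> (c * W)) & ((1 << W) - 1)
```

With K = 40 and N = 8, the 40 locations of 10 bits become 5 locations of 80 bits, matching the K/N locations · 10N bits organization described above.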

The most important benefit brought by the proposed serial decoding scheme is that the interleaver module is used only once, before the decoding stage. The ILM is updated each time a new data block enters the decoder, while the previous block is still being decoded. This approach enables a fast and simple transition to the parallel scheme. Since the factor N is known, the ILM can have K/N locations, with N values written at each location (i.e., the ILM can be prepared for the parallel processing that follows). As mentioned in Ref. [16], a Virtex 5 block memory can be organized from a configuration of 32k locations · 1 bit up to a setup of 512 locations · 72 bits. In the costliest scenario (i.e., K = 6144), with the stored values represented on 10 bits, the parallel ILM can be organized, depending on N, as:

• 768 locations · 80 bits
• 1536 locations · 40 bits
• 3072 locations · 20 bits
• 6144 locations · 10 bits

Only two BRAMs are used, the same as in the case of the serial ILM.

Figure 12 shows the ILM working principle. As one can observe, during the writing procedure, each index i from 0 to K − 1 generates a corresponding interleaved value. All the computed values are stored in the ILM in the same order. We regard the ILM as a matrix, the rows being the memory locations and the columns being the positions inside each location. The first K/N interleaved values are placed in the first column, the second set of K/N values is stored in the second column, and so on. To implement the described method, a true dual-port BRAM is selected. In Figure 12, each time a new value is added on row WA at column WP (next to the content already present at columns up to WP − 1), the content of row WA + 1 is also read from the memory. In the next clock period, a new value is added on row WA + 1 at column WP (next to the content already present at columns up to WP − 1), while the content of row WA + 2 is read, and so on.

Figure 12. ILM memory writing procedure.

When the interleaver function is used, the ILM is read in the normal way and the N interleaved values from a row are employed as reading addresses for the V2(Xk) memory. Furthermore, the LTE interleaver (with its QPP algebraic properties) always places on the same row the N values that should be read in interleaved order from the ILM. The only additional task is a reordering process needed to match the corresponding RSCs. An example is presented in Figure 13 for the values K = 40 and N = 8. On the left side, the content of the V2(Xk) memory is shown; each column is composed of the outputs generated by one of the N RSC SISOs. On the right side, the content of the ILM is described. The minimum value from a line of the ILM represents the line address for the V2(Xk) memory (see the gray circle in the illustration). By using a reordering module, each position from the read line is directed to its corresponding SISO. For example, position c from the first read line (index 10) is sent to SISO g, whereas position c from the second read line (index 13) is sent to SISO a. The same procedure applies to the deinterleaving process, except that the write addresses are extracted from the ILM, while the reading ones are used in the natural order.

Figure 13. Virtual parallel interleaver.

For the reordering module, an even-odd merge sorting network is applied. This method was introduced by Batcher in Ref. [14] and belongs to the family of sorting networks, which includes several approaches. One example is bubble sorting, which repeatedly sorts adjacent pairs of elements. Another is shell sorting, which arranges the input data into an array and then repeatedly sorts the array's columns; after each iteration, the array becomes one column narrower. A third example is even-odd transposition sorting, which alternately sorts the odd-indexed elements with their adjacent even-indexed elements and the even-indexed elements with their adjacent odd-indexed elements. A fourth example is bitonic sorting: the two halves of the input data are sorted in opposite directions and then jointly processed to produce one complete sorted sequence.
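Returning to the ILM, its writing procedure and the QPP row property exploited above can be checked with a small Python model; the QPP pair f1 = 3, f2 = 10 for K = 40 is quoted here from the LTE tables and should be treated as an assumption:

```python
# Small model of the ILM: the K interleaved indices of Eq. (3) are
# written column by column into a memory of K/N rows and N positions.
# The QPP pair (f1, f2) = (3, 10) for K = 40 is assumed from the LTE tables.
K, N = 40, 8
f1, f2 = 3, 10
rows = K // N                              # 5 rows of N = 8 positions

ilm = [[0] * N for _ in range(rows)]
for i in range(K):
    pi = (f1 * i + f2 * i * i) % K         # interleaved index pi(i)
    col, row = divmod(i, rows)             # first K/N values -> column 0, etc.
    ilm[row][col] = pi

# QPP row property: the N indices on one row share the same residue
# mod K/N (one V2(Xk) line) and cover N distinct positions of that line,
# so a single wide read plus a reordering step serves all N SISOs.
for r in range(rows):
    assert len({v % rows for v in ilm[r]}) == 1
    assert sorted(v // rows for v in ilm[r]) == list(range(N))
```

The two assertions are exactly the property that makes the single wide ILM read contention-free; only the reordering of positions toward the SISOs remains to be done.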

The even-odd merge sorting method is based on a theorem stating that any list of a = 4b (b natural) elements can be sorted by applying the following steps: first, the two halves of the list are sorted separately. After this step, the elements with odd index and the ones with even index are sorted separately. The last step consists of a compare-and-switch procedure executed over all the element pairs 2n and 2n + 1 (n = 1, …, a/2 − 1). The proof of this theorem is available in Ref. [23]. An example for N = 8 is depicted graphically in Figure 14. From a timing point of view, Figure 15 depicts the case when N = 2 is used; the same comments as the ones for Figure 11 apply.
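As a software sketch of such a network, the following is our Python rendering of Batcher's construction for power-of-two sizes; a hardware version would instantiate the same fixed compare-and-switch stages as parallel comparators:

```python
# Batcher even-odd merge sorting network, modelled in Python for a
# power-of-two input size such as N = 8. Every compare-and-switch the
# loops perform corresponds to one comparator of the hardware network.
def oddeven_merge_sort(values):
    """Return a sorted copy of `values` (length must be a power of two)."""
    a = list(values)
    n = len(a)
    p = 1
    while p < n:                          # merge sorted runs of length p
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # compare-exchange only inside one 2p-sized merge block
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        if a[i + j] > a[i + j + k]:
                            a[i + j], a[i + j + k] = a[i + j + k], a[i + j]
            k //= 2
        p *= 2
    return a
```

For eight inputs, Batcher's construction yields 19 comparators arranged in six stages, which is the structure Figure 14 depicts.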

Figure 14. Even-odd merge sorting for N = 8.

In combination with the presented parallel decoding architecture, we also propose a simplified implementation for the interleaver block. As seen from Eq. (3), computing the memory addresses π(i) requires three multipliers, one adder and one divider (used to extract the remainder for the modulo operation). Over all possible K values, the range of the quotients involved in the division is very large, since the numerator and the denominator can take very large values, often situated in different numerical ranges (up to billions). We propose an efficient method to reduce the arithmetic complexity associated with Eq. (3).
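To illustrate the ranges involved, the worst-case dividend for the direct evaluation of Eq. (3) can be computed for the largest LTE block; the QPP pair f1 = 263, f2 = 480 for K = 6144 is quoted here from the LTE tables as an assumption:

```python
# Worst-case dividend of the direct Eq. (3) evaluation for the largest
# LTE block. The pair (f1, f2) = (263, 480) for K = 6144 is an assumed
# value taken from the LTE interleaver tables.
K, f1, f2 = 6144, 263, 480
worst = f1 * (K - 1) + f2 * (K - 1) ** 2   # largest value fed to mod K
print(worst, worst.bit_length())           # about 1.8e10, a 35-bit dividend
```

A direct implementation therefore needs a wide divider, whereas the recursive scheme derived next keeps every operand below 2K.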

By introducing the notation

Efficient FPGA Implementation of a CTC Turbo Decoder for WiMAX/LTE Mobile Systems http://dx.doi.org/10.5772/67017 45

Figure 15. Time diagram for parallel turbo decoder (N = 2).

$$p(i) = f_1 i + f_2 i^2 \tag{23}$$

it can be observed that


$$p(0) = 0, \qquad p(i) = p(i-1) + s_1 + s_2(i), \quad i > 0, \tag{24}$$

where

$$s_1 = f_1 \quad \text{and} \quad s_2(i) = \begin{cases} 0, & i = 0, \\ f_2, & i = 1, \\ s_2(i-1) + 2f_2, & i > 1. \end{cases} \tag{25}$$

We can rewrite Eq. (3) using Eqs. (23) and (24):

$$\pi(i) = p(i) \bmod K = \left[ p(i-1) + s_1 + s_2(i) \right] \bmod K \tag{26}$$

The multiplications are replaced by additions, which require fewer hardware resources. Nevertheless, the division is still necessary for the modulo operation. If we use the property of the modulo operator applied to a sum of elements,

$$\left[\sum_{k} c_{k}\right] \bmod K = \left[\sum_{k} \left(c_{k} \bmod K\right)\right] \bmod K, \tag{27}$$

we can decrease the arithmetic effort needed to obtain π(i) in Eq. (26). The number of modulo operations increases, but the overall complexity of the corresponding divisions is reduced, since smaller operands are involved. Consequently, using Eqs. (25)–(27), one can write:

$$s_3(i) = s_1 + s_2(i) = \begin{cases} f_1 + f_2, & i = 1, \\ s_3(i-1) + 2f_2, & i > 1. \end{cases} \tag{28}$$

Using Eq. (28) in Eq. (26), the result is

$$\begin{aligned} \pi(i) &= p(i) \bmod K = [p(i-1) + s_3(i)] \bmod K \\ &= [p(i-1) + s_3(i-1) + 2f_2] \bmod K \\ &= [\pi(i-1) + s_3(i-1) \bmod K + 2f_2 \bmod K] \bmod K. \end{aligned} \tag{29}$$

All the numerical values added in the last stage of Eq. (29) are lower than K and are either available recursively during the processing of a frame, such as π(i − 1) and s3(i − 1) mod K, or can be precomputed and stored, as in the case of 2f2 mod K. The overall arithmetic complexity of the address generation module is thus reduced to 2K additions and 2K simplified modulo operations (each resolvable with one comparison and one subtraction). The method improves on the solutions presented in Refs. [24, 25] by eliminating all multiplications and divisions. Additionally, the low numerical range of the operands (values lower than 2K, i.e., in the range of thousands) allows minimal resources for the representation of the binary values.
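A minimal Python model of Eqs. (28) and (29) follows, with each simplified modulo implemented as one comparison and one subtraction; the QPP pair for K = 40 is assumed as before:

```python
# Multiplier- and divider-free QPP address generation per Eqs. (28)-(29):
# every modulo is a single compare-and-subtract, since all operands stay
# below 2K. The pair (f1, f2) = (3, 10) for K = 40 is an assumed example.
def qpp_addresses(K, f1, f2):
    """Yield pi(0), ..., pi(K-1) using only additions and comparisons."""
    def red(x):                   # x < 2K, so one compare-subtract reduces mod K
        return x - K if x >= K else x

    two_f2 = red(f2 + f2)         # 2*f2 mod K, precomputed once (f2 < K)
    s3 = red(f1 + f2)             # s3(1) mod K, Eq. (28)
    pi = 0                        # pi(0)
    yield pi
    for _ in range(1, K):
        pi = red(pi + s3)         # Eq. (29): pi(i) = [pi(i-1) + s3(i)] mod K
        yield pi
        s3 = red(s3 + two_f2)     # Eq. (28): s3(i+1) = [s3(i) + 2*f2] mod K

K, f1, f2 = 40, 3, 10
addrs = list(qpp_addresses(K, f1, f2))
assert addrs == [(f1 * i + f2 * i * i) % K for i in range(K)]
```

In total, 2K additions and 2K compare-and-subtract steps generate the whole address sequence, as stated above.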
