The LTE decoding process follows a similar approach as for WiMAX systems, i.e., it moves forward and backward through the trellis. The branch metric associated with the transition from state Si to state Sj is

$$
\gamma\_{ij} = V(X\_k)X(i,j) + \Lambda^i(Z\_k)Z(i,j), \tag{14}
$$

where X(i, j) and Z(i, j) are the data, respectively, the parity bits, both associated with one branch, and Λ<sup>i</sup>(Zk) is the LLR for the input parity bit. For the SISO 1 decoding unit, this input LLR is Λ<sup>i</sup>(Zk), whereas for SISO 2 it becomes Λ<sup>i</sup>(Z'k). For SISO 1, V(Xk) = V1(Xk) = Λ<sup>i</sup>(Xk) + W(Xk), whereas for SISO 2, V(Xk) = V2(X'k) = IL{Λ<sup>o</sup>1(Xk) − W(Xk)}, where the "IL" operator denotes the interleaving procedure. In Figure 5, W(Xk) is the extrinsic information, whereas Λ<sup>o</sup>1(Xk) and Λ<sup>o</sup>2(X'k) are the output LLRs generated by the two SISOs.

Looking at the LTE turbo encoder trellis, one can notice that between two states, there are four possible values for the branch metrics:

$$
\gamma^0 = 0, \quad \gamma^1 = V(X\_k), \quad \gamma^2 = \Lambda^i(Z\_k), \quad \gamma^3 = V(X\_k) + \Lambda^i(Z\_k). \tag{15}
$$

#### 3.2.1. Backward recursion

The algorithm moves backward over the trellis, computing the metrics. The obtained values for each node are stored in a normalized manner; they will be used for the LLR computation once the algorithm starts moving forward through the trellis. We denote by βk(Si) the backward metric computed at the kth stage, for the state Si, where 2 ≤ k ≤ K + 3 and 0 ≤ i ≤ 7. For the backward recursion, the initialization βK+3(Si) = 0, 0 ≤ i ≤ 7, is used at the stage k = K + 3. For the rest of the stages, 2 ≤ k ≤ K + 2, the computed backward metrics are

$$
\hat{\beta}\_k(S\_i) = \max\left\{ \beta\_{k+1}(S\_{j\_1}) + \gamma\_{ij\_1},\; \beta\_{k+1}(S\_{j\_2}) + \gamma\_{ij\_2} \right\}, \tag{16}
$$

where Sj1 and Sj2 are the two states from stage k + 1 connected to the state Si from stage k, and β̂k(Si) represents the unnormalized metric. Once the unnormalized metric β̂k(S0) is computed for state S0, all the backward metrics for states S1…S7 are normalized as

$$
\beta\_k(S\_i) = \hat{\beta}\_k(S\_i) - \hat{\beta}\_k(S\_0) \tag{17}
$$

and then stored in the dedicated memory.

#### 3.2.2. Forward recursion

When the backward recursion is finished, the algorithm moves forward through the trellis in the normal direction. This specific phase of the decoding is similar to the one for the Viterbi algorithm; in this case, the storing procedure is needed only for the previous stage metrics.

Field-Programmable Gate Array

4. Proposed serial decoding scheme

#### 4.1. WiMAX systems

One important remark about the decoding algorithm is that the outputs of one constituent decoder represent the inputs of the other constituent decoder. At the same time, since the interleaver and deinterleaver procedures apply over complete data blocks, the two constituent decoders operate in a nonoverlapping manner, which allows the usage of a single constituent decoder. This decoding unit operates time multiplexed and the corresponding proposed scheme is presented in Figure 7.

Figure 7. Proposed decoder scheme.

In Figure 7, we can identify the storage requirements: the memory blocks that store data from one semi-iteration to another and the memory blocks used from one iteration to another. IL stands for the interleaver/deinterleaver procedure, while CONTROL is the management unit controlling the decoder functionalities. This module provides the addresses used for read and write, the signals that trigger the forward and backward movements through the trellis, the selection of one of the two SISO units and also the control of the MUX and DEMUX blocks. An input buffer is also included, since the decoding architecture can accept a new encoded data block while still processing the previous one. The most important module shown in Figure 7 is the SISO unit, which is the decoding structure. Figure 8 depicts the block scheme of this decoding unit. One can observe the unnormalized metric computing modules BETA (backward) and ALPHA (forward) and the GAMMA module, which computes the transition metric. The latter also ensures the normalization: the metric values obtained for state S0 are subtracted from the metric values obtained for the states S1…S7. The output LLRs are computed inside the L module and normalized inside the NORM module. The MUX-MAX module provides the correct inputs when moving forward or backward through the trellis; it also computes the maximum function. The backward metrics are stored in the MEM BETA memory during the backward recursion, their values being read when executing the forward recursion, in order to compute the estimated LLRs.
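To make the roles of the GAMMA, BETA and MEM BETA blocks concrete, the backward recursion of Eqs. (16) and (17) can be sketched in software. This is a behavioral sketch only: it assumes the 8-state LTE constituent encoder (feedback polynomial 13, feedforward polynomial 15, in octal) and 0-based indexing, and the function and variable names are illustrative, not taken from the chapter.

```python
# Behavioral sketch of the max-log-MAP backward recursion with state-0
# normalization (Eqs. (16)-(17)). Trellis: 8-state RSC, polynomials 13/15
# (octal) -- an assumption for illustration.

def lte_trellis():
    """next_state[s][u] and parity[s][u] for the 8-state RSC (13/15 octal)."""
    nxt, par = [[0, 0] for _ in range(8)], [[0, 0] for _ in range(8)]
    for s in range(8):
        s1, s2, s3 = (s >> 2) & 1, (s >> 1) & 1, s & 1
        for u in (0, 1):
            d = u ^ s2 ^ s3                # feedback 1 + D^2 + D^3
            par[s][u] = d ^ s1 ^ s3        # feedforward 1 + D + D^3
            nxt[s][u] = (d << 2) | (s1 << 1) | s2
    return nxt, par

def backward_recursion(V, Lz):
    """V[k]: systematic LLR input, Lz[k]: parity LLR input, k = 0..K-1.
    Returns the normalized backward metrics beta[k][s]."""
    nxt, par = lte_trellis()
    K = len(V)
    beta = [[0.0] * 8 for _ in range(K + 1)]   # beta[K][s] = 0 initialization
    for k in range(K - 1, -1, -1):
        # Eq. (16): max over the two branches leaving each state;
        # the branch metric is u*V + z*Lz, i.e., one of the values in Eq. (15)
        bhat = [max(beta[k + 1][nxt[s][u]]
                    + u * V[k] + par[s][u] * Lz[k] for u in (0, 1))
                for s in range(8)]
        # Eq. (17): normalize by subtracting the state-0 metric
        beta[k] = [b - bhat[0] for b in bhat]
    return beta
```

Per Eq. (17), each row of the returned metrics has a zero entry for state S0, which is what keeps the dynamic range of the stored metrics bounded in the hardware implementation.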

Efficient FPGA Implementation of a CTC Turbo Decoder for WiMAX/LTE Mobile Systems, http://dx.doi.org/10.5772/67017

Figure 8. SISO block scheme.

It is important to mention that some studies have been conducted regarding the normalization function. Trying to increase the system frequency (in order to reduce the decoding latency and so to increase the decoded data throughput), one may think of removing the normalization, so as to reduce the amount of logic on the critical path. This solution is not applicable because five extra bits would be needed for the metric values, which means more memory blocks and more complex arithmetic. Finally, all this leads to a lower system frequency, so there is no benefit in this approach. On the other hand, we propose a dedicated approach to implement the metric computation blocks (ALPHA, BETA and GAMMA). Based on the trellis state, we identified the relations for each metric, 32 equations being used for transition metric computation (we remind that for each of the eight trellis states there are four possible transitions). Moreover, only 16 of them are distinct (the other 16 are the same) and, from these 16, some are null. Using this approach, a complexity decrease is obtained.

Figure 9 depicts the timing diagram for the proposed SISO. This corresponds to the scenario with one SISO unit and some MUX and DEMUX blocks replacing the two SISO units from the theoretical decoding architecture (see Figure 7).

In Figure 9, R/W (K − 1:0) means reading/writing memory from addresses K − 1 to 0, R/W {IL (K − 1:0)} means reading/writing memory from interleaved addresses K − 1 to 0 and COMPUTE means that the block is processing the input data.

#### 4.2. LTE systems


The same remark about the two SISO units from Figure 5 working in a nonoverlapping manner applies for LTE systems as for WiMAX ones. The same approach is used, i.e., the proposed decoding architecture includes only one SISO unit and some MUX and DEMUX blocks. Figure 10 depicts the block scheme of the proposed decoding architecture.

Figure 9. Time utilization for one turbo iteration.

Figure 10. Proposed serial turbo decoder block scheme.

One can observe the memory blocks in Figure 10. Some are used to store data between two successive semi-iterations, respectively, between two successive iterations. Others, drawn with dotted lines, are virtual memories, used only to clarify the introduced notations. Moreover, the interleaver and deinterleaver modules are introduced as distinct blocks in the scheme, but in fact they are the same: both include a block memory called ILM (interleaver memory) and an interleaver. The novelty of this approach compared to the previous serial implementation proposed in Ref. [7] is the ILM. This memory will allow a fast transition to a parallel decoding architecture. The input data memories (on the left side in Figure 10) and the ILM are switched buffers, allowing new data to be written while the previous block is still being decoded. The ILM is filled with the interleaved addresses; at the same time, the new data are stored in the input memories. The saved addresses are then used as read addresses for the interleaver unit and as write addresses for the deinterleaver unit. Here, we detail the way the architecture from Figure 10 works. The vectors V1(Xk) = Λ<sup>i</sup>(Xk) + W(Xk) and Λ<sup>i</sup>(Zk) are read from the corresponding memories by SISO 1. For the first semi-iteration, the memories are read in both directions, in order to ensure the forward and backward movements on the trellis. When this decoding phase is completed, the second semi-iteration starts: SISO 2 reads in both directions the memories storing the vectors V2(X'k) = IL{V2(Xk)} = IL{Λ<sup>o</sup>1(Xk) − W(Xk)} and Λ<sup>i</sup>(Z'k). IL stands again for the interleaver process.

In detail, SISO 1 reads the input memories and starts the decoding process, outputting the computed LLRs. Having the LLRs and the extrinsic values available, the vector V2(Xk) is computed and then stored in normal order in the memory. The ILM content, read in normal order, provides the reading addresses for the V2(Xk) memory, emulating the interleaver process. The reordered LLRs V2(X'k) are thus available, the corresponding values for the three tail bits X'K+1, X'K+2 and X'K+3 being added at the end of this sequence. The same SISO unit now acts as SISO 2, this time reading its data inputs from the other memory blocks. The two switching mechanisms from Figure 10 change position between these two semi-iterations (when in position 1, the V1(Xk) and Λ<sup>i</sup>(Zk) memories are active, while in position 2, the V2(X'k) and Λ<sup>i</sup>(Z'k) memories are used).

The SISO unit provides K values for the LLRs at the end of each semi-iteration. The LLRs obtained after the second semi-iteration are stored in the Λ<sup>o</sup>2(X'k) memory (the content of the ILM, already available for the V2(Xk) interleaver process, is also used, after a delay is added, as the write address for the Λ<sup>o</sup>2(X'k) memory).

The Λ<sup>o</sup>2(X'k) and V2(Xk) memories are read in normal order to allow the computation of W(Xk); W(Xk) is written in the corresponding memory and, at the same time, it is used for a new semi-iteration. In other words, the W(Xk) memory is updated during a semi-iteration. The time diagram for the proposed serial decoding architecture is presented in Figure 11, the intervals colored in gray indicating the writing periods for the W(Xk) memory. As mentioned in this chapter, the input memories and the ILM (the upper four memory blocks in the image) are switched buffers and they are filled with new data while the previously coded block passes through the last phase of its decoding process. The same notations as in Figure 9 are used.
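The W(Xk) update between semi-iterations can be sketched as follows. The sketch assumes the usual turbo extrinsic rule (output LLR minus the corresponding SISO input) as the chapter's sign convention, and it models the ILM read-back as indirect addressing; all names are illustrative.

```python
# Sketch of the W(Xk) update: the Lo2(X'k) memory is read back to natural
# order using the ILM addresses and V2(Xk) is subtracted.
# The subtraction follows the usual turbo extrinsic rule; the chapter's
# exact convention is assumed, not quoted.

def update_extrinsic(Lo2_interleaved, V2, ilm_addrs):
    """W[k] = Lo2 deinterleaved to natural order, minus V2[k]."""
    Lo2_natural = [0.0] * len(V2)
    for k, a in enumerate(ilm_addrs):
        # position k of the interleaved stream came from natural position a
        Lo2_natural[a] = Lo2_interleaved[k]
    return [Lo2_natural[k] - V2[k] for k in range(len(V2))]
```

Because the same ILM content serves both as read addresses for the interleaver and as write addresses for the deinterleaver, no second permutation table is needed.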

All the memory blocks in Figure 10 have 6144 locations, this being the maximum coded data block length defined by the standard. Only the memory blocks holding the input data for the SISO units have 6144 + 3 locations, because they also store the tail bits. All locations contain 10 bits. Using a Matlab simulator in finite precision, it has been observed that six bits are needed for the integer part, in order to cover the dynamic range of the variables, and three bits are needed for the fractional part, to keep the decoding performance close to the theoretical one with a certain accepted level of degradation. The 10th bit is the sign bit.
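The 10-bit format (sign, six integer bits, three fractional bits) can be modeled as a two's-complement Q6.3 quantizer. The saturation behavior and rounding mode below are assumptions for illustration; the chapter only specifies the bit allocation.

```python
# Model of the 10-bit fixed-point format used for the decoder variables:
# 1 sign bit, 6 integer bits, 3 fractional bits. Saturating two's-complement
# behavior is assumed here for illustration.

def quantize_q6_3(x):
    """Round x to the nearest multiple of 2**-3 and saturate to 10 bits."""
    scaled = round(x * 8)                    # 3 fractional bits -> step 1/8
    scaled = max(-512, min(511, scaled))     # 10-bit two's-complement range
    return scaled / 8.0
```

The representable range is then [−64, 63.875] with a resolution of 0.125, consistent with six integer bits covering the dynamic range of the metrics.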


Figure 11. Time diagram for a serial turbo decoder.

The SISO decoding unit is similar to the one depicted in Figure 8. The ALPHA and BETA modules compute the unnormalized forward and backward metrics, respectively. The GAMMA module computes the transition metrics and also executes the normalization (the metrics for state S0 are subtracted from the metrics corresponding to states S1, …, S7). The output LLRs are computed inside the L module and normalized by the NORM module. The selection of the inputs for forward and backward movement on the trellis, as well as the maximum function, is executed by the MUX-MAX module. Finally, the MEM BETA module stores the backward metrics.


Since the same approach is used for both the WiMAX and the LTE proposed serial decoding architectures, the same remarks apply. Thus, for the LTE turbo decoder as well, the normalization function keeps the dynamic range of the variables reduced. Eliminating it in order to reduce the number of logic levels on the critical path does not lead to a higher system frequency, because more memory blocks are then required and more complex arithmetic is used (since the variables are expressed on more bits); as an overall consequence, a lower clock frequency is obtained for the design.

For the ALPHA, BETA and GAMMA modules inside the SISO decoding unit, the dedicated equations are again used to compute the metrics. Sixteen such relations are implemented for transition metric computation (eight trellis states with two possible transitions each). In fact, only four equations are distinct (as indicated in Eq. (15)), and one of them is null. In this way, the computational effort of the proposed architecture is minimized.
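The claim that only four distinct transition-metric values occur (one of them null) can be checked with a short enumeration: the branch metric depends only on the branch labels (x, z), so the 16 transitions collapse to the four values of Eq. (15). The 8-state RSC with polynomials 13/15 (octal) is assumed here for illustration.

```python
# Enumerate the branch metrics over all 8 states and both input bits for an
# 8-state RSC (feedback 13, feedforward 15, octal -- assumed for
# illustration). Only the four values of Eq. (15) occur: 0, V, Lz, V + Lz.

def branch_metric_values(V, Lz):
    values = set()
    for s in range(8):
        s1, s2, s3 = (s >> 2) & 1, (s >> 1) & 1, s & 1
        for x in (0, 1):                 # systematic bit on the branch
            d = x ^ s2 ^ s3              # feedback term
            z = d ^ s1 ^ s3              # parity bit on the branch
            values.add(x * V + z * Lz)
    return values
```

In hardware, this means only three adder outputs (plus the constant zero) need to be computed and routed to the 16 transitions, which is the source of the complexity reduction.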

The interleaving and deinterleaving procedures implement the same equation. The interleaved index is computed using a modified form of Eq. (3), i.e.,

$$
\pi(i) = \langle [(f\_1 + f\_2 \cdot i) \text{mod } K] \cdot i \rangle \text{mod } K \tag{22}
$$

For the interleaving process, the data are written into the memory block in natural order and then read in interleaved order, while for the deinterleaving process the data are written in interleaved order and then read in natural order.

The computation in Eq. (22) is executed in three phases. First, the value (f1 + f2 · i) mod K is obtained. This partial result is then multiplied by the index i (describing the natural order), and the obtained value is passed once again through a modulo-K block. One remark about this computation: the term f1 + f2 · i increases by f2 for consecutive values of the index i, so a register accumulates f2 for each new index; if the current register value reaches or exceeds K, K is subtracted and the result is placed back in the register. This processing requires one system clock period, the results being generated in a continuous manner.
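The incremental scheme above can be sketched in software as follows; f1 = 3, f2 = 10 and K = 40 are the QPP parameters for the shortest LTE block size, used here as an example.

```python
# Sketch of Eq. (22) computed incrementally, as in the proposed hardware:
# the partial term (f1 + f2*i) mod K is kept in a register and updated by
# adding f2, with a conditional subtraction of K, for each new index i.
# K = 40, f1 = 3, f2 = 10 (example LTE QPP parameters).

def qpp_interleaver(K, f1, f2):
    """Return pi(i) = (((f1 + f2*i) mod K) * i) mod K for i = 0..K-1."""
    pi = []
    reg = f1 % K                      # register holding (f1 + f2*i) mod K
    for i in range(K):
        pi.append((reg * i) % K)      # multiply by i, then second modulo-K
        reg += f2                     # accumulate f2 for the next index
        if reg >= K:
            reg -= K                  # conditional subtraction replaces mod
    return pi
```

Because both f2 and the register value stay below K, a single conditional subtraction is sufficient, which is why the hardware needs only one clock period per index.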
