## 2. The coding scheme

### 2.1. WiMAX systems

In their early existence, turbo codes proved to achieve great decoding performance, so they were adopted in many standards as recommendations. They became an even more appealing solution once the processing capacity of field-programmable gate arrays (FPGAs) and digital signal processors (DSPs) increased. Their implementation complexity was not prohibitive anymore, which allowed them to become mandatory.

In this context, the Third-Generation Partnership Project (3GPP) organization adopted these novel coding techniques early. Turbo codes were first introduced into a standard by the initial version of the Universal Mobile Telecommunications System (UMTS) technology, in 1999. The subsequent UMTS releases (the following high-speed packet access) contributed new and interesting features, while turbo coding remained unchanged. Furthermore, several modifications were introduced by the long-term evolution (LTE) standard; even if they were not significant in volume, their importance arose in terms of concept. In this framework, 3GPP proposed a new interleaver scheme for LTE, while maintaining exactly the same coding structure as in UMTS. Turbo codes were also introduced by the Institute of Electrical and Electronics Engineers (IEEE) in the 802.16 standards, known as the base for WiMAX systems.

In Ref. [4], a UMTS-dedicated binary turbo decoding scheme is developed, whereas for WiMAX systems a similar duo-binary architecture is presented in Refs. [5] and [6]. Thanks to the new LTE/LTE-Advanced (LTE-A) interleaver, the decoding performance is improved compared to that of the UMTS standard. In addition, the new LTE interleaver has native properties suited to a parallel decoding approach inside the algorithm, thus taking advantage of the main idea behind turbo decoders (i.e., exchanging the extrinsic values between the two decoding units). In Ref. [7], a serial decoding scheme implemented on FPGA is presented. However, parallelization is still required when high throughput is needed, as in the particular case of LTE systems using diversity techniques.

In the past years, many interesting parallel decoder schemes have been studied. The obtained results are measured along two directions: the first is the decoding performance degradation of the parallel solution with respect to the serial one; the second is the hardware resources occupied by such a parallel decoder implementation. In Ref. [8], a first group of parallel decoding solutions is presented, based on the classical maximum a posteriori (MAP) algorithm. This method passes through the trellis twice, the first time to compute the forward state metrics (FSM) and the second time to obtain the backward state metrics (BSM) and, simultaneously, the log-likelihood ratios (LLR). Following this approach, several schemes were developed in order to reduce the theoretical latency of the decoding process of 2K clock periods for each semi-iteration (where K is the data block length). In Refs. [9] and [10], a second set of parallel architectures is described, which takes advantage of the algebraic-geometric properties of the quadratic permutation polynomial (QPP) interleaver. In these works, efficient hardware implementations of the QPP interleaver are proposed. However, the parallelization factor N still represents the number of interleavers used in the developed architectures.

In Ref. [11], a third approach was reported, which consists in using a folded memory. All the data needed for parallel processing are stored at the same time. On the other hand, the main

26 Field - Programmable Gate Array

Section 8.4 of the 802.16 standard [18] presents the coding scheme on which the proposed decoder is based. Figure 1 shows the duo-binary encoder. The native coding rate is 1/3. In order to obtain other coding rates, a puncturing block must be used. Accordingly, a depuncturing block must be added to the receiver architecture.

Figure 1. (a) 802.16e turbo coding scheme; (b) constituent encoder.

Let us define the following parameters: the coding rate R; the block dimension (in pairs of bits, i.e., dibits) K, which is computed independently of the coding rate, as a function of the uncoded block size; the number of iterations L and the corresponding latency Latency (in clock periods); the information bit rate Rb [Mbps]; and the system clock frequency Fclk [MHz].

As mentioned in Ref. [6], the main problem of a convolutional turbo code (CTC) decoder implementation is represented by the amount of required hardware resources. Moreover, in order to reach the targeted high data rate, the system clock has to be fast. Equation (1) presents the decoding throughput.

$$R\_b = \frac{2\text{K}}{\text{Latency } T\_{clk}}\tag{1}$$

For a fixed latency algorithm, according to Eq. (1), the output throughput is improved when achieving a higher clock frequency. Another way is to reduce latency using a parallel architecture; however, this increases the occupied area and may lead to a smaller clock frequency due to longer routes. Moreover, another direct constraint is the significant memory needed for storing data. This issue also affects the frequency, since a large number of used memory blocks leads to a large resource spread on chip and, obviously, longer routes.
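As a quick numerical illustration of Eq. (1), the throughput can be evaluated for a serial schedule; note that the block length, iteration count, and clock frequency below are hypothetical values chosen only to exercise the formula, not parameters of the proposed decoder.

```python
# Worked example for Eq. (1): R_b = 2K / (Latency * T_clk).
# All numbers below are illustrative assumptions, not decoder parameters.
K = 2400                            # block length in dibits -> 2K information bits
half_iterations = 16                # e.g., 8 full iterations, two half-iterations each
latency = 2 * K * half_iterations   # serial MAP: ~2K clock periods per half-iteration
f_clk = 200e6                       # system clock frequency in Hz
t_clk = 1.0 / f_clk                 # clock period in seconds

r_b_mbps = 2 * K / (latency * t_clk) / 1e6
print(r_b_mbps)                     # 12.5 Mbps under these assumptions
```

Doubling the clock frequency or halving the latency (e.g., through parallelization) doubles the throughput, which is exactly the trade-off discussed in the text.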

Taking into account the previously mentioned aspects, we can conclude that all the parameters presented above are related, so that a global optimization is not possible. Consequently, we have chosen to balance each direction in order to meet throughput requirements.

### 2.2. LTE systems

A classic turbo coding scheme is presented in the 3GPP LTE specification, including two constituent encoders and one interleaver module (Figure 2). The data block Ck is applied at the input of the LTE turbo encoder. The K bits from this input data block are transferred to the output, as systematic bits, in the stream Xk. At the same time, the first constituent encoder processes the input data block, resulting in the parity bits Zk, whereas the second constituent encoder processes the interleaved data block C'k, resulting in the parity bits Z'k. Combining the systematic bits and the two streams of parity bits, we obtain the following sequence at the output of the encoder: X1, Z1, Z'1, X2, Z2, Z'2, …, XK, ZK, Z'K.

In order to drive the constituent encoders back to the initial state (at the end of the coding process), the switches from Figure 2 are moved from position A to position B. Since the final states of the two constituent encoders are not the same (different input data blocks produce different final states), this switching procedure generates tail bits for each encoder. These tail bits are sent together with the systematic and parity bits, thus resulting in the following final sequence: XK+1, ZK+1, XK+2, ZK+2, XK+3, ZK+3, X'K+1, Z'K+1, X'K+2, Z'K+2, X'K+3, Z'K+3.

As previously mentioned and discussed in Ref. [7], the LTE turbo coding scheme introduces a new interleaving structure. Thus, the input sequence is rearranged at the output using:

Efficient FPGA Implementation of a CTC Turbo Decoder for WiMAX/LTE Mobile Systems http://dx.doi.org/10.5772/67017 29

Figure 2. LTE turbo coding scheme.


$$C'\_i = C\_{\pi(i)}, \quad i = 1, 2, \ldots, K \,, \tag{2}$$

where the interleaving function π applied over the output index i is defined as

$$
\pi(i) = (f\_1 \cdot i + f\_2 \cdot i^2) \bmod K \tag{3}
$$

The input block length K and the parameters f1 and f2 are provided in Table 5.1.3-3 in Ref. [19].
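The QPP permutation of Eqs. (2) and (3) is straightforward to sketch in software. The code below uses 0-based indexing (as the 3GPP specification does), and the pair f1 = 3, f2 = 10 is the Table 5.1.3-3 entry for K = 40; this is only a minimal sketch, not the hardware interleaver discussed later.

```python
def qpp_interleave(block, f1, f2):
    """Apply the QPP interleaver of Eq. (3): pi(i) = (f1*i + f2*i^2) mod K,
    so that the output satisfies c'[i] = c[pi(i)] as in Eq. (2)."""
    K = len(block)
    return [block[(f1 * i + f2 * i * i) % K] for i in range(K)]

# K = 40 with f1 = 3, f2 = 10 (first entry of Table 5.1.3-3 in Ref. [19])
c = list(range(40))
c_prime = qpp_interleave(c, f1=3, f2=10)
assert sorted(c_prime) == c   # pi is a true permutation of the indices
```

The quadratic term is what gives the QPP interleaver the algebraic properties exploited by the parallel architectures of Refs. [9] and [10].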

## 3. The decoding algorithm

### 3.1. WiMAX systems

The decoding architecture consists of two decoding units called constituent decoders. Each such unit receives systematic bits (in natural order or interleaved) and parity bits, as shown in Figure 1.

The block diagram implements a Max-Log-MAP (maximum logarithmic maximum a posteriori) algorithm. For turbo binary codes, the decoder represents, in the log-likelihood ratio (LLR) space, each binary symbol as a single likelihood ratio. In the case of turbo duo-binary codes, however, the decoding unit requires three likelihood ratios in the same space. If we consider the duo-binary pair Ak and Bk, the LLRs may be computed as:

$$\Lambda\_{a,b}(A\_k, B\_k) = \log \frac{P(A\_k = a, B\_k = b)}{P(A\_k = 0, B\_k = 0)} \tag{4}$$

where (a, b) is (0, 1), (1, 0), or (1, 1). The ratio set is updated by each decoding unit (constituent decoder) for each input pair, using the corresponding LLRs and parity bits, also seen as LLRs. Then, the output LLRs minus the input LLRs provide the extrinsic values. The trellis for a duo-binary code contains eight states, each state with four inputs and four outputs, as presented in Figure 3. Using the systematic and parity pair LLRs, for each branch, the metric $\gamma_k(S_i \to S_j)$ is computed, i.e.,

$$\gamma\_k(S\_i \to S\_j) = \Lambda^i\_{a,b}(A\_k, B\_k) + w\,\Lambda(W\_k) + y\,\Lambda(Y\_k) \tag{5}$$
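Stepping back to Eq. (4), the three duo-binary LLRs can be illustrated with a small numerical sketch; the probability values below are invented purely for illustration.

```python
import math

def duo_binary_llrs(p):
    """Compute the three LLRs of Eq. (4) for a duo-binary pair (A_k, B_k),
    each referenced against the pair (0, 0)."""
    return {(a, b): math.log(p[(a, b)] / p[(0, 0)])
            for (a, b) in [(0, 1), (1, 0), (1, 1)]}

# Hypothetical symbol probabilities for one received pair
p = {(0, 0): 0.4, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.3}
llrs = duo_binary_llrs(p)
# A negative LLR means the pair is less likely than (0, 0); here all three
# values are negative because (0, 0) is the most probable pair.
```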

Figure 3. WiMAX decoder trellis.

The constituent decoder (Figure 4) performs the corresponding forward and backward processing over the trellis. When moving forward, the decoder computes the unnormalized metric $\alpha'_{k+1}(S_j)$ corresponding to each computed normalized metric $\alpha_k(S_i)$ associated with state $S_i$, using (Figure 4)

$$\alpha'\_{k+1}(S\_j) = \max\_{S\_i \to S\_j} \{ \alpha\_k(S\_i) + \gamma\_k(S\_i \to S\_j) \}\tag{6}$$


Figure 4. Decoder block scheme.


where the "maximum" operator is executed over all four branches entering the state Sj at time stamp k + 1. Once the metrics for all states are updated at time stamp k + 1, the decoder normalizes them against the value of state S0. Analogously to the forward processing, for the backward movement, the decoder computes:

$$\boldsymbol{\beta}\_{k}^{'}(\mathcal{S}\_{i}) = \max\_{\mathcal{S}\_{i} \to \mathcal{S}\_{j}} \{ \boldsymbol{\beta}\_{k+1}(\mathcal{S}\_{j}) + \boldsymbol{\gamma}\_{k}(\mathcal{S}\_{i} \to \mathcal{S}\_{j}) \}\tag{7}$$

where the operator "maximum" and the normalization method are similar to Eq. (6).
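The updates of Eqs. (6) and (7), together with the S0 normalization, can be sketched independently of the exact trellis. The connectivity dictionary and the branch metrics below are placeholders, since the actual branch structure comes from Figure 3.

```python
def forward_step(alpha, gamma, predecessors):
    """One stage of Eq. (6) followed by normalization against state S0.
    'predecessors[j]' lists the states S_i with a branch into S_j, and
    gamma[(i, j)] is the branch metric; both are placeholders here."""
    raw = [max(alpha[i] + gamma[(i, j)] for i in predecessors[j])
           for j in range(len(alpha))]
    return [m - raw[0] for m in raw]   # normalize versus state S0

# Toy 2-state trellis, fully connected, with made-up branch metrics
gamma = {(0, 0): 1.0, (1, 0): 2.0, (0, 1): 0.0, (1, 1): 3.0}
alpha_next = forward_step([0.0, 0.0], gamma, {0: [0, 1], 1: [0, 1]})
# raw metrics are [2.0, 3.0]; after normalization: [0.0, 1.0]
```

The backward update of Eq. (7) is the mirror image: replace the predecessor lists with successor lists and α with β.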

The initialization with null values is carried out for all the forward and backward metrics at all states. Once the new values are computed and stored, the decoding unit executes the second step of the decoding procedure, i.e., computing the LLRs as in Eq. (4). The decoding unit starts by computing the likelihood ratio for each branch

$$Z\_k(S\_i \to S\_j) = \alpha\_k(S\_i) + \gamma\_k(S\_i \to S\_j) + \beta\_{k+1}(S\_j) \tag{8}$$

and continues with the value

$$t\_k(a,b) = \max\_{S\_i \to S\_j : (a,b)} \{ Z\_k \} \tag{9}$$

where the operator "maximum" is computed over all eight branches generated by the pair (a, b). At the end, the output LLR is computed as

$$
\Lambda\_{a,b}^{o}(A\_k, B\_k) = t\_k(a, b) - t\_k(0, 0) \tag{10}
$$

The decoding procedure is executed for a decided number of iterations or until a convergence criterion is reached. Then, a final decision is taken over the bits. This is achieved by computing for each bit from the pair (Ak, Bk) the corresponding LLR:

$$\Lambda(A\_k) = \max\{\Lambda\_{1,0}^o(A\_k, B\_k), \Lambda\_{1,1}^o(A\_k, B\_k)\} - \max\{\Lambda\_{0,0}^o(A\_k, B\_k), \Lambda\_{0,1}^o(A\_k, B\_k)\}\tag{11}$$

$$\Lambda(B\_k) = \max\{\Lambda\_{0,1}^o(A\_k, B\_k), \Lambda\_{1,1}^o(A\_k, B\_k)\} - \max\{\Lambda\_{0,0}^o(A\_k, B\_k), \Lambda\_{1,0}^o(A\_k, B\_k)\},\tag{12}$$

where $\Lambda^o_{0,0}(A_k, B_k) = 0$. Finally, by comparing each LLR with a null threshold, i.e., looking at the sign, the hard decision is made.
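The final decision of Eqs. (11) and (12) reduces to a few max operations and a sign check; the LLR values below are invented for illustration.

```python
def hard_decision(llr):
    """Decide the pair (A_k, B_k) from the output LLRs of Eqs. (11)-(12);
    'llr' maps each pair (a, b) to Lambda^o_{a,b}, with llr[(0, 0)] = 0."""
    lam_a = max(llr[(1, 0)], llr[(1, 1)]) - max(llr[(0, 0)], llr[(0, 1)])
    lam_b = max(llr[(0, 1)], llr[(1, 1)]) - max(llr[(0, 0)], llr[(1, 0)])
    # null-threshold comparison: the sign of each LLR gives the hard bit
    return int(lam_a > 0), int(lam_b > 0)

# Pair (1, 1) carries the largest LLR, so both bits are decided as 1
a_k, b_k = hard_decision({(0, 0): 0.0, (0, 1): -1.2, (1, 0): 0.5, (1, 1): 2.3})
```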

### 3.2. LTE systems

The decoding architecture for the LTE systems is presented in Figure 5. The two decoding units, called recursive systematic convolutional (RSC) decoders, theoretically use the MAP algorithm. The MAP solution, a classical one, ensures the best decoding performance. Unfortunately, at the same time, it is characterized by an increased implementation complexity and may involve variables with a large dynamic range. For these reasons, the classical MAP algorithm is used only as a reference for the expected decoding performance. When it comes to real implementations, suboptimal algorithms have been studied: Logarithmic MAP (Log MAP) [20], Max Log MAP, Constant Log MAP (Const Log MAP) [21], and Linear Log MAP (Lin Log MAP) [22].

Figure 5. LTE turbo decoder.

For the LTE systems, we consider a decoding architecture based on the Max Log MAP algorithm. This suboptimal algorithm overcomes the problems of implementation complexity and dynamic range at the price of lower decoding performance compared with the MAP algorithm. However, this degradation can be kept within accepted limits. Starting from the Jacobi logarithm, only the first term is used by the Max Log MAP algorithm, i.e.,

$$\max{}^\*(x, y) = \ln(e^x + e^y) = \max(x, y) + \ln(1 + e^{-|y - x|}) \approx \max(x, y) \,. \tag{13}$$
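The quality of the approximation in Eq. (13) is easy to check numerically: the dropped correction term is bounded by ln 2 and decays quickly as |y − x| grows.

```python
import math

def max_star(x, y):
    """Exact Jacobi logarithm ln(e^x + e^y) from Eq. (13)."""
    return max(x, y) + math.log1p(math.exp(-abs(y - x)))

# Worst case (x == y): the Max Log MAP approximation is off by ln 2
err_worst = max_star(1.0, 1.0) - max(1.0, 1.0)   # = ln 2, about 0.693
# Well-separated operands: the correction term is already negligible
err_far = max_star(8.0, 0.0) - max(8.0, 0.0)     # below 1e-3
```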

The trellis diagram for the turbo decoding architecture of the LTE systems contains eight states, as presented in Figure 6. Each state of the diagram has two inputs and two outputs. The branch metric between the states Si and Sj is


Figure 6. LTE turbo coder trellis.


$$\gamma\_{ij} = V(X\_k)\,X(i, j) + \Lambda^i(Z\_k)\,Z(i, j)\,,\tag{14}$$

where X(i,j) and Z(i,j) are, respectively, the data and the parity bits associated with one branch, and $\Lambda^i(Z_k)$ is the LLR of the input parity bit. For the SISO 1 decoding unit, this input LLR is $\Lambda^i(Z_k)$, whereas for SISO 2 it becomes $\Lambda^i(Z'_k)$. For SISO 1, $V(X_k) = V_1(X_k) = \Lambda^i(X_k) + W(X_k)$, whereas for SISO 2, $V(X_k) = V_2(X'_k) = \mathrm{IL}\{\Lambda^o_1(X_k) + W(X_k)\}$, where the "IL" operator denotes the interleaving procedure. In Figure 5, $W(X_k)$ is the extrinsic information, whereas $\Lambda^o_1(X_k)$ and $\Lambda^o_2(X'_k)$ are the output LLRs generated by the two SISOs.

Looking at the LTE turbo encoder trellis, one can notice that between two states, there are four possible values for the branch metrics:

$$\begin{array}{l} \gamma\_{0} = 0\\ \gamma\_{1} = V(X\_{k})\\ \gamma\_{2} = \Lambda^{i}(Z\_{k})\\ \gamma\_{3} = V(X\_{k}) + \Lambda^{i}(Z\_{k}) \end{array} \tag{15}$$
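Because only these four values exist per stage, a decoder can precompute them once and let every branch index into the table. A minimal sketch (the 2·z + x ordering is just one possible convention, chosen here for illustration):

```python
def branch_metric_table(v_xk, llr_zk):
    """Per-stage branch metrics of Eq. (15). With Eq. (14) in mind, the
    entry for a branch with data bit x and parity bit z is v_xk*x + llr_zk*z;
    the table is ordered by index 2*z + x."""
    return [0.0, v_xk, llr_zk, v_xk + llr_zk]

# Hypothetical per-stage inputs V(X_k) and Lambda^i(Z_k)
table = branch_metric_table(v_xk=1.5, llr_zk=-0.5)
# table == [0.0, 1.5, -0.5, 1.0]
```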

The LTE decoding process follows a similar approach as for WiMAX systems, i.e., it moves forward and backward through the trellis.

#### 3.2.1. Backward recursion

The algorithm moves backward over the trellis computing the metrics. The obtained values for each node are stored in a normalized manner; they will be used for the LLR computation once the algorithm starts moving forward through the trellis. We denote by $\beta_k(S_i)$ the backward metric computed at the kth stage for the state $S_i$, where $2 \le k \le K+3$ and $0 \le i \le 7$. For the backward recursion, the initialization $\beta_{K+3}(S_i) = 0$, $0 \le i \le 7$, is used at the stage $k = K+3$. For the rest of the stages, $2 \le k \le K+2$, the computed backward metrics are

$$\hat{\beta}\_{k}(S\_{i}) = \max \{ \left( \beta\_{k+1}(S\_{j\_1}) + \gamma\_{ij\_1} \right), \left( \beta\_{k+1}(S\_{j\_2}) + \gamma\_{ij\_2} \right) \}\,,\tag{16}$$

where $S_{j_1}$ and $S_{j_2}$ are the two states from stage k + 1 connected to the state $S_i$ from stage k, and $\hat{\beta}_k(S_i)$ represents the unnormalized metric. Once the unnormalized metric $\hat{\beta}_k(S_0)$ is computed for state $S_0$, all the backward metrics for states $S_1, \ldots, S_7$ are normalized as

$$
\beta\_k(\mathcal{S}\_i) = \hat{\beta}\_k(\mathcal{S}\_i) - \hat{\beta}\_k(\mathcal{S}\_0) \tag{17}
$$

and then stored in the dedicated memory.
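One stage of the backward recursion, Eqs. (16) and (17), may be sketched as follows; the successor lists and branch metrics below are placeholders standing in for the real LTE trellis of Figure 6.

```python
def backward_stage(beta_next, gamma, successors):
    """Eq. (16): for each state S_i, max over its two successor branches;
    Eq. (17): normalize everything against state S0 before storing.
    'successors[i]' gives the pair (j1, j2) and gamma[(i, j)] the branch
    metric; the connectivity used below is NOT the Figure 6 trellis."""
    raw = [max(beta_next[j] + gamma[(i, j)] for j in successors[i])
           for i in range(len(beta_next))]
    return [b - raw[0] for b in raw]

# Placeholder 8-state connectivity with zero branch metrics
succ = {i: ((2 * i) % 8, (2 * i + 1) % 8) for i in range(8)}
gamma = {(i, j): 0.0 for i in range(8) for j in range(8)}
beta = backward_stage([0, 1, 2, 3, 4, 5, 6, 7], gamma, succ)
# raw metrics: [1, 3, 5, 7, 1, 3, 5, 7] -> normalized: [0, 2, 4, 6, 0, 2, 4, 6]
```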

#### 3.2.2. Forward recursion

When the backward recursion is finished, the algorithm moves forward through the trellis in the normal direction. This phase of the decoding is similar to the one in the Viterbi algorithm. In this case, storage is needed only for the previous-stage metrics, i.e., for computing the metrics of the current stage k, only the forward metrics from the previous stage k − 1 are needed. We denote by $\alpha_k(S_i)$ the forward metric corresponding to state $S_i$ at the stage k, where $0 \le k \le K$ and $0 \le i \le 7$. For the forward recursion, the initialization $\alpha_0(S_i) = 0$, $0 \le i \le 7$, is used at the stage k = 0. For the rest of the stages, $1 \le k \le K$, the unnormalized forward metrics are computed as

$$\hat{\alpha}\_k(S\_j) = \max \left\{ \left( \alpha\_{k-1}(S\_{i\_1}) + \gamma\_{i\_1 j} \right), \left( \alpha\_{k-1}(S\_{i\_2}) + \gamma\_{i\_2 j} \right) \right\}, \tag{18}$$

where $S_{i_1}$ and $S_{i_2}$ are the two states from stage k − 1 connected to the state $S_j$ from stage k. Once the unnormalized metric $\hat{\alpha}_k(S_0)$ is computed for state $S_0$, all the forward metrics for states $S_1, \ldots, S_7$ are normalized as

$$
\alpha\_k(S\_i) = \hat{\alpha}\_k(S\_i) - \hat{\alpha}\_k(S\_0) \,. \tag{19}
$$

The decoding algorithm can now obtain an LLR estimate for the data bits Xk, since for each stage k it has both the forward metrics just computed and the backward metrics stored in memory. First, this LLR is obtained by computing the likelihood of the connection between the state $S_i$ at stage k − 1 and the state $S_j$ at stage k as

$$Z\_k(\mathcal{S}\_i \to \mathcal{S}\_j) = a\_{k-1}(\mathcal{S}\_i) + \gamma\_{ij} + \beta\_k(\mathcal{S}\_j) \,. \tag{20}$$

The likelihood of having a bit equal to 0 (or 1) is given by the Jacobi logarithm over all the branch likelihoods corresponding to 0 (or 1), and thus:

$$\Lambda^o(X\_k) = \max\_{(S\_i \to S\_j): X\_i = 1} \{ Z\_k(S\_i \to S\_j) \} - \max\_{(S\_i \to S\_j): X\_i = 0} \{ Z\_k(S\_i \to S\_j) \} \,, \tag{21}$$

where the "max" operator is computed recursively over the branches that have an input bit of 1, $\{(S_i \to S_j): X_i = 1\}$, or an input bit of 0, $\{(S_i \to S_j): X_i = 0\}$.
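Eqs. (20) and (21) combine into a per-stage LLR computation. The sketch below runs over a made-up branch list; the real transitions, of course, come from the Figure 6 trellis.

```python
def stage_llr(alpha_prev, beta_cur, branches):
    """For one stage k: Eq. (20) scores every branch S_i -> S_j, then
    Eq. (21) takes the difference between the best bit-1 and bit-0 branches.
    'branches' holds (i, j, gamma_ij, x_bit) tuples (placeholder trellis)."""
    z = {0: [], 1: []}
    for i, j, g, x in branches:
        z[x].append(alpha_prev[i] + g + beta_cur[j])   # Eq. (20)
    return max(z[1]) - max(z[0])                       # Eq. (21)

# Toy 2-state stage with invented metrics and branch labels
branches = [(0, 0, 1.0, 0), (0, 1, 2.0, 1), (1, 0, 0.5, 1), (1, 1, 0.0, 0)]
llr = stage_llr([0.0, 0.0], [0.0, 0.0], branches)
# best bit-1 branch scores 2.0, best bit-0 branch 1.0 -> LLR = 1.0
```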
