### 6. Implementation results

#### 6.1. WiMAX systems

46 Field - Programmable Gate Array

We can rewrite Eq. (3) as Eqs. (23) and (24):

$$p(0) = 0, \tag{23}$$

$$p(i) = p(i-1) + s_1 + s_2(i), \quad i > 0, \tag{24}$$

where $s_1 = f_1$ and

$$s_2(i) = \begin{cases} 0, & i = 0, \\ f_2, & i = 1, \\ s_2(i-1) + 2f_2, & i > 1 \end{cases} \tag{25}$$

$$\pi(i) = p(i) \bmod K = \left[ p(i-1) + s_1 + s_2(i) \right] \bmod K \tag{26}$$

The multiplications are replaced by additions, which require fewer hardware resources. Nevertheless, a division is still necessary for the modulo operation. If we consider the modulo operator applied to a sum of elements, expressed as

$$\left( \sum_k c_k \right) \bmod K = \left[ \sum_k \left( c_k \bmod K \right) \right] \bmod K \tag{27}$$

we can decrease the arithmetic effort needed to obtain π(i) in Eq. (26). The number of modulo operations becomes larger, but the overall complexity of the corresponding divisions is reduced, since smaller quotients are used. Consequently, using Eqs. (25)–(27), one can write:

$$s_3(i) = s_1 + s_2(i) = \begin{cases} 0, & i = 0, \\ f_1 + f_2, & i = 1, \\ s_3(i-1) + 2f_2, & i > 1 \end{cases} \tag{28}$$

Using Eq. (28) in Eq. (26), the result is

$$\pi(i) = p(i) \bmod K = \left[ p(i-1) + s_3(i) \right] \bmod K = \left[ p(i-1) + s_3(i-1) + 2f_2 \right] \bmod K = \left[ \pi(i-1) + s_3(i-1) \bmod K + 2f_2 \bmod K \right] \bmod K \tag{29}$$

All of the numerical values added in the last stage of Eq. (29) are lower than K and are available recursively (during the processing of a distinct frame), such as π(i−1) and s3(i−1) mod K, or they can be predetermined and stored, as is the case of 2f2 mod K. The overall arithmetic complexity is reduced to 2K additions and 2K simplified modulo operations (i.e., each is resolvable using a comparison and a subtraction) for the address generation module. The method improves the solutions presented in [24, 25] by eliminating any multiplications or divisions. Additionally, the lower numerical range of the operands (with values lower than 2K, i.e., values in the range of thousands) allows the usage of minimal resources for the representation of binary values.

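To make the recursion concrete, the address generation of Eqs. (23)–(29) can be modeled in software. The sketch below is our illustration, not the authors' hardware description; the parameter pair f1 = 3, f2 = 10 for K = 40 is used only as an example of a valid quadratic permutation polynomial.

```python
def qpp_addresses(K, f1, f2):
    """Compute pi(i) = (f1*i + f2*i^2) mod K for i = 0..K-1 via Eq. (29),
    using only additions and comparison/subtraction-based modulo steps."""
    pi = 0                 # pi(0) = 0
    s3 = (f1 + f2) % K     # s3(1) mod K, updated recursively
    inc = (2 * f2) % K     # 2*f2 mod K, predetermined and stored
    out = [pi]
    for _ in range(1, K):
        pi += s3           # one addition ...
        if pi >= K:        # ... and a simplified modulo:
            pi -= K        # a comparison and a subtraction (sum < 2K)
        out.append(pi)
        s3 += inc          # next s3(i) mod K, same simplified modulo
        if s3 >= K:
            s3 -= K
    return out

# Example: a valid QPP pair for K = 40 (illustrative choice)
addresses = qpp_addresses(40, 3, 10)
```

Each loop iteration uses only two additions and two comparison/subtraction pairs, matching the 2K additions and 2K simplified modulo operations mentioned above.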
The estimated system frequency when implementing the decoding structure on a Xilinx XC4VLX80-11FF1148 chip using the Xilinx ISE 11.1 tool is 125 MHz. The reserved chip area is around 3000 slices (8.37%) out of a total of 35,840. The results are comparable with the assessments presented in [26].

The decoding latency and decoding rate corresponding to the above-mentioned clock frequency (see Table 1) are

$$\text{Latency} = 2L(2K + 10) \tag{30}$$

$$R_b = \frac{2K}{2L(2K+10)T_{clk}} \tag{31}$$


Table 1. Latency and throughput.

The implementation delay is represented by 10 clock periods per half-iteration, which is added to the theoretical latency of the MAP algorithm (4KL clock periods for L iterations).
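As an illustrative numeric sketch of Eqs. (30) and (31) (the helper names are ours; K is taken as the number of duo-binary symbol couples, so that one block carries 2K bits, as in Eq. (31)):

```python
# Numeric sketch of Eqs. (30)-(31); helper names are ours, not the authors'.
F_CLK = 125e6          # estimated system frequency [Hz]
T_CLK = 1.0 / F_CLK

def latency_clocks(K, L):
    # Eq. (30): 2L half-iterations, each taking 2K + 10 clock periods
    return 2 * L * (2 * K + 10)

def throughput_bps(K, L):
    # Eq. (31): 2K decoded bits per block
    return 2 * K / (latency_clocks(K, L) * T_CLK)

# Example (our assumption): the 6-byte block -> K = 24 couples, L = 4 iterations
lat = latency_clocks(24, 4)            # 464 clock periods
rb_mbps = throughput_bps(24, 4) / 1e6  # ~12.93 Mbps
```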

In Figure 16, the decoding performances are presented for a quadrature phase shift keying (QPSK) modulation, rate ½, 1–4 iterations, a block size of 6 bytes (the smallest possible) and a transmission simulated over an additive white Gaussian noise (AWGN) channel. The results represent a worst-case scenario, since the test was performed for the smallest block size.

#### 6.2. LTE systems

Figures 11 and 15 show that the decoding latency is reduced in the case of parallel decoding by a factor almost equal to N. The presented implementation has an 11-clock-period delay (denoted Delay), which is added for each forward trellis run (when the LLRs are computed). As a consequence, two such values must be considered during each iteration.

For serial decoding, the native latency is computed as follows: the first semi-iteration requires K clock periods for the backward trellis run and another (K + Delay) clock periods for the forward trellis run and the LLR computation. This value is then doubled in order to take the second semi-iteration into account. Denoting by L the number of executed iterations,

Figure 16. The impact of the number of iterations on decoding performances.

the numbers of clock periods required for a serial and for a parallel block decoding operation, respectively, result as:

$$\text{Latency}_s = (4K + 2\,\text{Delay})L \tag{32}$$

$$\text{Latency}_p = (4K/N + 2\,\text{Delay})L \tag{33}$$

When testing the parallel decoding performances, a certain level of degradation was observed, since the forward and backward metrics are altered at the data block boundaries. In order to obtain a performance similar to the serial decoding case, a small overhead is accepted: by introducing an overlap at each parallel block border, the metrics computation gains a training phase. The minimum overlap window length is selected to cover the minimum standard-defined data block (in this case, Kmin = 40 bits).

Figure 17 shows this situation for the N = 2 setup. If we consider N > 2, which leads to inner blocks with a Kmin overlap at both the left and right sides, the corresponding latency can be expressed as:

$$\text{Latency}_{po} = \left( 4(K/N + 2K_{\min}) + 2\,\text{Delay} \right) L \tag{34}$$
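To make the comparison concrete, Eqs. (32)–(34) can be evaluated numerically. The sketch below is ours (the function names are not from the original design); it uses Delay = 11 and Kmin = 40, the values given in the text.

```python
# Numeric sketch of Eqs. (32)-(34); function names are ours.
DELAY = 11   # clock periods added per forward trellis run
K_MIN = 40   # minimum standard-defined block, used as overlap window [bits]

def latency_s(K, L):
    return (4 * K + 2 * DELAY) * L                      # Eq. (32), serial

def latency_p(K, L, N):
    return (4 * K // N + 2 * DELAY) * L                 # Eq. (33), parallel

def latency_po(K, L, N):
    return (4 * (K // N + 2 * K_MIN) + 2 * DELAY) * L   # Eq. (34), with overlap

# Example: K = 1536, L = 3, N = 8
serial = latency_s(1536, 3)        # 18498 clock periods (~88.1 us at 210 MHz)
parallel = latency_p(1536, 3, 8)   # 2370 clock periods
overlap = latency_po(1536, 3, 8)   # 3330 clock periods
```

With these values, the overlapping split costs 8·Kmin·L = 960 extra clock periods regardless of K, which is why its relative overhead shrinks as K grows.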

Efficient FPGA Implementation of a CTC Turbo Decoder for WiMAX/LTE Mobile Systems http://dx.doi.org/10.5772/67017 49

Figure 17. (a) Non-overlapping split; (b) overlapping split.

For the even-odd merge sorting network implementation, we can study the configuration K = 40 bits and N = 8. The ILM content consists of the 40 interleaved addresses, organized in five memory locations with eight addresses per location. The minimum detected value for each ILM location (i.e., the natural-order memory location that will be

accessed) is contained in the output of the sorting unit. The module also provides the order used to send the data read from the natural-order memory location to the N decoding units. In this example, at the third clock period, the second ILM location is read, i.e., the addresses 6, 31, 36, 21, 26, 11, 16 and 1. The sorting module labels these addresses with an index, obtaining the pairs (6, 0), (31, 1), (36, 2), (21, 3), (26, 4), (11, 5), (16, 6) and (1, 7). The addresses are then arranged in increasing order: (1, 7), (6, 0), (11, 5), (16, 6), (21, 3), (26, 4), (31, 1) and (36, 2). At the same time, the minimum address found at this location is sent to the output, 1 in this example. In conclusion, location number 1 is read from the natural-order data memory. The eight samples from location 1 are distributed to the eight decoding units as indicated by the output index: the first sample from this location is sent to decoder unit 7, the second sample to decoder unit 0, the third one to decoder unit 5 and so on. As Figure 18 shows, at the register transfer level (RTL), besides flip-flops, the sorting unit includes only basic selection elements.
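The sorting flow above can be modeled in software. The following Python sketch (ours, not the authors' VHDL) generates the comparator list of a Batcher even-odd merge sorting network and applies it to the (address, index) pairs from the example:

```python
def oddeven_merge_sort_network(n):
    """Comparator list (i, j) for Batcher's even-odd merge sort.

    n must be a power of two; lo/hi index bounds are inclusive."""
    cmps = []

    def merge(lo, hi, r):           # even-odd merge of stride-r runs in [lo, hi]
        step = r * 2
        if step < hi - lo:
            merge(lo, hi, step)     # merge even subsequence
            merge(lo + r, hi, step) # merge odd subsequence
            cmps.extend((i, i + r) for i in range(lo + r, hi - r, step))
        else:
            cmps.append((lo, lo + r))

    def sort(lo, hi):
        if hi - lo >= 1:
            mid = lo + (hi - lo) // 2
            sort(lo, mid)
            sort(mid + 1, hi)
            merge(lo, hi, 1)

    sort(0, n - 1)
    return cmps

def apply_network(values, cmps):
    a = list(values)
    for i, j in cmps:
        if a[j] < a[i]:             # basic selection element: compare and swap
            a[i], a[j] = a[j], a[i]
    return a

# Worked example from the text: second ILM location, labeled with indexes
addrs = [6, 31, 36, 21, 26, 11, 16, 1]
pairs = [(a, idx) for idx, a in enumerate(addrs)]
out = apply_network(pairs, oddeven_merge_sort_network(8))
# out[0] holds the minimum address and its index: (1, 7)
# resulting index order: [7, 0, 5, 6, 3, 4, 1, 2]
```

For eight inputs the network contains 19 compare-and-swap operations; each one corresponds to a basic selection element such as the one in Figure 18.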


Figure 18. Basic selection element for binary inputs.

It can be seen in Figure 19 that the sorting unit allows pipelined data processing. Consequently, with a certain implementation delay (7 clock periods in the proposed scheme), the module provides a value belonging to the set of sorted indexes at each clock cycle.


Figure 19. Even-odd merge sort – ModelSim simulation.

It is important to mention that even-odd merge sorting was selected because it allows pipelined operation, while also consuming fewer resources than the other listed methods. Some comparative results, in terms of resources used for an application-specific integrated circuit (ASIC) implementation, were provided in [11, 27].

In order to evaluate the performances, we used the Very High Speed Integrated Circuit Hardware Description Language (VHDL). The code was tested using ModelSim 6.5. For the generation of RAM/ROM memory blocks, Xilinx Core Generator 14.7 was employed, and the synthesis process was accomplished using Xilinx XST from Xilinx ISE 14.7. Using the above-mentioned tools, the resulting values for the decoding structure implemented on a Xilinx XC5VFX70T-FFG1136 are the following [28]: a frequency of 310 MHz, with 664 flip-flops and 568 LUTs, for the sorting unit and, respectively, a frequency of 300 MHz, with 1578 flip-flop registers and 1708 LUTs, for the interleaver.

The values listed in Table 2 are obtained using Eqs. (32)–(34), when N = 8 is considered. One can observe that the overhead introduced by the overlapping split method becomes less significant for larger values of K, which is the scenario where a parallel approach is usually applied. The achieved overall system frequency is 210 MHz, with the longest signal propagation time required by the SISO unit.


Table 3 provides the corresponding throughput rate when the values from Table 2 are used.

| K | Latency\_s [μs], L = 3 | Latency\_s [μs], L = 4 | Latency\_p [μs], L = 3 | Latency\_p [μs], L = 4 | Latency\_po [μs], L = 3 | Latency\_po [μs], L = 4 |
|------|-------|-------|-------|-------|-------|-------|
| 1536 | 88.08 | 117.4 | 11.28 | 15.04 | 15.85 | 21.14 |
| 4096 | 234.3 | 312.5 | 29.57 | 39.42 | 34.14 | 45.52 |
| 6144 | 351.4 | 468.5 | 44.2 | 58.9 | 48.7 | 56.02 |

Table 2. Latency values for N = 8, L = 3 or 4 and K = 1536, 4096 or 6144.

Table 3. Throughput values for N = 8, L = 3 or 4 and K = 1536, 4096 or 6144.


As one can observe from Table 3, the serial decoding performance is similar to the theoretical one. Let us consider, for example, the case L = 3 and K = 6144. Considering the theoretical latency of 4KL clock periods, the theoretical throughput is 17.5 Mbps. After implementation, the obtained result for the proposed serial architecture is 17.48 Mbps.
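This arithmetic can be reproduced from Eq. (32), assuming Delay = 11 and the 210 MHz system clock (a sketch of the calculation, not of the original toolflow):

```python
# Cross-check of the serial throughput for K = 6144, L = 3 at f = 210 MHz.
F = 210e6
K, L, DELAY = 6144, 3, 11

theoretical = F / (4 * L)          # 4KL-clock MAP latency -> f/(4L) = 17.5 Mbps
clocks = (4 * K + 2 * DELAY) * L   # Eq. (32): implemented serial latency
implemented = K / (clocks / F)     # ~17.48 Mbps
```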

The following performance graphs were obtained using a finite-precision Matlab simulator. This approach was selected because Matlab produces the same outputs as the ModelSim simulator, while the testing time is considerably shorter.

All the simulation results were generated for the Max-Log-MAP algorithm. The figures present the bit error rate (BER) versus the signal-to-noise ratio (SNR), expressed as the ratio between the energy per bit and the noise power spectral density.

Figure 20 presents the attained performances for the case of K = 512, N = 2, L = 3 and QPSK modulation, using the three discussed decoding methods, i.e., the serial one, the parallel one without overlapped split and the parallel one with overlapped split. Figure 21 depicts the same performance comparison, this time for K = 1024 and N = 4.

Figure 20. Comparative decoding results for QPSK, L = 3, K = 512, N = 2.

Figure 21. Comparative decoding results for QPSK, L = 3, K = 1024, N = 4.

Analyzing the results presented in Figures 20 and 21, one can conclude that the decoding performance obtained when parallel decoding with the overlapped split method is used is almost identical to that of serial decoding. In contrast, parallel decoding without the overlapped split method generates some loss in performance when compared to serial decoding. This degradation depends on the parallelization factor N.
