**10.3 OFDM – FFT example**

146 Wireless Communications and Networks – Recent Advances

There are two basic methods to specify the design (Synfora, 2009). The number of clock cycles between iteration starts is called II (Initiation Interval) and the number of clock cycles to start all iterations is called MITI (Maximum Inter Task Interval). For this example, MITI can be as small as N\*II (where N is the number of loop iterations).

The user can provide a target maximum number of clock cycles taken per stage (MITI), and the tool will select the optimum components from its library of high-speed components to achieve higher levels of parallelism while sharing resources and achieving performance.

Fig. 8. Processing pipeline for Greville and Householder decomposition methods.

To provide a tradeoff between complexity and speedup, different implementations with different target MITIs were generated. It was noted that as timing constraints tightened, hardware multipliers were switched from two-cycle to one-cycle, and the number of multipliers increased so that complex multiplications (requiring three multiplies) could be completed in a single cycle.

MITI timing constraints were used to determine the lowest complexity implementation for each algorithm. The constraints within these ranges of target clock cycles were then used to produce a tradeoff between complexity and resulting speedup. The resulting ranges of targeted clock cycles were 230 to 330 for the Householder implementation and 130 to 210 for the Greville implementation.

The resulting speedup was calculated as the ratio of cycles on the DSP-only implementation to the cycles of the DSP-PPA implementation. The resulting silicon area was calculated from the estimated number of gates given by Pico Extreme, using a characterized CMOS 65 nm technology library with an estimate of 854,000 gates per mm². This technology was selected because it is the one in which the DSP was manufactured, so it can provide an estimate of the growth in DSP silicon area needed to enable MIMO processing. A plot of speedup vs. complexity for both clocks and both simulators is shown in Figure 9.

The resulting maximum speedups were close to 2.75 for the Greville algorithm and between 4 and 4.7 for the Householder QR decomposition algorithm. This speedup would result in a large reduction (129 μs for the Greville implementation and 521 μs for the Householder implementation) in the amount of time required to compute the channel equalization
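The three quantities used in this discussion (the smallest MITI for a loop, the speedup ratio, and the area estimate from a gate count) are simple enough to sketch in a few lines. The following is only an illustrative Python model: the function names are hypothetical, and any cycle or gate counts passed to them are made-up inputs, not results from the study. Only the 854,000 gates per mm² density comes from the text.

```python
# Illustrative model only; function names and example inputs are assumptions.
GATES_PER_MM2 = 854_000  # 65 nm library density quoted in the text


def min_miti(iterations: int, ii: int) -> int:
    """Smallest MITI for a loop of N iterations started every II cycles: N * II."""
    return iterations * ii


def speedup(dsp_only_cycles: int, dsp_ppa_cycles: int) -> float:
    """Ratio of DSP-only cycles to DSP-PPA cycles."""
    return dsp_only_cycles / dsp_ppa_cycles


def area_mm2(gate_count: int) -> float:
    """Silicon area estimated from a gate count and the library density."""
    return gate_count / GATES_PER_MM2


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the formulas.
    print(min_miti(iterations=64, ii=2))
    print(speedup(dsp_only_cycles=1000, dsp_ppa_cycles=250))
    print(area_mm2(gate_count=427_000))
```

Sweeping the target MITI and recording the pair (area, speedup) for each generated implementation is what produces a tradeoff curve of the kind described above.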

In (Mondragon-Torres, Kommi, & Bhattacharya, 2011), the author proposes the development of an OFDM educational platform that makes use of all the methodologies and tools presented in this chapter. The objective is a single system that allows students to explore different levels of abstraction in hardware design and to quantify the effects of their decisions on fixed-point precision, as well as on all the intermediate signal processing and conditioning through the datapath.

The heart of the OFDM modulation technique is the Fast Fourier Transform (FFT), a highly structured algorithm that converts a time domain signal into the frequency domain; the inverse FFT (IFFT) transforms it back into the time domain. Figure 10 shows a complete digital communication system that employs OFDM modulation (Cho, 2010).

The approach in OFDM systems is to encode digital information with traditional modulation techniques such as Quadrature Amplitude Modulation (QAM), which maps a series of bits into QAM symbols. The number of symbols used for each OFDM frame is traditionally a power of two. The IFFT of the frame is then computed to convert it into a time domain representation that can be further processed and sent through the transmitter chain and the antenna. On the receiver side the process is reversed: after frame synchronization, the FFT of the received block yields an estimate of the QAM symbols, which are mapped back into a series of bits. This sounds straightforward, but there are many subtle details that could be investigated, such as the effects of quantization, distortion, channel noise, multipath propagation, fading, Doppler shift, and synchronization.
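The transmit and receive steps just described can be sketched as a small round trip. This is a minimal Python model under stated assumptions: it uses a 4-point QAM constellation with an arbitrary bit mapping, a direct (slow) DFT in place of an FFT, an ideal channel, and no synchronization. The names `QAM4`, `modulate` and `demodulate` are hypothetical and are not taken from the chapter.

```python
import cmath

# Assumed 4-QAM constellation: 2 bits per symbol (mapping chosen for illustration).
QAM4 = {(0, 0): 1 + 1j, (0, 1): -1 + 1j, (1, 1): -1 - 1j, (1, 0): 1 - 1j}


def dft(x, inverse=False):
    """Direct O(N^2) DFT; the inverse transform is scaled by 1/N."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [v / n for v in out] if inverse else out


def modulate(bits):
    """Map bit pairs to QAM symbols, then take the inverse transform of the frame."""
    symbols = [QAM4[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]
    return dft(symbols, inverse=True)


def demodulate(samples):
    """Forward transform, then pick the nearest constellation point per symbol."""
    bits = []
    for s in dft(samples):
        nearest = min(QAM4, key=lambda b: abs(QAM4[b] - s))
        bits.extend(nearest)
    return bits


if __name__ == "__main__":
    tx_bits = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1]  # 8 symbols
    rx_bits = demodulate(modulate(tx_bits))
    print(rx_bits == tx_bits)
```

With an ideal channel the received bits match the transmitted ones. The effects listed above (quantization, noise, multipath and so on) would be modeled by altering the samples between `modulate` and `demodulate`.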

A very simple implementation of a 256-point FFT is presented in this section, as shown in Figure 12. No architectural decisions were made, and a regular textbook implementation is used just to demonstrate some of the capabilities of CatapultC. Figure 11 lists the technology parameters and some common definitions as a reference for the reader. Based on these definitions, we started to change the system parameters to get a feel for their implications.
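The program in Figure 12 is not reproduced here. As a stand-in, the following is a minimal Python sketch of a textbook iterative radix-2 FFT with the usual three nested loops (stages, groups, butterflies). It is an assumption about what a "regular textbook implementation" looks like, not the author's C code, and the function names are hypothetical.

```python
import cmath


def bit_reverse_permute(x):
    """Reorder the input so that in-place butterflies yield natural-order output."""
    n = len(x)
    bits = n.bit_length() - 1
    out = list(x)
    for i in range(n):
        j = int(format(i, f"0{bits}b")[::-1], 2)
        if i < j:
            out[i], out[j] = out[j], out[i]
    return out


def fft_radix2(x):
    """Iterative decimation-in-time FFT; the length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = bit_reverse_permute(x)
    size = 2
    while size <= n:                       # outer loop: log2(n) stages
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):    # middle loop: groups within a stage
            w = 1 + 0j
            for k in range(half):          # inner loop: butterflies
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b
                data[start + k + half] = a - b
                w *= w_step
        size *= 2
    return data


if __name__ == "__main__":
    impulse = [1] + [0] * 255              # 256-point unit impulse
    print(all(abs(v - 1) < 1e-9 for v in fft_radix2(impulse)))
```

Each stage reads values written by the previous stage, which is the kind of loop-carried dependency that limits how far a tool can unroll or pipeline the transform without restructuring the code.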

Figure 13 shows how unrolling and pipelining the input and output operations drastically reduces the latency. The price for this is memory bandwidth. The area appears to remain constant only because no memories have been considered in these solutions.

Fig. 10. Digital communications system using OFDM modulation.

Figures 14 and 15 show the complexity of the solution, and we can observe that most of the area is used by multiplexers to route the signals. On the other hand, more memory will be required for unrolling *printing* and pipelining *reading*. So far we have not touched a single line of code: just by modifying the constraints on the outer input and output loops we have been able to reduce the latency by 2x at the cost of 2x memory. This is a simple illustration of using the same code to trade off performance vs. complexity.

**Technology used:** Generic CMOS ASIC 90 nm, 200 MHz

**Definitions**

*Loop unrolling:* Loop unrolling can be used to compute multiple loop iterations in parallel.

*Partial unrolling:* Computes 'n' copies in parallel.

*Pipelining:* Starts the next loop iteration before the current iteration of the data path contained in the loop has completed.

*Initiation Interval:* Indicates how often to start a new loop iteration.


*Latency:* Latency refers to the time, in clock cycles, from the first input to the first output.

*Throughput:* Throughput, not to be confused with IO throughput, refers to how often, in clock cycles, a function call can complete.

Fig. 11. Technology used and some common definitions.
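The definitions of unrolling, pipelining and initiation interval can be turned into a rough cycle-count model. The sketch below is an idealized approximation, not the scheduling model used by the tool: it assumes that unrolled copies run fully in parallel, that the loop has no carried dependencies, and that memory can supply every access the schedule asks for. The function names and the example numbers are hypothetical.

```python
import math


def sequential_cycles(iterations: int, body_cycles: int) -> int:
    """One iteration at a time: N * body latency."""
    return iterations * body_cycles


def unrolled_cycles(iterations: int, body_cycles: int, factor: int) -> int:
    """Partial unrolling by 'factor': that many copies of the body run in parallel."""
    return math.ceil(iterations / factor) * body_cycles


def pipelined_cycles(iterations: int, body_cycles: int, ii: int) -> int:
    """A new iteration starts every II cycles; the last one still needs the full body."""
    return (iterations - 1) * ii + body_cycles


if __name__ == "__main__":
    n, body = 256, 4                      # hypothetical loop: 256 iterations, 4-cycle body
    print(sequential_cycles(n, body))     # 1024
    print(unrolled_cycles(n, body, 2))    # 512
    print(pipelined_cycles(n, body, 1))   # 259
```

In this model, unrolling by 2 halves the cycle count, but it also doubles the number of memory accesses needed in the same cycle. That is one way to read the tradeoff of latency against memory bandwidth discussed in this section.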

Fig. 12. Program to compute 256 point FFT.

The FFT algorithm itself has not been optimized, because of the data dependency between the inner and outer loops. Additional pipeline stages will need to be implemented in order to break the loop dependency implicit in the direct implementation of the FFT. This proves the point that the designer has to guide the tool by writing the C code in such a way that the hardware can be inferred.

Another simple tradeoff was explored by increasing the frequency of operation from 100 MHz to 500 MHz, as shown in Figure 16. The area remained almost constant, while the latency in cycles increased by 3%; with respect to the 200 MHz baseline implementation, the latency in cycles increased by 19%. We can interpret these numbers as follows: the logic required to implement the FFT had a longer critical path, but since the clock was increased 2.5x, the latency time was still reduced by about 2.0x. This demonstrates that the relationship between the parameters is not linear and depends on the implementation produced under the particular constraints.
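The arithmetic behind this observation can be checked directly: latency time is the cycle count divided by the clock frequency, so a 2.5x faster clock combined with 19% more cycles gives a time reduction of 2.5 / 1.19, roughly 2.1x, in line with the approximate 2.0x quoted. The Python sketch below uses hypothetical function names, and the baseline cycle count is a made-up placeholder because only the ratios matter.

```python
def latency_seconds(cycles: float, clock_hz: float) -> float:
    """Latency time is the number of cycles divided by the clock frequency."""
    return cycles / clock_hz


def time_speedup(clock_ratio: float, cycle_growth: float) -> float:
    """Latency-time improvement when the clock scales by clock_ratio
    and the cycle count grows by the fraction cycle_growth."""
    return clock_ratio / (1.0 + cycle_growth)


if __name__ == "__main__":
    base_cycles = 1000                    # placeholder value, not from the chapter
    t_200 = latency_seconds(base_cycles, 200e6)
    t_500 = latency_seconds(base_cycles * 1.19, 500e6)
    print(t_200 / t_500)                  # same ratio as time_speedup(2.5, 0.19)
    print(time_speedup(2.5, 0.19))
```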

Regarding power, increasing the frequency by 2.5x will have an impact on power consumption. At the same time, if the core is 2.0x faster, we can consider reusing the FFT for some other part of the OFDM processor, for example computing both the IFFT and the FFT on the same hardware by time-sharing it, rather than having two cores perform the two operations independently.
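One standard way to time-share a single forward FFT core between both directions is the conjugation identity IFFT(x) = conj(FFT(conj(x))) / N. The Python sketch below illustrates the identity with a small recursive FFT; it is an assumption about how such sharing could be done, not a description of the author's design, and the function names are hypothetical.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft_via_fft(x):
    """Inverse transform computed with the forward FFT only:
    conjugate the input, run the FFT, conjugate the result and scale by 1/N."""
    n = len(x)
    return [v.conjugate() / n for v in fft([c.conjugate() for c in x])]


if __name__ == "__main__":
    frame = [1 + 2j, -1j, 3, 0.5 - 0.5j, -2, 1j, 0, 4 + 1j]
    restored = ifft_via_fft(fft(frame))
    print(all(abs(a - b) < 1e-9 for a, b in zip(frame, restored)))
```

In hardware terms, the conjugations amount to negating the imaginary parts at the input and output, and the 1/N scaling is a shift when N is a power of two, so the extra logic around a shared core is small.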


Fig. 13. Different solutions by selecting different architectural constraints.

Fig. 14. Graphical view plotting Area.

Fig. 15. Graphical View plotting memory usage.


Fig. 16. Change in performance with change in frequency.
