**10.2 Improving the performance of DSP systems for MIMO processing**

In the paper "Improving the performance of DSP systems for MIMO processing" (Horner, Kwasinski, & Mondragon, 2011), we explored the efficient implementation of select Multiple Input Multiple Output (MIMO) communications algorithms. Two implementation approaches were considered: adding new instructions to the DSP instruction set and adding a hardware accelerator to the DSP system. Of the two approaches, the second was concluded to be best, as it resulted in notable processing speedups and a more efficient use of the computational resources.

While research into MIMO algorithms has reached a level of development that shows important improvements in wireless system performance, the development of DSP systems to implement them has limited the realization of these algorithms to the simplest and lowest-performing ones. This example addresses this technological gap by studying how to design DSP systems to better handle the increased complexity arising from the particular operations typical of MIMO processing algorithms.

Two hardware co-processors were designed, as shown in Figure 8: one for a Householder decomposition algorithm and one for a Greville pseudo-inverse algorithm. These hardware co-processors resulted in a simulated speedup of 2.7 for the Greville algorithm and between 4 and 4.7 for the Householder algorithm.
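For reference, a textbook Householder QR decomposition can be sketched as below. This is a floating-point, real-valued illustration in Python, not the fixed-point co-processor design described here (which operates on complex MIMO channel matrices); the function name `householder_qr` and the example matrix are assumptions chosen for illustration.

```python
import math

def householder_qr(a):
    """Textbook Householder QR for a real m x n matrix (list of rows), m >= n.

    Returns (q, r) with q orthogonal (m x m), r upper triangular (m x n),
    and q * r equal to a up to rounding error.
    """
    m, n = len(a), len(a[0])
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        # Column k, from the diagonal downwards.
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        # Reflection vector v = x - alpha * e1, with the sign chosen
        # to avoid cancellation.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # R <- H R, with H = I - 2 v v^T / (v^T v) acting on rows k..m-1.
        for j in range(n):
            dot = sum(v[t] * r[k + t][j] for t in range(len(v)))
            f = 2.0 * dot / vtv
            for t in range(len(v)):
                r[k + t][j] -= f * v[t]
        # Q <- Q H, acting on columns k..m-1.
        for i in range(m):
            dot = sum(q[i][k + t] * v[t] for t in range(len(v)))
            f = 2.0 * dot / vtv
            for t in range(len(v)):
                q[i][k + t] -= f * v[t]
    return q, r

# Hypothetical 3 x 2 example matrix.
example = [[2.0, -1.0], [1.0, 3.0], [2.0, 0.5]]
q_mat, r_mat = householder_qr(example)
for row in r_mat:
    print([round(t, 6) for t in row])
```

Each reflection is dominated by inner products and scaled vector updates, which is the kind of regular multiply-accumulate workload that lends itself to a dedicated hardware pipeline.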

For the design of the hardware accelerator, Synfora's Pico Extreme ESL tool was used (Synfora was recently acquired by Synopsys). The author had previous experience with the tool, and the task performed for this work was limited to architecture exploration and to finding which ASIC implementation would result in the best compromise between throughput, area, power, and ease of interfacing. The algorithms were written in floating point C code and then converted to fixed point C code by evaluating the impact of the hardware implementation on performance.
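The floating-point to fixed-point conversion step can be illustrated with a minimal sketch that quantizes values to a signed fixed-point format and measures the resulting error. The word length (16 bits), the candidate fraction widths, the sample values, and the helper names are all assumptions for illustration; the formats actually used in the co-processors are not stated here.

```python
def to_fixed(x, frac_bits=12, word_bits=16):
    """Quantize x to a signed fixed-point integer, saturating on overflow."""
    scale = 1 << frac_bits
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale)))

def to_float(q, frac_bits=12):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

def max_quantization_error(values, frac_bits=12, word_bits=16):
    """Largest absolute error introduced by a fixed-point round trip."""
    return max(
        abs(v - to_float(to_fixed(v, frac_bits, word_bits), frac_bits))
        for v in values
    )

# Hypothetical sample values, all within range for the formats tried below.
samples = [0.1, -0.7071, 0.3333, 1.25, -2.5]
for frac_bits in (8, 10, 12):
    err = max_quantization_error(samples, frac_bits)
    print(f"{frac_bits} fractional bits: max error {err:.6f}")
```

Sweeping the fraction width in this way gives a quick view of the precision lost at each candidate format before committing to a hardware word length.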

Pico Extreme is a very versatile tool, since it is structured as a series of logical steps: running an untimed sequential ANSI C program; single-to-multi-threaded transformations; hierarchical block-level resource sharing and scheduling; automatic retiming and pipelining; performance and throughput analysis; rapid exploration of the performance impact of loop unrolling, scheduling, and other optimizations; and RTL verification, among others. The flow methodology is shown in Figure 7.

Fig. 7. PICO Extreme design flow.

144 Wireless Communications and Networks – Recent Advances

Fig. 6. MOC hardware implementation on an Altera Cyclone II FPGA. a) Altera DE-2 with daughtercard: dual AD channels with 14-bit resolution and data rate up to 65 MSPS, and dual DA channels with 14-bit resolution and data rate up to 125 MSPS. b) MOC modulation output when the input is a stream of constant zeros. c) MOC modulation output when the input is driven by a PRBN sequence generator. d) MOC modulation output snapshot when the input is driven by a PRBN sequence generator.

While this seems to be a dream in which the system designer can implement a design by exploring architectures and trade-offs, then pushing a button and getting verified RTL as an output, the reality is that the learning curve of these tools is quite steep and the process is not as straightforward as it looks. Even though a very thorough architecture exploration can be performed, the designer still needs to think in terms of hardware when writing the C code in order to have the same effect as writing HDL RTL. The C code has to be written in terms of functional units, pipeline stages, memory implementations, operator sharing, and general hardware efficiency.

There are two basic methods to specify the design (Synfora, 2009). The number of clock cycles between iteration starts is called II (Initiation Interval) and the number of clock cycles to start all iterations is called MITI (Maximum Inter Task Interval). For this example, MITI can be as small as N\*II (where N is the number of loop iterations).
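The relationship between II and MITI can be made concrete with a short calculation. The helper names and the numeric values below (32 iterations, an II of 4, a 230-cycle target) are hypothetical and chosen only to illustrate the N\*II bound stated above.

```python
def min_miti(num_iterations, ii):
    """Smallest achievable MITI for a loop: N * II clock cycles."""
    return num_iterations * ii

def max_ii_for_target(num_iterations, target_miti):
    """Largest II that still satisfies N * II <= target MITI."""
    return target_miti // num_iterations

# Hypothetical loop: 32 iterations, a new iteration starting every 4 cycles.
n, ii = 32, 4
print("minimum MITI:", min_miti(n, ii))

# Hypothetical target of 230 cycles per task for the same loop.
print("largest II meeting the target:", max_ii_for_target(n, 230))
```

A looser MITI target allows a larger II, which in turn lets the tool share more functional units between iterations.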

The user can provide a target MITI, that is, a target maximum number of clock cycles taken per stage, and the tool will select from its library of high-speed components the combination that achieves higher levels of parallelism while still sharing resources and meeting the performance target.

Fig. 8. Processing pipeline for Greville and Householder decomposition methods.

To provide a tradeoff between complexity and speedup, different implementations with different target MITIs were generated. It was noted that as timing constraints tightened, hardware multipliers were switched from two-cycle to one-cycle and the number of multipliers increased to be able to complete complex multiplications (requiring three multiplies) in a single cycle.
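A complex multiplication can be carried out with three real multiplications instead of four by reusing partial sums, as in the well-known factorization sketched below. Whether the generated hardware uses exactly this arrangement is an assumption; the function name is illustrative.

```python
def complex_mul_3(a, b, c, d):
    """Compute (a + bi) * (c + di) using three real multiplications.

    Returns (real, imag). The three products k1, k2, k3 are independent,
    so with three multipliers they can be evaluated in parallel.
    """
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2

# (3 + 2i) * (1 + 4i) = -5 + 14i
print(complex_mul_3(3, 2, 1, 4))  # prints (-5, 14)
```

The cost of saving one multiplier is three extra additions, which are usually much cheaper in silicon area than a multiplier.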

MITI timing constraints were used to determine the lowest complexity implementation for each algorithm. The constraints within these ranges of target clock cycles were then used to produce a tradeoff between complexity and resulting speedup. The resulting ranges of target clock cycles were 230 to 330 for the Householder implementation and 130 to 210 for the Greville implementation.

The resulting speedup was calculated as the ratio of cycles on the DSP-only implementation to the cycles of the DSP-PPA implementation. The resulting silicon area was calculated based on the estimated number of gates given by Pico Extreme and using a characterized CMOS 65 nm technology library with an estimate of 854,000 gates per mm². This technology was selected because it is the one in which the DSP was manufactured, so it can provide an estimate of the growth in silicon area needed for the DSP to enable MIMO processing. A plot of speedup vs. complexity for both clocks and both simulators is shown in Figure 9.
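The two metrics reduce to simple ratios, sketched below. The gate density of 854,000 gates per mm² comes from the text; the cycle counts and the gate count in the example are hypothetical placeholders, not results from the study.

```python
GATES_PER_MM2 = 854_000  # 65 nm library estimate quoted in the text

def speedup(dsp_only_cycles, dsp_ppa_cycles):
    """Ratio of DSP-only cycles to DSP-PPA cycles."""
    return dsp_only_cycles / dsp_ppa_cycles

def area_mm2(gate_count):
    """Estimated silicon area for a given gate count."""
    return gate_count / GATES_PER_MM2

# Hypothetical inputs for illustration only.
print("speedup:", speedup(1000, 250))
print("area in mm^2:", area_mm2(427_000))
```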

The resulting maximum speedups were close to 2.75 for the Greville algorithm and between 4 and 4.7 for the Householder QR decomposition algorithm. This speedup would result in a large reduction (129 μs for the Greville implementation and 521 μs for the Householder implementation) in the amount of time required to compute the channel equalization matrices for an entire OFDM channel in MIMO communication. There is an upper limit to the speedup, however. Because the DSP is still required for some pre-processing operations, there is an asymptotic limit on the actual speedup achieved. Once the PPA unit is able to compute one stage of the processing pipeline in the same amount of time as the software pre-process, there is little added benefit from a faster clock or higher complexity. There is also no major advantage for the 1 GHz clock over the 500 MHz clock. While the slower clock would require the more complex implementations in order to compute faster than the DSP software, the savings in power consumption could outweigh the cost of higher complexity.
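The saturation effect can be illustrated with a simplified model, assuming the DSP pre-processing and the PPA stage overlap in a pipeline so that the slower of the two sets the steady-state time per stage. This model and all cycle counts below are assumptions for illustration, not the simulation used in the study.

```python
def pipelined_speedup(dsp_only_cycles, preprocess_cycles, ppa_cycles):
    """Speedup when the slower of the two overlapped stages sets the pace.

    Assumed model: DSP pre-processing and PPA computation run concurrently,
    so the time per stage is the maximum of the two.
    """
    return dsp_only_cycles / max(preprocess_cycles, ppa_cycles)

# Hypothetical numbers: 1000 cycles in software only, 250 cycles of
# pre-processing that must stay on the DSP.
for ppa_cycles in (500, 400, 300, 250, 200, 100):
    s = pipelined_speedup(1000, 250, ppa_cycles)
    print(f"PPA stage {ppa_cycles:>3} cycles -> speedup {s:.2f}")
```

Under this model the speedup stops improving once the PPA stage is as fast as the pre-processing, which matches the asymptotic behavior described above.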

Fig. 9. a) Speedup vs. Complexity for Householder implementation b) Speedup vs. Complexity for Greville implementation.
