**5.1 Architecture**

132 Wireless Communications and Networks – Recent Advances

Both the floating point and the fixed point solutions have to be compared against each other, and one of the most common measures of fixed point performance is the signal to quantization noise ratio (SQNR) (Rappaport, 2001).

Several tools are available to allow the evaluation of a fixed point implementation against a floating point implementation. One of the most important factors is the dynamic range of the signal in question. Floating point adapts to the signal dynamic range, but when the conversion is to be done, a good set of statistics has to be obtained in order to get the most out of the fixed point implementation. The probability density function of the signal gives insight into the range of values that occur as well as their frequencies of occurrence. It may be acceptable to saturate a signal if overshoots are infrequent; we need to carefully evaluate the penalty imposed by this saturation operation and the ripple effects that it could have. This process makes it possible to use just the number of bits needed to handle the signal most of the time, thus saving in terms of area, power and timing. In section 5.3 we discuss some of the small steps that have to be taken throughout the design in order to save power; as mentioned, power consumption savings start at the system level architecture and continue throughout the ASIC and FPGA methodologies.

Sometimes the processed signal can be normalized in order to have a unique, universal hardware block to handle the algorithm. It is very important to take into consideration the places where the arithmetic operations involve a growth in the number of bits assigned at each operation. For example, for every addition of two operands, one bit has to be appended to account for the overflow of adding both signals; if four signals are added, a growth of only two bits is needed. A multiplication, on the other hand, creates a larger growth in precision, since the number of bits in the multiplication result is the sum of the number of bits of the operands, and it also has to be taken into account whether the numbers are signed or unsigned.

The fixed point resolution at every stage needs to be adapted and maintained by the operations themselves, and specific processing needs to be done to generate a common format. These operations are truncation, rounding, saturation and wrapping, covered briefly for the SystemC data types in section 3.

A convenient framework for fixed point data types that can be incorporated into C/C++ algorithm simulations is the set of SystemC fixed point types available in the IEEE 1666™-2005 Open SystemC Language Reference Manual (SystemC, 2011). There are other alternatives, such as the Algorithmic C Datatypes (Mentor-Graphics, 2011), which claim to simulate much faster than the original SystemC types and are used in the ESL tool CatapultC. The ESL tool Pico Extreme uses the SystemC fixed point data types as the input to the high level synthesis process.

Matlab/Simulink also has a very nice framework to explore floating to fixed point conversion. When hardware will be generated directly from Simulink, it is very natural to alternate between floating point and fixed point for system level design. Designs that target Xilinx or Altera FPGAs can naturally use this flow and reuse the floating point testbench to generate the excitation signals, which can then be used within the Matlab/Simulink environment, for example in a Hardware in the Loop (HIL) configuration, or fed externally to the FPGA using an arbitrary waveform pattern generator.
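The bit-growth rules and the four format operations above can be sketched in plain Python. This is a behavioral model only; the helper names `add_growth`, `mul_growth` and `to_fixed` are invented for illustration and are not part of SystemC or any tool mentioned here:

```python
import math

def add_growth(bits, n_operands):
    """Result width when summing n same-width operands: one extra bit
    per doubling of the operand count (2 operands -> +1 bit, 4 -> +2)."""
    return bits + math.ceil(math.log2(n_operands))

def mul_growth(bits_a, bits_b):
    """Result width of a multiplication: the sum of the operand widths."""
    return bits_a + bits_b

def to_fixed(x, int_bits, frac_bits, rounding=True, saturate=True):
    """Quantize a real value onto a signed fixed point grid with
    int_bits integer bits (sign included) and frac_bits fraction bits."""
    total = int_bits + frac_bits
    scaled = x * (1 << frac_bits)
    q = int(round(scaled)) if rounding else math.floor(scaled)  # round vs. truncate
    lo, hi = -(1 << (total - 1)), (1 << (total - 1)) - 1
    if saturate:
        q = max(lo, min(hi, q))                # clip to the representable range
    else:
        q = ((q - lo) % (1 << total)) + lo     # two's complement wrap-around
    return q / (1 << frac_bits)

# A 6-bit signed format with 4 fraction bits covers [-2.0, 1.9375]:
print(to_fixed(0.3, 2, 4))                  # rounds 0.3 to 5/16 = 0.3125
print(to_fixed(3.0, 2, 4))                  # out of range: saturates to 1.9375
print(to_fixed(3.0, 2, 4, saturate=False))  # out of range: wraps to -1.0
```

In an actual flow the same behavior is obtained by selecting the quantization and overflow modes of a SystemC `sc_fixed` type rather than writing the arithmetic by hand.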

In this section we give an overview of the importance of architecture in RTL design. Examples of different architectures for complex multipliers, finite impulse response (FIR) filters, fast Fourier transforms (FFT) and Turbo Codes will be given, comparing their complexity, throughput, maximum frequency of operation and power consumption.

When an efficient architecture is sought, each gate, each register, each adder and each multiplier counts. Sometimes it is a good approximation at the system level to count the number of arithmetic operations to get an initial estimate of the silicon area that the algorithm will use. While this is a crude approximation, it is a very good starting point for allocating resources on the System on a Chip (SoC). Many companies have spreadsheets that contain average values for different operations in a particular technology, based on hundreds of designs. The architecture task is to find the optimum implementation of a particular algorithm while meeting all of the design parameters referred to above.
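The spreadsheet-style estimate can be sketched as follows. The gate counts per operator here are invented placeholders; real values would come from measured data for the target library and technology node:

```python
# Hypothetical average gate counts per operator (placeholders, not real data;
# a company spreadsheet would hold measured values for the target technology).
GATES_PER_OP = {"add16": 450, "mult16x16": 6500, "reg16": 120}

def estimate_gates(op_counts, table=GATES_PER_OP):
    """Crude area estimate: sum over operators of (count * average gates)."""
    return sum(count * table[op] for op, count in op_counts.items())

# e.g. a 16-tap transversal FIR: 16 multipliers, 15 adders, 16 registers
fir16 = {"mult16x16": 16, "add16": 15, "reg16": 16}
print(estimate_gates(fir16))  # 112670 gates with the placeholder table
```

Such an estimate ignores control logic, routing and memories, which is exactly why it is only a starting point for SoC resource allocation, not a sign-off number.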

When an algorithm is implemented, what will be the final underlying implementation technology? ASIC or FPGA? Or will it be driven by software, with just primitive building blocks used as coprocessors or hardware accelerators? Whenever a product needs to be designed in an application area that continues to grow and generate new algorithms and implementations, such as video processing, sometimes an analytics engine must be architected that provides co-processing or hardware acceleration by implementing the most common image processing algorithms. This idea can be applied to any communications or signal processing system, where a solution could include a common set of hardware accelerators or coprocessors that realize functions that are basic and will not easily change. One very good example is the TMS320TCI6482 Fixed-Point Digital Signal Processor (Texas-Instruments, 2011), which is used for third generation mobile wireless infrastructure applications and contains three important coprocessors: a Rake Search Accelerator, an Enhanced Viterbi Decoder Coprocessor and an Enhanced Turbo Decoder Coprocessor.

So the question is: when implementing a particular algorithm, how can we architect it such that it is efficient in all senses (area, power, timing) as well as versatile? The answer depends on the application. That is why hardware/software partitioning is a very important stage that has to be developed very carefully, thinking ahead to possible application scenarios. In some cases there is no option, and the algorithm has to be implemented in hardware; otherwise the throughput and performance requirements may not be met. Let's briefly explore some practical examples of blocks used in wireless communication systems and brainstorm on which architectures may be suitable.

#### *Finite Impulse Response Filters*

An FIR filter implementation can be thought of as a trivial task, since it involves only the weighted sum of a series of delayed versions of an input signal. While it seems very simple, there are several tradeoffs when selecting the optimum architecture for implementation. For an FIR filter we have, for example, the following textbook structures: transversal, linear phase, fast convolution, frequency sampling, and cascade (Ifeachor, 1993). When implementing on FPGAs, for example, we find the following forms: standard, transpose, systolic, and systolic with pipelined multipliers (Ascent, 2010).
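The standard (transversal) and transpose forms compute the same output with different register placement. The following behavioral sketch in plain Python (function names invented here; one loop iteration models one clock cycle) shows the equivalence:

```python
def fir_direct(coeffs, samples):
    """Standard (transversal) form: a tapped delay line on the input,
    followed by multipliers and an adder chain."""
    delay = [0.0] * len(coeffs)
    out = []
    for x in samples:
        delay = [x] + delay[:-1]          # shift the new sample into the line
        out.append(sum(c * d for c, d in zip(coeffs, delay)))
    return out

def fir_transpose(coeffs, samples):
    """Transpose form: registers sit between the adders, so the critical
    path is one multiplier plus one adder regardless of filter length."""
    regs = [0.0] * len(coeffs)
    out = []
    for x in samples:
        # each register takes x*coeff plus the next register's old value
        regs = [x * c + r for c, r in zip(coeffs, regs[1:] + [0.0])]
        out.append(regs[0])
    return out

taps = [1.0, 2.0, 3.0]
data = [1.0, 2.0, 0.0, 0.0]
print(fir_direct(taps, data))     # [1.0, 4.0, 7.0, 6.0]
print(fir_transpose(taps, data))  # identical output, different hardware
```

The same input/output behavior maps to very different area and timing once synthesized, which is the whole point of the architecture selection discussed above.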

Most FPGA architectures are enhanced to make the implementation of particular DSP algorithms more efficient, and the architecture selected should fit the most efficient configuration for a particular FPGA vendor or family. If we are targeting an ASIC, then the architecture will differ depending on the library provided by the technology vendor. When implementing an FIR, or any other type of filter or signal processing algorithm, we need to evaluate the underlying implementation technology in order to tune the structure for efficient and optimum operation.

#### *Turbo Codes*

One interesting example concerns Turbo Codes: while the pseudo-random interleaver is supposed to be "random", patterns have been defined for how the data can be accessed efficiently. Some interleavers are contention free, while others have contentions, depending on the standard. For example, one of the major differences between the third generation wireless standards, namely 3GPP (W-CDMA) and 3GPP-2 (CDMA2000), is the type of interleaver generator used; this means that, to a certain degree, it is possible to design a Turbo Coder/Decoder that implements both standards.

The purpose of an efficient interleaver hardware implementation is to have different processing units accessing different memory banks in parallel. Some examples of the search for common hardware that could potentially be used for different standards are shown in (Yang, Yuming, Goel, & Cavallaro, 2008), (Borrayo-Sandoval, Parra-Michel, Gonzalez-Perez, Printzen, & Feregrino-Uribe, 2009) and (Abdel-Hamid, Fahmy, Khairy, & Shalash, 2011). The architecture is a function of the standard, and it is sometimes very difficult to find a "one architecture fits all" solution; in some cases, to make the interleaver compatible with multiple standards, on-the-fly generation is the best approach, but irregularities or bubbles can be inserted into the overall computation. This is one of the challenges in mobile wireless: sometimes it is easier to implement completely different subsystems, each performing one particular standard efficiently, than to have a single architecture that can perform them all. This is the case for second generation GSM (Global System for Mobile Communications, originally Groupe Spécial Mobile) and third generation W-CDMA (wideband code division multiple access), where minimal reusability could be achieved, and to a certain extent two complete wireless modems are implemented, one for each standard.
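As a concrete illustration of contention-free parallel access, the sketch below uses the quadratic permutation polynomial (QPP) interleaver adopted for LTE turbo codes; it is chosen here purely as a simple, well-known contention-free example and is not one of the 3G interleavers discussed above:

```python
def qpp(K, f1, f2):
    """Quadratic permutation polynomial interleaver:
    pi(i) = (f1*i + f2*i^2) mod K."""
    return [(f1 * i + f2 * i * i) % K for i in range(K)]

def contention_free(perm, n_banks):
    """Split the block into n_banks equal windows, one processor each, with
    bank = address // window. Check that at every step the parallel
    processors hit n_banks distinct banks (no memory conflict)."""
    K = len(perm)
    w = K // n_banks                       # window length per processor
    for j in range(w):
        banks = {perm[j + m * w] // w for m in range(n_banks)}
        if len(banks) != n_banks:          # collision: two units on one bank
            return False
    return True

pi = qpp(40, 3, 10)                        # K=40 parameters from the LTE tables
print(sorted(pi) == list(range(40)))       # a valid permutation
print(contention_free(pi, 4))              # 4 processors, no bank conflicts
```

Because each of the four processing units lands in a different bank at every step, the four memory accesses can proceed in parallel without arbitration, which is exactly the property the interleaver designs cited above are after.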
