**3. Advanced high‐tech fast Fourier transform algorithm**

due to the sequence calculations. **Figure 10** shows the Radix II lite structure when two input data are loaded for FFT calculations.

78 Fourier Transforms - High-tech Application and Current Trends

**Figure 9.** FFT processor with Radix II burst I/O architecture [13].

**Figure 10.** FFT processor with Radix II lite burst I/O architecture [13].

In Section 2, the fundamentals of the FFT were discussed and elaborated, and different FFT architectures were presented with details of their I/O configurations. This section presents an advanced FFT processor, focusing on a 1024‐point floating‐point parallel architecture for high‐performance applications.

## **3.1. Stage realization of 1024‐point parallel pipeline FFT structure**

The high‐tech FFT principle is based on the Radix II algorithm in floating‐point format to construct a 1024‐point FFT structure. **Figure 11** illustrates the main block diagram of the 1024‐point Radix II floating‐point parallel pipeline (FPP) FFT processor in detail.

As shown in **Figure 11**, there are six major subprocessor units in the high‐tech 1024‐point Radix II FPP‐FFT algorithm: shared memory, bit reverse, butterfly arithmetic, smart controller, ROM, and the address generator unit. The floating‐point input data enter the processor in a variable streaming configuration, which allows continuous streaming of input data and produces a continuous stream of output data. **Figure 12** shows the internal schematic of the pipeline butterfly algorithm with the parallel architecture at a glance.

**Figure 11.** 1024 point Radix II FPP‐FFT block diagram.

To enhance the speed of calculation in the Radix II butterfly algorithm, pipeline registers are located after each addition, subtraction, and multiplication subprocessor. Hence, the pipeline butterfly algorithm keeps the final result in a register to be transferred into the RAM on the next clock cycle. Additionally, the parallel architecture splits the data into real and imaginary parts and increases the speed of the FFT calculation by 50%. As a result of this design, the Radix II FPP‐FFT processor calculates a 1024‐point floating‐point FFT after exactly (*N*/2) log2 *N* + 11 clock pulses, which demonstrates the performance improvement over similar Radix II FFT architectures. The 11‐clock‐pulse delay is due to the 11 pipeline registers in the adder, subtraction, and multiplier units of a serial butterfly block. Additionally, the parallel design of the FFT algorithm decreases the calculation time significantly.
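The cycle count above can be checked with a short sketch. The function name and the separate `pipeline_depth` argument are ours; the formula, (*N*/2) log2 *N* stages of butterfly work plus 11 pipeline‐fill cycles, is taken from the text.

```python
import math

def fpp_fft_cycles(n: int, pipeline_depth: int = 11) -> int:
    """Clock cycles to finish an n-point FFT with one pipelined butterfly unit."""
    stages = int(math.log2(n))   # log2(N) butterfly stages
    butterflies = n // 2         # N/2 butterflies per stage
    return stages * butterflies + pipeline_depth

print(fpp_fft_cycles(1024))  # 10 * 512 + 11 = 5131
```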

**Figure 12.** Designed FPP Radix II butterfly structure.

The Radix II butterfly unit is responsible for calculating the complex butterfly equations output1 = input1 + *W <sup>k</sup>* × input2 and output2 = input1 − *W <sup>k</sup>* × input2. To calculate the butterfly equation, it is necessary to initialize the RAM in bit‐reversed format, and the external processor loads the data into the RAM. Since the butterfly equation deals with complex data, each butterfly requires four multiplication units (two for the real and two for the imaginary part) and six addition units (three for the real and three for the imaginary part). A fixed‐point implementation of such complex calculations does not satisfy high‐tech FFT applications due to the noise generated by round‐off, overflow, and coefficient quantization errors [14]. Consequently, in order to reduce the error as well as to achieve high‐resolution output, floating‐point adders and subtractors are used to replace the fixed‐point arithmetic units.
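A behavioural sketch of the butterfly equations, with the complex arithmetic written out explicitly so the four multiplications and six additions/subtractions mentioned above are visible. The function and tuple representation are illustrative, not the hardware description.

```python
def butterfly(in1, in2, w):
    """in1, in2, w are (real, imag) tuples; returns (out1, out2)."""
    # W^k * input2: four multiplications and two additions/subtractions
    pr = w[0] * in2[0] - w[1] * in2[1]
    pi = w[0] * in2[1] + w[1] * in2[0]
    # out1 = in1 + W^k*in2 and out2 = in1 - W^k*in2: four more additions
    out1 = (in1[0] + pr, in1[1] + pi)
    out2 = (in1[0] - pr, in1[1] - pi)
    return out1, out2

print(butterfly((1.0, 0.0), (1.0, 0.0), (1.0, 0.0)))  # ((2.0, 0.0), (0.0, 0.0))
```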

### *3.1.1. Floating point adder/subtraction*


Butterfly processor efficiency greatly depends on its arithmetic units, and the high‐speed floating‐point adder is the bottleneck of the butterfly calculation. Based on the IEEE‐754 standard [10] for floating‐point arithmetic, a 32‐bit data register is used to allocate the mantissa, exponent, and sign bit in portions of 23, 8, and 1 bits, respectively. The advantage of the floating‐point adder is that the biased exponent is applied to complete the calculation and avoid using unsigned values. Additionally, the floating‐point adder unit performs addition and subtraction using substantially the same hardware as used for the other floating‐point operations. This functionality minimizes the core area by minimizing the number of elements. Furthermore, each block of the floating‐point adder/subtractor performs its arithmetic calculation within only one clock cycle, which results in high throughput and low latency for the entire FFT processor. **Figure 13** shows the novel structure of the floating‐point adder, which is divided into four separate blocks; the detailed algorithm is presented in **Figure 14**.
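The 1/8/23 bit allocation can be illustrated by unpacking a single‐precision value into its fields. The function name is ours; the layout is the IEEE‐754 single‐precision format the text cites.

```python
import struct

def fields(x: float):
    """Split an IEEE-754 single into (sign, biased exponent, 23-bit fraction)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF      # 23 fraction bits (implicit leading 1)
    return sign, exponent, mantissa

print(fields(1.0))   # (0, 127, 0)
print(fields(-2.5))  # (1, 128, 2097152)
```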

**Figure 13.** Schematic diagram of the advanced floating‐point adder.

The purpose of having separate blocks is to divide the total critical path delay into three equal parts. Each block computes its arithmetic function within one clock cycle. However, the propagation delay associated with continuous assignments can increase the overall critical path delay and slow down the throughput. In the combinational design, the output of each stage depends on its input value at that time. The unique structure of the floating‐point adder enables the output result to be fed into the pipeline registers after every clock cycle. Hence, a sequential structure is applied to the overall pipelined add/subtract algorithm to combine the stages. The processing flow of the floating‐point addition/subtraction operation consists of comparison, alignment, addition/subtraction, and normalization stages.

The comparison stage compares the two input exponents and provides the result for the next stage. The comparison is made by two subtraction units, and the result is indicated by the *compare\_sign* bit.

**Figure 14.** Flowchart of the advanced floating‐point adder.

According to the result of the comparison stage, the alignment stage shifts the mantissa and transfers it to the adder/subtraction stage. The amount of shifting is selected by the comparison stage output. Consequently, each stage of the floating‐point adder algorithm is executed within one clock cycle. The floating‐point adder/subtraction unit achieves high speed and arithmetic efficiency at the cost of die area. The floating‐point arithmetic unit is designed to handle all numbers regardless of sign. As shown in **Figure 15**, there are logic gates involved in the stages, which cause higher propagation delay through the circuit.

Floating‐point numbers are generally stored in registers as normalized numbers, meaning that the most significant bit of the mantissa has a nonzero value. This method allows the most accurate value of a number to be stored in a register, and for this purpose a normalization stage is required. This unit is located after the add/sub stage, whose output may be an unnormalized result with leading zero digits. The normalization block discards the leading zeros from the MSB of the mantissa and shifts the mantissa so that its MSB is a digital one.
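The four stages described above (comparison, alignment, addition/subtraction, normalization) can be sketched behaviourally. This is not the hardware design: values are modelled as (sign, biased exponent, mantissa) tuples with the implicit leading 1 already folded into a 24‐bit integer mantissa, and rounding is ignored; all names are ours.

```python
def fp_add(a, b):
    """Staged model of floating-point add: compare, align, add/sub, normalize."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    # Stage 1: comparison -- order operands so the larger magnitude is first
    if ea < eb or (ea == eb and ma < mb):
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    # Stage 2: alignment -- shift the smaller mantissa right by the exponent gap
    mb >>= (ea - eb)
    # Stage 3: add or subtract depending on the operand signs
    m = ma + mb if sa == sb else ma - mb
    s, e = sa, ea
    # Stage 4: normalization -- bring the mantissa MSB back into bit 23
    while m >= (1 << 24):
        m >>= 1; e += 1
    while m and m < (1 << 23):
        m <<= 1; e -= 1
    return (s, e, m)

print(fp_add((0, 127, 1 << 23), (0, 127, 1 << 23)))  # (0, 128, 8388608), i.e. 2.0
```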

**Figure 15.** Addition/subtraction structure.

### *3.1.2. Floating‐point multiplier*


In a floating‐point multiplier, numbers are represented in the single‐precision format defined by the IEEE 754 standard: a normalized mantissa and an 8‐bit exponent. This structure develops a partial‐product reduction architecture for IEEE‐standard floating‐point multiplication, leading to a structured, high‐speed floating‐point multiplier. Shortening the data path is desirable because shorter paths require shorter wires and therefore support faster operation. This approach uses a reduction scheme based on a combinational unit connected in a parallel architecture. Implementing a floating‐point multiplier is simpler than a floating‐point adder, since it does not require an alignment stage. The processing flow of the floating‐point multiplication operation consists of a multiply stage and a normalization stage. **Figure 16** shows the overall block diagram of the floating‐point multiplier, while the flowchart of its functionality is shown in **Figure 17**.

In a floating‐point multiplier, the biased‐exponent format is applied to avoid having a negative exponent in the data format. Additionally, the multiplier is designed as a pipelined structure to enhance calculation speed: the initial result appears after the latency period, after which a result can be obtained on every clock cycle. The multiplier offers low latency and high throughput and is IEEE 754 compliant. This design allows a trade‐off between the clock frequency and the overall latency by adding pipeline stages.
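The multiply flow (no alignment stage, just multiply and normalize, with biased exponents) can be sketched in the same behavioural style. Values are modelled as (sign, biased exponent, mantissa) tuples with the implicit leading 1 folded into a 24‐bit integer mantissa; rounding is ignored and all names are ours.

```python
BIAS = 127  # IEEE-754 single-precision exponent bias

def fp_mul(a, b):
    """Staged model of floating-point multiply: sign, exponent, mantissa, normalize."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    s = sa ^ sb            # product sign
    e = ea + eb - BIAS     # biased exponents add, so remove one bias
    m = ma * mb            # 24x24 -> up to 48-bit product
    # Normalize: a product of two [1,2) mantissas lies in [1,4),
    # so at most one right shift is needed
    if m >= (1 << 47):
        m >>= 1; e += 1
    return (s, e, m >> 23)  # drop the extra 23 fraction bits (truncation)

print(fp_mul((0, 128, 1 << 23), (0, 128, 1 << 23)))  # (0, 129, 8388608), i.e. 4.0
```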

**Figure 16.** Floating‐point multiplier block diagram.

### *3.1.3. Smart controller structure*

The smart controller unit significantly affects the efficiency of the 1024‐point Radix II FPP‐FFT processor; a small die area can be achieved by designing a high‐performance controller for the FFT processor. In this architecture, the FFT controller is designed with pipeline capability. The global controller unit provides control signals to the different parts of the FFT processor. Additionally, several paths are switched between the data input and data output in the architecture, and the data path is controlled accordingly. To calculate the 1024‐point Radix II FFT, log2 *N* stages are necessary, which is 10 stages for 1024‐point data. Furthermore, each stage calculates *N*/2 butterflies, which is 512 butterfly calculations in this design. Hence, two counters cooperate with the controller to count the stage number of the processor and the number of butterfly calculations. The smart controller, in collaboration with the address generator unit, calculates the 1024‐point floating‐point FFT using only one butterfly structure. This functionality contributes greatly to saving power as well as die area. **Figure 18** shows the smart controller state machine, which controls the flow of the 1024‐point floating‐point Radix II FFT processor.
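The two counters described above amount to a nested loop: a log2(*N*) stage counter and an *N*/2 butterfly counter per stage, together driving the single shared butterfly unit. A sketch with illustrative names:

```python
import math

def schedule(n: int):
    """Count the controller's work: (stages, butterflies per stage, total passes)."""
    stages = int(math.log2(n))  # stage counter range: 10 for N = 1024
    per_stage = n // 2          # butterfly counter range: 512 for N = 1024
    ops = 0
    for stage in range(stages):
        for butterfly in range(per_stage):
            ops += 1            # one pass through the shared butterfly unit
    return stages, per_stage, ops

print(schedule(1024))  # (10, 512, 5120)
```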

**Figure 17.** Floating‐point multiplier flow chart.


**Figure 18.** Smart controller state machine.

There are several control signals in the smart controller that confirm the presence of correct output after finishing the current cycle of FFT calculation. The control signals transfer information among the RAM, ROM, butterfly preprocessor, and address generator. The designed controller operates according to the provided state machine (**Figure 18**) and makes high‐performance FFT calculation feasible for implementation. The controller unit is structured into subblocks, namely sequential and combinational units. The sequential unit is responsible for updating the state of the processor, while the combinational unit executes the states individually. The state machine waits for the processor core to complete the entire FFT calculation and then records the data points into the memory. The reset state is entered every time the reset input is asserted, which halts the entire calculation. The processor is reactivated after the reset input signal is removed.

### *3.1.4. Memory and address generator*

The address generator has a significant task in the Radix II FFT processor, since it delivers the addresses of the input/output data for each computational stage in an appropriate way. The address generator architecture consists of a ROM address generator, a Read address generator, and a Write address generator. The ROM address generator produces the reading address for the ROM module. This reading address is the address of the twiddle factor that must be fetched to feed the butterfly structure; the generator is designed to select the specific twiddle factor for the butterfly calculations. Meanwhile, the Write address generator is designed to save the result of the butterfly calculation in the proper location in the complex RAM. The proposed smart address generator provides the correct result for the next butterfly stage in the 1024‐point Radix II FFT calculation. The architecture of the Read address generator is similar to that of the Write address generator. After reading data from a certain address and feeding it to the butterfly, the butterfly saves the result in the previous address line. The reading RAM select control signal ensures the correct location of data in the complex RAM. Memory modules are used for storing the input and output results, with 1024 complex long words of 32‐bit registers. The implemented memory architecture is shown in **Figure 19**. The capacity of the memory is 1024 points of real and imaginary data. In the high‐tech implementation, a shared RAM architecture is designed and implemented in a single‐chip FFT processor. This design makes the Radix II FFT architecture entirely independent of the type of FPGA board, since it has an on‐board memory system. Furthermore, each complex RAM can save real and imaginary input data separately.
The module is programmed with a dual‐in‐line header to provide the appropriate location for storing the input and output results of each stage. It is composed of two delay memories and a multiplexer, which allow a straight‐through or crossed input‐output connection as required in the pipeline algorithm. The memory unit also contains the controller trigger. The controller, which is connected directly to the memory modules, takes responsibility for transferring data between the memory and arithmetic blocks, ensuring that no data conflict occurs during the complete FFT calculation. This is another advantage of the high‐tech smart memory modules, by which data can be read and written in the memory simultaneously without sending bubble data into the FFT processor.
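The bit‐reversed initialization of the RAM mentioned earlier in this section can be sketched directly: for 1024 points, a 10‐bit index has its bits mirrored to produce the load address. The function name is ours.

```python
def bit_reverse(index: int, bits: int = 10) -> int:
    """Mirror the low `bits` bits of index, e.g. 0b0000000001 -> 0b1000000000."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (index & 1)  # shift the next LSB into the output
        index >>= 1
    return out

print(bit_reverse(1, 10))  # 512
print(bit_reverse(3, 10))  # 768
```

Applying the function twice returns the original index, which is why a single in‐place reordering pass suffices before the butterfly stages begin.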

**Figure 19.** RAM internal architecture.


### **3.2. Advantages of 1024‐point parallel pipeline FFT structure**

The design of the 1024‐point Radix II FPP‐FFT processor was based on smart subblocks, and the result was optimized accordingly. The designed processor takes advantage of (i) shared memory to store the input and output data, which makes the system a single chip and hence reduces hardware complexity. Furthermore, (ii) every individual arithmetic unit is designed to operate within one clock cycle to increase the maximum clock frequency. Additionally, (iii) the butterfly structure has a parallel, pipelined architecture to minimize the delay caused by the FFT calculations. Finally, (iv) the strong controller, in collaboration with the address generator unit, removes the need for *N* butterfly units, since the Radix II calculation is carried out with one butterfly unit, which reduces power consumption and area and avoids system complexity. The high‐performance processor is implemented with an optimized architecture that enables the system to maintain a reasonable clock rate with a low latency of (*N*/2) log2 *N* + 11 clock cycles. The throughput of the operation is limited by the amount of available logic in the target device.
