Trans\_Proc: A Processor to Implement the Linear Transformations on the Image and Signal Processing and Its Future Scope

*Atri Sanyal and Amitabha Sinha*

## **Abstract**

We present Trans\_Proc, a reconfigurable generic processor that can execute operations related to linear transformations such as FFT, FDCT or FDWT. A graph theoretic lemma is used to establish the applicability of such a processor to the parallel, flow graph based operations found in these linear transformations. The architecture level design and the processing element level design are presented. The primitive instruction set and the control signals implementing it are proposed. A detailed simulation validating the correctness of the data calculation and routing operations at the PE level and the architecture level is carried out using Xilinx Vivado WebPACK. The results for size, power and timing are presented.

**Keywords:** Transform processor, Graph Theoretic Concept, Design, Primitive Instruction Set, Simulation

## **1. Introduction**

In this paper we propose an efficient architecture for implementing frequently used, computationally intensive linear transformations in signal and image processing. Linear transformations such as FFT, FDCT and FDWT are computationally intensive and critical to these applications. Papers proposing designs in this domain are mainly of three types. The first category proposes architectures that implement only a single type of linear transformation, such as FFT or FDCT [1–14]. Since the primary focus of these implementations is speed, they are mainly implemented as ASICs. They use a variety of techniques to decrease the number of computationally intensive operations, including multiplierless designs, high speed pipelines, data forwarding and lifting steps, which greatly decrease the computational complexity and increase the speed of FFT or FDCT algorithms. The second category proposes processors or architectures that can implement a number of general linear transformations such as FFT, FDCT and FDWT. Since these architectures contain basic building blocks common to all of these transformations and must reconfigure themselves before executing a different transformation, they are mainly implemented on reconfigurable platforms such as FPGAs [5–17]. Our paper proposes a processor of this category. The third category discusses the implementation of more generic image and signal processing applications [18–20].

The data flow graph is used extensively in the literature to describe linear transforms. It was proved earlier in [21, 22], using graph theory and mathematical induction, that a MIMD processor whose processing elements are connected like a completely connected equi-vertex bipartite graph can reproduce the actions shown in the flow graph of transformations such as FFT, FDCT and FDWT of any arbitrary size. This confirms that a processor with this type of architecture can execute transforms represented by the flow graph method.

The architecture of the processing element and the overall architecture discussed in [21, 22] are described thoroughly here. The architecture of the control unit and the data exchange procedure between the main CPU and its memory and this processor and its local memory are discussed in detail. The instruction sets for the processing element and for the overall processor are described along with their corresponding control lines. Representative examples from each category of the instruction set are considered, and the stepwise control signals that implement them are discussed. The architecture must be reconfigurable, since it is intended to implement several transforms on its own. The architecture is coded in VHDL, then synthesized and simulated using Xilinx Vivado. The processor is simulated in three stages to verify its operations. First, the components inside the processing element (floating point adder and multiplier) are simulated and tested. Then the longest sequence of execution required in Loeffler's FDCT algorithm is tested for each processing element. Finally, the overall architecture and the data routing between the processing elements are simulated and tested. The synthesis results showing the size of the architecture at LUT level, together with the power and timing results, are discussed.

The rest of the paper is organized as follows. Section 2 discusses the theoretical background of the architecture. Section 3 discusses the implementation of the processor in a modular way: the overall architecture of the processor and the implementable CU are presented, followed by the processing element level architecture, the instruction set, and the control signals implementing some representative examples of the instruction set. Section 4 discusses the step by step synthesis and the simulation results in terms of speed, timing and size. Finally, Section 5 discusses the conclusion and future scope of the work.

## **2. Proof of the architecture using graph theoretic approach**

The theoretical proof using mathematical induction is given in detail in [21]. Here we present only a brief outline of the argument.

*Trans\_Proc: A Processor to Implement the Linear Transformations on the Image and Signal… DOI: http://dx.doi.org/10.5772/intechopen.99122*

The flow graph shown in the picture [23] is a widely used method of calculating transformations such as FFT, FDCT and FDWT. In FFT or FDCT, the flow graph looks physically like an equi-vertex k-partite graph, where k is the number of stages, the vertices are processing elements and the connections among the processing elements are the edges. Since the stages are mutually exclusive, an equi-vertex k-partite architecture can be reproduced by a fully connected equi-vertex bipartite graph if there is a one-to-one mapping between the vertex set of every stage of the k-partite graph and the vertex sets of the two parts of the bipartite graph. Any algorithm described by a flow graph of the first category can therefore be described by a graph of the latter category. From this argument it follows that an architecture of the second category is suitable as a transform processor, and its reconfigurability makes it easy to switch from one transform to another, making it a general transform processor. The original architecture requires two sets of processing elements, one in each part, and a fully connected bidirectional communication network between them. The hardware cost can be largely reduced by instead using one set of processing elements and one set of registers, a fully connected feed-forward network from the registers to the processing elements, and a single feedback network connecting each processing element to its corresponding register. The data exchange between two processing elements Pi → Pj can then be rewritten as Pi → Ri → Pj. This takes two clock pulses rather than one, but the hardware cost is significantly reduced.
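The two-clock exchange Pi → Ri → Pj can be illustrated with a small behavioural model. The following is only a sketch in Python, not the VHDL design: the function names `feedback`, `feed_forward` and `exchange` are hypothetical, each PE is reduced to a single held value, and indices are 0-based.

```python
# Behavioural sketch of the register-based exchange P_i -> R_i -> P_j.
# Assumptions: N PEs and N registers, one value per PE; all names are
# illustrative and do not come from the original design.

N = 8

def feedback(pe_out):
    """Clock 1: each PE writes its output to its own register (P_i -> R_i)."""
    return list(pe_out)

def feed_forward(registers, source_of):
    """Clock 2: PE j reads the register selected for it (R_i -> P_j).

    source_of[j] is the index i of the register feeding PE j. Any i is
    allowed because the feed-forward network is fully connected.
    """
    return [registers[source_of[j]] for j in range(len(source_of))]

def exchange(pe_out, source_of):
    """Full exchange; returns the new PE inputs and the clock pulses used."""
    registers = feedback(pe_out)
    return feed_forward(registers, source_of), 2

# Example routing: PE j receives the value held by PE (N-1-j), i.e. the
# mirrored pairing of the first PE with the last, the second with the
# second to last, and so on.
pe_values = [10, 11, 12, 13, 14, 15, 16, 17]
mirror = [N - 1 - j for j in range(N)]
new_inputs, clocks = exchange(pe_values, mirror)
print(new_inputs, clocks)
```

Because every stage of the flow graph is served by the same PEs and registers, moving to the next stage only requires a different `source_of` table, which is the role reconfiguration plays in this model.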

## **3. Implementation of the architecture**

## **3.1 Implementation of the overall architecture design**

The fully connected feed-forward path described in the previous section is built from 8 multiplexers of size 8 x 3. Each of them can take its input from any of the eight registers and sends its output to one Processing Element (PE). The select lines of each multiplexer choose the input register whose value is loaded into the PE. This constitutes a simple but effective feed-forward communication path between the registers and the PEs. The feedback line is implemented by a 1x2 demultiplexer and 2x1 multiplexer pair, which directs the output of the PE to the input line of the corresponding register. The same pair can also load data from memory at the beginning and store the results once the calculation is complete. The current design is examined with 8 such stages, mainly with a view to implementing one stage of an FDCT algorithm. The architecture uses 8 bit register sets to latch values entering or leaving the processing elements.
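As a sketch of how the select lines configure the feed-forward path, the model below assumes that "8 x 3" denotes a multiplexer with eight data inputs and three select lines. The function names and the packing of all select values into a single 24-bit configuration word are hypothetical and are not taken from the design.

```python
# Sketch of the feed-forward network: 8 multiplexers, one per PE.
# Assumption: each multiplexer has 8 data inputs (registers R0..R7) and a
# 3-bit select value. Names and the packed word are illustrative only.

NUM_PE = 8
SELECT_BITS = 3

def mux8(registers, select):
    """Return the register value chosen by a 3-bit select value (0..7)."""
    if not 0 <= select < (1 << SELECT_BITS):
        raise ValueError("select must fit in 3 bits")
    return registers[select]

def route(registers, selects):
    """Apply one select value per multiplexer; output k feeds PE k."""
    return [mux8(registers, s) for s in selects]

def pack_selects(selects):
    """Pack the 8 select values into one 24-bit configuration word."""
    word = 0
    for k, s in enumerate(selects):
        word |= (s & 0b111) << (SELECT_BITS * k)
    return word

registers = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
selects = [7, 6, 5, 4, 3, 2, 1, 0]   # PE k reads register 7-k
pe_inputs = route(registers, selects)
config = pack_selects(selects)
print(pe_inputs, f"{config:024b}")
```

Under this assumption, one routing pattern for all eight PEs is fully described by 8 × 3 = 24 select bits.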

## **3.2 Implementation of the processing element inside the processor**

The processing element (PE) is designed around the types of operations needed to compute these transformations. Most of the operations are floating point operations, so each PE contains one floating point adder/subtractor and one floating point multiplier. We have used commonly available floating point adder and multiplier designs in this PE, leaving the evaluation of state of the art designs that could improve the performance of the adder and multiplier for future work. Two registers latch the source data of the adder, and two similar registers latch the source data of the multiplier. The results of the adder and the multiplier are stored in two further registers. The PE also contains multiplexers and demultiplexers to route data from one internal register to another and to send and receive data to and from the registers outside the PE.
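The register layout described above can be summarised in a small behavioural model. This is a minimal sketch under stated assumptions: the class and register names are invented for illustration, and Python floats (double precision) stand in for whatever floating point format the hardware adder and multiplier use.

```python
# Behavioural sketch of one processing element (PE).
# Assumptions: register names are hypothetical; the text only states that the
# adder and the multiplier each have two source registers and one result
# register, with internal multiplexers routing data between registers.
from dataclasses import dataclass

@dataclass
class ProcessingElement:
    add_src1: float = 0.0   # adder/subtractor source register 1
    add_src2: float = 0.0   # adder/subtractor source register 2
    add_out: float = 0.0    # adder/subtractor result register
    mul_src1: float = 0.0   # multiplier source register 1
    mul_src2: float = 0.0   # multiplier source register 2
    mul_out: float = 0.0    # multiplier result register

    def add(self, subtract: bool = False) -> None:
        """Fire the adder/subtractor on its two source registers."""
        rhs = -self.add_src2 if subtract else self.add_src2
        self.add_out = self.add_src1 + rhs

    def multiply(self) -> None:
        """Fire the multiplier on its two source registers."""
        self.mul_out = self.mul_src1 * self.mul_src2

    def move(self, src: str, dst: str) -> None:
        """Internal routing: copy one register into another."""
        setattr(self, dst, getattr(self, src))

pe = ProcessingElement(add_src1=1.5, add_src2=2.25, mul_src1=1.5, mul_src2=4.0)
pe.add()
pe.multiply()
pe.move("mul_out", "add_src1")   # forward the product to an adder input
print(pe.add_out, pe.mul_out, pe.add_src1)
```

The `move` step corresponds to the internal multiplexer and demultiplexer routing, which lets a product be forwarded to the adder without leaving the PE.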


**Table 1** below lists the routing control signals and their functions for the processor, and the routing and activation control signals and their functions for the processing elements (PEs).

## **3.3 Primitive instruction set of the processor**

The primitive instruction set formulated for the processor contains two main categories. Category A covers the instructions that implement the routing operations of the processor outside the PEs, and category B covers the arithmetic calculation and data movement operations inside a PE.

A. Data Loading/Routing Operations Outside PE:

	- 1. Load [D0-D3][PE i]: load data from memory outside the PE into any of the registers D0-D3 of PE i.

## *Recent Remote Sensing Sensor Applications - Satellites and Unmanned Aerial Vehicles (UAVs)*


#### **Table 1.**

*Names of the control signals, their values and their functions used in Trans\_Proc.*


Next, **Table 2** below gives the total number of instructions per PE and for the overall architecture, for each group as well as overall:

#### **Table 2.**

*Total number of instructions in the different groups.*

We can see that the total number of instructions is 472, of which 48 are for each of the 8 PEs (8 × 48 = 384) and 88 are for operations outside the PEs. The control signals of the different components of the processor and their functions are specified in the previous table. From these we can specify the sequence of control signals that must be activated to implement each instruction of the instruction set. **Table 3** shows one representative instruction for each group, together with the corresponding control signals and their sequence of activation. The table listing all the instructions can be found in the appendix.

## **3.4 Implementation of operations using the instruction set of the architecture**

If we take the flow graph of the FDCT algorithm shown in the figure as an example, we can see that the algorithm is divided into 4 stages and each stage contains 8 PEs executing operations of three types: floating point addition/subtraction, floating point multiplication, and evaluation of floating point expressions of the type C1\*X + C2\*Y. **Table 4** below gives the stage-wise operation schedule of the 8 PEs, specifying what each PE does in each of the 4 stages:


#### **Table 3.**

*Sequence of operations for implementing C1\*X + C2\*Y.*


#### **Table 4.**

*Stage-wise operation schedule of the 8 PEs performing the FDCT algorithm.*

As representative examples, we list the instructions required to execute three cases: (a) the stage 1 operation of PE 5, (b) the stage 4 operation of PE 7 and (c) the stage 2 operations of PE 6. These three cases exhibit the three categories of floating point operations described previously (**Table 5**).




#### **Table 5.**

*List of instructions for (a) the stage 1 operation of PE 5, (b) the stage 4 operation of PE 7 and (c) the stage 2 operations of PE 6.*

## **3.5 Implementation of the control unit of the processor**

A hardwired implementation of the correct control signals, their values and their sequence for all 472 instructions is very difficult to realize physically. In this work we have developed only the instructions required to prove the correctness of the design, which are of three types: (1) instructions inside the PE that perform a floating point addition and a floating point multiplication; (2) instructions that implement the longest sequence of the FDCT algorithm, C1\*X + C2\*Y, inside one PE for a single stage; and (3) the same implementation of stage 2 for all PEs, with the output values routed randomly to prove the correctness of the implementation. The control unit is therefore only partially developed. A programming based approach is required to develop a full assembler that generates all the instructions. The Trans\_Proc design presented in this paper is thus incomplete, but it shows that the processor can serve as an FPGA-based hardware co-processor for all the transforms once the CU is completed and can generate all the required instructions.

## **4. Simulation and synthesis**

The first two simulations show the correct operation of the floating point multiplier and the floating point adder/subtractor. The floating point multiplier has considerable scope for improvement, whereas the floating point adder/subtractor is largely state of the art.


Here we see the longest sequence of multiplications and addition inside a single PE: Pein1xCein5 + Pein1xCEin6 = 2.0x0.5 + 4.0x8.0 = 33.0.
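The same sequence can be written as a short trace of micro-operations and checked numerically. The sketch below is an illustration only: the operation encoding, the register names and the temporary registers `T1` and `T2` are hypothetical. It assumes one multiplier and one adder per PE, as in Section 3.2, so the two products are formed one after the other before the addition.

```python
# Micro-operation trace for C1*X + C2*Y inside one PE.
# Assumptions: the (op, dst, src1, src2) encoding and the register names are
# illustrative; the temporaries T1/T2 stand in for the internal result and
# source registers that hold the products in the real PE.

def run(trace, regs):
    """Execute the trace in order and return the updated register file."""
    regs = dict(regs)
    for op, dst, a, b in trace:
        if op == "MUL":
            regs[dst] = regs[a] * regs[b]
        elif op == "ADD":
            regs[dst] = regs[a] + regs[b]
        else:
            raise ValueError(f"unknown operation {op}")
    return regs

# Operand values taken from the simulation described above.
inputs = {"X": 2.0, "C1": 0.5, "Y": 4.0, "C2": 8.0}
trace = [
    ("MUL", "T1", "C1", "X"),    # first product:  0.5 * 2.0
    ("MUL", "T2", "C2", "Y"),    # second product: 8.0 * 4.0
    ("ADD", "OUT", "T1", "T2"),  # sum of the two products
]
result = run(trace, inputs)["OUT"]
print(result)  # 33.0
```

The trace contains three arithmetic steps; the number of clock pulses in hardware also depends on the register moves between them, which this sketch does not model.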



Here we see the routing correctness of all the PEs of Trans\_Proc, according to the following flow graph shown in tabular format:

The initial values are PE1 = 1, PE2 = 2, PE3 = 3, PE4 = 4, PE5 = 5, PE6 = 6, PE7 = 7 and PE8 = 8, with C1 = 2 and C2 = 8.

- PE1 = PE1 x C1 + PE8 x C2 = 66
- PE2 = PE2 x C1 + PE7 x C2 = 58
- PE3 = PE3 x C1 + PE6 x C2 = 50
- PE4 = PE4 x C1 + PE5 x C2 = 42
- PE5 = PE4 x C2 + PE5 x C1 = 34
- PE6 = PE3 x C2 + PE6 x C1 = 26
- PE7 = PE2 x C2 + PE7 x C1 = 1
- PE8 = PE1 x C2 + PE8 x C1 = 10

This is how the routing correctness among the different PEs of the processor is tested, and we can see that it works.

Once the behavioral simulation has been shown to be correct, we present the results of the synthesis of the entire processor performed with Xilinx Vivado and comment on them (**Tables 6**–**9**).

The overall utilization report gives an idea of the size of the processor, and the number of primitive blocks used in the processor is also given. Note that this study does not include the utilization of the CU, since it is incomplete; it will be treated as a separate design in a future study. The total on-chip power, with its two components, dynamic and static, also suggests an implementable design. The timing report shows the setup timing, with a WPWS of 4.650 ns. We calculated by hand that an instruction involving the floating point operations inside the PE takes at most 4 clock pulses. This gives a maximum clock frequency of 292 MHz.

#### **Table 6.**

*Summary of utilization report.*

#### **Table 7.**

*Utilization report of primitive blocks.*

#### **Table 9.**

*Timing report summary.*

## **5. A discussion on the memory and instruction exchange between the main processor and Trans\_Proc**


Here we describe the data transfer procedure between the main processor and Trans\_Proc, which will be implemented as future work. The procedure uses a linear image RAM (LIRAM) to store the primary data. Two data registers are used as buffers for data going into and out of Trans\_Proc. One counter counts the number of blocks sent to Trans\_Proc, and one address register is used to store each block of the transformed image back to LIRAM.
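Since this interface is left as future work, the following is only a speculative sketch of how the described components could interact. The block size of 8 values, the in-place write-back, the stand-in transform function and all names are assumptions made for illustration.

```python
# Speculative sketch of the block transfer between LIRAM and Trans_Proc.
# Assumptions: blocks of 8 values, in-place write-back, and a placeholder
# transform. None of these details are specified in the chapter.

BLOCK = 8

def transfer(liram, transform):
    """Stream LIRAM through the co-processor block by block, in place."""
    block_counter = 0   # counts the blocks sent to Trans_Proc
    address_reg = 0     # address where the transformed block is written back
    while address_reg + BLOCK <= len(liram):
        in_buffer = liram[address_reg:address_reg + BLOCK]    # inbound data register
        out_buffer = transform(in_buffer)                     # outbound data register
        liram[address_reg:address_reg + BLOCK] = out_buffer   # store back to LIRAM
        address_reg += BLOCK
        block_counter += 1
    return block_counter

image = list(range(16))
blocks = transfer(image, lambda blk: [2 * v for v in blk])
print(blocks, image)
```

Here the doubling function merely stands in for a transform executed on Trans\_Proc; any function that maps a block of 8 values to 8 values could be substituted.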

## **Author details**

Atri Sanyal<sup>1</sup> \* and Amitabha Sinha<sup>2</sup>

1 Amity Institute of Information Technology, Amity University, Kolkata, West Bengal, India

2 UGC Adjunct Faculty, Maulana Abul Kalam Azad University of Technology, West Bengal, India

\*Address all correspondence to: atri.sanyal@gmail.com

© 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## **References**

[1] Po-Chih Tseng et al, "Reconfigurable discrete cosine transform processor for object-based video signal processing", in ISCAS '04. Proceedings of the 2004 International Symposium on Circuits and System, 2004.

[2] Po-Chih Tseng, Chao-Tsung Huang, Liang-Gee Chen, "Reconfigurable Discrete Wavelet Transform Processor for Heterogeneous Reconfigurable Multimedia Systems", Journal of VLSI signal processing systems for signal, image and video technology, 2005.

[3] Gregory W. Donohoe, "The Fast Fourier Transform on a Reconfigurable Processor", Proc. NASA Earth Sciences Technology Conference, Pasadena, CA, June 11-13, 2002

[4] Srivatsava P S V, Sarada V, "Reconfigurable MDC Architecture Based FFT Processor", International Journal of Engineering Research & Technology, 2014

[5] K. Joe Hass David F. Cox, "Transform Processing on a Reconfigurable Data Path Processor", 7th NASA Symposium on VLSI Design 1998

[6] V. Sarada, T. Vigneswaran, "Reconfigurable FFT Processor – A Broader Perspective Survey", International Journal of Engineering and Technology (IJET) 2013

[7] Asadollah Shahbahrami, Mahmood Ahmadi, Stephan Wong, Koen Bertels, "A New Approach to Implement Discrete Wavelet Transform using Collaboration of Reconfigurable Elements", Proc. 2009 International Conference on Reconfigurable Computing and FPGAs

[8] Konstantinos E. Manolopoulos, Konstantinos G. Nakos, Dionysios I. Reisis and Nikolaos G. Vlassopoulos, "Reconfigurable Fast Fourier Transform Architecture for Orthogonal Frequency Division Multiplexing Systems", 2003, available: https://pdfs.semanticscholar.org/dd5c/263725af00e5dd4d42d573c269f57d917c8d.pdf?\_ga=2.84059166.640751657.1573804365-914446569.1569299704

[9] Amitabha Sinha, Mitrava Sarkar, Soumojit Acharyya, Suranjan Chakraborty, "A Novel Reconfigurable Architecture of a DSP Processor for Efficient Mapping of DSP Functions using Field Programmable DSP Arrays", ACM SIGARCH Computer Architecture News Vol. 41, No. 2, May 2013

[10] Sumit Wadekar, Laxman P. Thakare, Dr. A.Y. Deshmukh, "Reconfigurable N-Point FFT Processor Design For OFDM System", International Journal of Engineering Research and General Science Volume 3, Issue 2, March-April, 2015

[11] Alexey Petrovsky, Maxim Rodionov and Alexander Petrovsky, "Dynamic Reconfigurable on the Lifting Steps Wavelet Packet Processor with Frame-Based Psychoacoustic Optimized Time-Frequency Tiling for Real-Time Audio Applications", Design and Architectures for Digital Signal Processing, available: http://www.intechopen.com/books/ design-and-architectures-fordigitalsignal-processing2013.

[12] Sharon Thomas & V Sarada, "Design of Reconfigurable FFT Processor With Reduced Area And Power", ITSI Transactions on Electrical and Electronics Engineering (ITSI-TEEE), 2013.

[13] Uma Rajaram, "Design Of Fir Filter For Adaptive Noise Cancellation Using Context Switching Reconfigurable EHW Architecture", Ph.D dissertation, Anna University, Chennai, 2009, available: https://shodhganga.inflibnet.ac.in/handle/10603/27245


[14] P. S. Reddy, S. Mopuri and A. Acharyya, "A Reconfigurable High Speed Architecture Design for Discrete Hilbert Transform," in *IEEE Signal Processing Letters*, vol. 21, no. 11, pp. 1413-1417, Nov. 2014, doi: 10.1109/LSP.2014.2333745

[15] Atri Sanyal, Swapan Kumar Samaddar, Amitabha Sinha, "A Generalized Architecture for Linear Transform", Proc. IEEE International Conference on CNC 2010, Oct 04-05, 2010, Calicut, Kerala, India, IEEE Computer society, pp. 55-60, ISBN: 97-0-7695-4209-6.

[16] A. Sanyal, S. K. Samaddar, "A Combined Architecture for FDCT Algorithm," Proc. 2012 Third International Conference on Computer and Communication Technology, Allahabad, 2012, pp. 33-37, doi: 10.1109/ICCCT.2012.16

[17] Atri Sanyal, Saloni Kumari, Amitabha Sinha, "An Improved Combined Architecture of the Four FDCT Algorithms", International Journal of Research in Electronics and Computer Engineering, (IJRECE), Vol 6 Issue 4 December 2018, ISSN: 2348-2281

[18] Davide Rossi, Fabio Campi, Simone Spolzino, Stefano Pucillo, Roberto Guerrieri, "A Heterogeneous Digital Signal Processor for Dynamically Reconfigurable Computing", IEEE Journal of Solid-State Circuits,Volume: 45, Issue: 8, Aug. 2010

[19] Sohan Purohit, Sai Rahul Chalamalasetti, Martin Margala, Wim Vanderbauwhede, "Throughput/Resource-Efficient Reconfigurable Processor for Multimedia Applications", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume: 21, Issue: 7, July 2013

[20] Vikram, K.N., Vasudevan, V. "Mapping Data-Parallel Tasks Onto Partially Reconfigurable Hybrid Processor Architectures", IEEE Transactions On Very Large Scale Integration (VLSI) Systems, Vol. 14, No. 9, September 2006.

[21] Atri Sanyal, Amitabha Sinha, "A Reconfigurable Architecture to Implement Linear Transforms of Image Processing Applications", International Conference on Frontiers in Computing and System (COMSYS 2020), Jalpaiguri, West Bengal, India, January 13-15,2020

[22] B. Heyne, C. C. Sun, J. Goetze, S. J. Ruan, "A Computationally Efficient High-Quality Cordic Based DCT", 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006

[23] N. Deo, "Graph Theory with applications to engineering and computer science", PHI, 2007

