144 Applications of Digital Signal Processing

perform more computations per iteration. Unrolling also reduces the overall number of branches significantly and gives the processor more instructions between branches (i.e., it increases the size of basic blocks).

Next, we examine the computation-related optimizations followed by the memory optimizations. Typically, when we are working with nests of loops, we are working with multidimensional arrays. Computing on multidimensional arrays can lead to non-unit-stride memory access. Many optimizations can be performed on loop nests to improve the memory access patterns. The second SW-level transformation consists in transforming the single-assignment matrix-vector algorithm into a locally recursive algorithm representation without global data dependencies (i.e., in terms of a recursive form). At this stage, nested-loop optimizations are employed in order to avoid large routing resources, which translate into a large number of buffers in the final processor array architecture. The variable being broadcast in single-assignment algorithms is removed by passing the variable through each of the neighbouring processing elements (PEs) in a DG representation. Additionally, loop interchange techniques for rearranging a loop nest are also applied. For performance, the inner and outer loops are interchanged to pull the computations into the center loop, where the unrolling is implemented.

**3.1.4 Architecture design onto MPPAs**

Massively parallel co-processors are typically part of a heterogeneous hardware/software system. Each processor is a massively parallel system consisting of an array of PEs. In this study, we propose the MPPA architecture for the selected reconstructive SP matrix-vector operation. This architecture is first modelled as a processor array (PA); next, each processor is itself implemented as an array of PEs (i.e., in a highly pipelined bit-level representation). Thus, we achieve the pursued MPPA architecture by following the space-time mapping procedures.

First, some fundamental proven propositions are given in order to clarify the mapping procedure onto PAs.

Proposition 1. There are types of algorithms that can be expressed in terms of regular and localized DGs, for example, basic algebraic matrix-form operations, discrete inertial transforms like convolution, correlation techniques, digital filtering, etc., which can also be represented in matrix formats (Moldovan & Fortes, 1986), (Kung, 1988).

Proposition 2. As the DEDR algorithms can be considered as properly ordered sequences of vector-matrix multiplication procedures, they can be performed in an efficient computational fashion following the PA-oriented HW/SW co-design paradigm (Kung, 1988).

Following the propositions presented above, we are ready to derive the proper PA architectures. Moldovan & Fortes (1986) proved the mapping theory for the transformation **T**. The transformation $\mathbf{T}': \mathbf{G}^{N} \rightarrow \hat{\mathbf{G}}^{N-1}$ maps the *N*-dimensional DG ($\mathbf{G}^{N}$) onto the (*N*–1)-dimensional PA ($\hat{\mathbf{G}}^{N-1}$), where *N* represents the dimension of the DG (see proofs in (Kung, 1988) and details in (Castillo Atoche et al., 2010b)). Second, the desired linear transformation matrix operator **T** can be segmented in two blocks as follows

$$\mathbf{T} = \begin{bmatrix} \Pi \\ \Sigma \end{bmatrix} \tag{24}$$

where **Π** is a (1×*N*)-*D* vector (composed of the first row of **T**) which (in the segmenting terms) determines the time scheduling, and the (*N* – 1)×*N* sub-matrix **Σ** in (24) is composed of the remaining rows of **T**, which determine the space processor specified by the so-called projection vector **d** (Kung, 1988). Next, such segmentation (24) yields the regular PA of (*N*–1)-*D* specified by the mapping

$$\mathbf{T}\boldsymbol{\Phi} = \mathbf{K}, \tag{25}$$

where **K** is composed of the new revised vector schedule (represented by the first row of the PA) and the inter-processor communications (represented by the remaining rows of the PA), and the matrix **Φ** specifies the data dependencies of the parallel representation algorithm.

Fig. 2. High-Speed MPPA approach for the reconstructive matrix-vector SP operation

For a more detailed explanation of this theory, see (Kung, 1988), (Castillo Atoche et al., 2010b). In this study, the following specifications are employed for mapping the matrix-vector algorithm onto PAs: **Π** = [1 1] for the vector schedule, **d** = [1 0] for the projection vector, and **Σ** = [0 1] for the space processor. With these specifications the transformation matrix becomes $\mathbf{T} = \begin{bmatrix} \Pi \\ \Sigma \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$. For a simplified test case, we specify the following operational parameters: *m* = *n* = 4, a clock period of 10 ns and a 32-bit data-word length.

We are now ready to derive the specialized bit-level matrix-format MPPA-based architecture. Each processor of the vector-matrix PA is next derived as an array of processing elements (PEs) at bit-level scale. Once again, the space-time transformation is employed to design the bit-level architecture of each processor unit of the matrix-vector PA. The following specifications were considered for the bit-level multiply-accumulate architecture: **Π** = [1 2] for the vector schedule, **d** = [1 0] for the projection vector, and **Σ** = [0 1] for the space processor. With these specifications the transformation matrix becomes $\mathbf{T} = \begin{bmatrix} \Pi \\ \Sigma \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}$. The specified operational parameters are the following: *l* = 32 (the word length) and a clock period of 10 ns. The developed architecture is illustrated in Fig. 2.

High-Speed VLSI Architecture Based on Massively Parallel Processor Arrays for Real-Time Remote Sensing Applications 147

From the analysis of Fig. 2, one can deduce that with the MPPA approach, the real-time implementation of computationally complex RS operations can be achieved due to the highly pipelined MPPA structure.
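The two space-time mappings specified above can be checked numerically. The following is a minimal sketch rather than the authors' design flow: the helper names (`mat_vec`, `mat_mul`, `space_time_map`), the rectangular index spaces, the 32 × 32 extent of the bit-level index space, and the identity dependence matrix `Phi` are all assumptions made for illustration. It applies **T** to every node of the index space, verifies the usual validity conditions (**Σd** = 0 and **Πd** > 0), and evaluates the mapping (25).

```python
from itertools import product


def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def space_time_map(Pi, Sigma, d, shape):
    """Map each index-space node to (time step, PE coordinates) using T = [Pi; Sigma]."""
    T = [list(Pi)] + [list(row) for row in Sigma]
    # Nodes along the projection vector d must share one PE ...
    assert all(x == 0 for x in mat_vec(Sigma, d)), "Sigma d must be zero"
    # ... and must be executed at strictly increasing time steps.
    assert mat_vec([Pi], d)[0] > 0, "Pi d must be positive"
    mapping = {}
    for node in product(*(range(n) for n in shape)):
        image = mat_vec(T, node)
        mapping[node] = (image[0], tuple(image[1:]))
    return T, mapping


# Word-level PA from the text: Pi = [1 1], Sigma = [0 1], d = [1 0], m = n = 4.
T_word, word_map = space_time_map([1, 1], [[0, 1]], [1, 0], shape=(4, 4))

# Assumed dependence matrix: one unit dependence along each index axis.
Phi = [[1, 0], [0, 1]]
K_word = mat_mul(T_word, Phi)  # mapping (25): K = T * Phi

word_steps = len({t for t, _ in word_map.values()})
word_pes = len({pe for _, pe in word_map.values()})

# Bit-level array: Pi = [1 2]; the 32 x 32 index space is an assumption.
T_bit, bit_map = space_time_map([1, 2], [[0, 1]], [1, 0], shape=(32, 32))

if __name__ == "__main__":
    print("T =", T_word, "K =", K_word)
    print("word-level time steps:", word_steps, "PEs:", word_pes)
    print("bit-level nodes mapped:", len(bit_map))
```

Under these assumptions, each column of `K_word` can be read as a (delay, PE displacement) pair for one dependence, and the fact that no two nodes share the same (time, PE) pair indicates a conflict-free schedule.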

### **3.2 Bit-level design based on MPPAs of the high-speed VLSI accelerator**

As described above, the proposed partitioning of the VLSI-FPGA platform considers the design and fabrication of a low-power high-speed co-processor integrated circuit for the implementation of complex matrix-vector SP operations. Fig. 3 shows the Full Adder (FA) circuit that was used throughout the design.

An extensive design analysis of the bit-level matrix-format MPPA-based architecture was carried out, and the achieved hardware was studied comprehensively. In order to generate an efficient architecture for the application, various issues were taken into account. The main one was reducing the gate count, because it determines the number of transistors (i.e., silicon area) used for the development of the VLSI accelerator. Power consumption is also determined by it to some extent. The design also has to be scalable to other technologies. The VLSI co-processor integrated circuit was designed using a Low-Power Standard Cell library in a 0.6 µm double-poly triple-metal (DPTM) CMOS process using the Tanner Tools® software. Each logic cell from the library is designed at the transistor level. Additionally, S-Edit® was used for the schematic capture of the integrated circuit using a hierarchical approach, and the layout was generated automatically through the Standard Cell Place and Route (SPR) utility of L-Edit from Tanner Tools®.
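The logical behaviour of a full adder cell, the building block mentioned above, can be illustrated with a short behavioural model. This is a sketch at the logic level only, not the transistor-level cell of Fig. 3. The signal names `a`, `b`, `ci`, `so`, `co` and the ripple-carry chaining in `ripple_carry_add` are assumptions chosen for illustration; the chapter's pipelined multiply-accumulate array is organised differently.

```python
def full_adder(a, b, ci):
    """One-bit full adder: return (so, co) for input bits a, b and carry-in ci."""
    so = a ^ b ^ ci                   # sum output
    co = (a & b) | (ci & (a ^ b))     # carry output
    return so, co


def ripple_carry_add(x, y, width=32):
    """Add two unsigned integers by chaining `width` full adder cells."""
    carry = 0                         # carry-in of the first cell tied to '0'
    total = 0
    for bit in range(width):
        so, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        total |= so << bit
    return total, carry               # (sum modulo 2**width, final carry-out)


if __name__ == "__main__":
    print(ripple_carry_add(5, 7))
    print(ripple_carry_add(2**32 - 1, 1))
```

The default `width=32` mirrors the 32-bit word length quoted in the text. In a gate-count-driven design, the number of such cells is one of the main contributors to the silicon area.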

#### **4. Performance analysis**

#### **4.1 Metrics**

Fig. 3. Transistor-level implementation of the Full Adder Cell.

In the evaluation of the proposed VLSI-FPGA architecture, we consider a conventional side-looking synthetic aperture radar (SAR) with the fractionally synthesized aperture as an RS imaging system (Shkvarko et al., 2008), (Wehner, 1994). The regular SFO of such SAR is factored along two axes in the image plane: the azimuth or cross-range coordinate (horizontal axis, *x*) and the slant range (vertical axis, *y*), respectively. The conventional triangular, $\Psi\_{r}(y)$, and Gaussian approximation, $\Psi\_{a}(x)=\exp(-(x)^{2}/a^{2})$ with the adjustable fractional parameter *a*, are considered for the SAR range and azimuth ambiguity function (AF) (Wehner, 1994). In analogy to the image reconstruction, we employed the quality metric defined as an improvement in the output signal-to-noise ratio (IOSNR)

$$\text{IOSNR} = 10\log\_{10}\frac{\sum\_{k=1}^{K} \left(\hat{b}\_{k}^{(\text{MSF})} - b\_{k}\right)^{2}}{\sum\_{k=1}^{K} \left(\hat{b}\_{k}^{(p)} - b\_{k}\right)^{2}}; p = 1, 2 \tag{26}$$

where $b\_{k}$ represents the value of the *k*th element (pixel) of the original image **B**, $\hat{b}\_{k}^{(\text{MSF})}$ represents the value of the *k*th element (pixel) of the degraded image formed by applying the MSF technique (19), and $\hat{b}\_{k}^{(p)}$ represents the value of the *k*th pixel of the image reconstructed with the two developed methods, *p* = 1, 2, where *p* = 1 corresponds to the RSF algorithm and *p* = 2 corresponds to the RASF algorithm.

The quality metric defined by (26) makes it possible to quantify the performance of different image enhancement/reconstruction algorithms in a variety of aspects. According to this metric, the higher the *IOSNR*, the greater the improvement of the image enhancement/reconstruction achieved with the particular algorithm employed.
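As a hedged illustration of how the metric (26) could be evaluated in software, the sketch below computes the IOSNR from three images given as flat pixel sequences. The function name `iosnr_db` and the toy pixel values are assumptions for illustration only; they are not taken from the reported experiments.

```python
import math


def iosnr_db(b, b_msf, b_rec):
    """IOSNR per (26), in dB.

    b     : pixels of the original image
    b_msf : pixels of the degraded (MSF-formed) image
    b_rec : pixels of the reconstructed image (RSF or RASF)
    All three are flat sequences of the same length K.
    """
    if not (len(b) == len(b_msf) == len(b_rec)):
        raise ValueError("all images must contain the same number of pixels")
    num = sum((m - o) ** 2 for m, o in zip(b_msf, b))   # error energy before enhancement
    den = sum((r - o) ** 2 for r, o in zip(b_rec, b))   # error energy after enhancement
    return 10.0 * math.log10(num / den)


if __name__ == "__main__":
    original = [0.0, 0.0]          # toy data, not from the chapter
    degraded = [10.0, 10.0]
    reconstructed = [1.0, 1.0]
    print(iosnr_db(original, degraded, reconstructed))
```

A positive value means the reconstruction is closer to the original than the MSF image is; a perfect reconstruction would make the denominator zero, so a caller may want to guard that case.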

#### **4.2 RS implementation results**

The reported RS implementation results are achieved with the VLSI-FPGA architecture based on MPPAs, for the enhancement/reconstruction of RS images acquired with different fractional SAR systems characterized by the PSF of a Gaussian "bell" shape in both directions of the 2-D scene (in particular, 16 pixels wide at 0.5 of its maximum for the 1K-by-1K BMP pixel-formatted scene). The images are stored on and loaded from a compact flash device for the image enhancement process, i.e., particularly for the RSF and RASF techniques. The initial test scene is displayed in Fig. 4(a). Fig. 4(b) presents the same original image but degraded with the matched space filter (MSF) method. The qualitative HW results for the RSF and RASF enhancement/reconstruction procedures are shown in Figs. 4(c) and 4(d) with the corresponding IOSNR quantitative performance enhancement metrics reported in the figure captions (in the [dB] scale).

Fig. 4. VLSI-FPGA results for SAR images with 15 dB of SNR: (a) Original test scene; (b) degraded MSF-formed SAR image; (c) RSF reconstructed image (IOSNR = 7.67 dB); (d) RASF reconstructed image (IOSNR = 11.36 dB).

The quantitative measures of the image enhancement/reconstruction performance achieved with the particular employed DEDR-RSF and DEDR-RASF techniques, evaluated via the IOSNR metric (26), are reported in Table 1 and Fig. 4.

| *SNR* [dB] | RSF Method *IOSNR* [dB] | RASF Method *IOSNR* [dB] |
|---|---|---|
| 5 | 4.36 | 7.94 |
| 10 | 6.92 | 9.75 |
| 15 | 7.67 | 11.36 |
| 20 | 9.48 | 12.72 |

Table 1. Comparative table of image enhancement with DEDR-related RSF and RASF algorithms

From the RS performance analysis with the VLSI-FPGA platform of Fig. 4 and Table 1, one may deduce that the RASF method outperforms the robust non-adaptive RSF in all simulated scenarios.

#### **4.3 MPPA analysis**

The matrix-vector multiplier chip and all modules of the MPPA co-processor architecture were designed by gate-level description. As already mentioned, the chip was designed using a Standard Cell library in a 0.6 µm CMOS process (Weste & Harris, 2004), (Rabaey et al., 2003). The resulting integrated circuit core has dimensions of 7.4 mm × 3.5 mm. The total gate count is about 32K using approximately 185K transistors. The 72-pin chip will be packaged in an 80 LD CQFP package and can operate both at 5 V and 3 V. The chip is illustrated in Fig. 5.

Fig. 5. Layout scheme of the proposed MPPA architecture
