c. Third Step: Signal Reconstruction

At this stage we obtain the vector **q̂** = **S**<sup>+</sup>**u**<sup>D</sup>, in which the operator **S**<sup>+</sup> operates on the data vector **u**<sup>D</sup> of the degraded image. Next, we obtain the estimate **b̂**<sub>RSF</sub> = **A**(α)**q̂** = (**S**<sup>+</sup>**S** + α**I**)<sup>–1</sup>**q̂** of the unknown signal. It is important to note that in the case α = 0, we have **b̂**(α = 0) = **Aq̂** = **S**<sup>#</sup>**u**<sup>D</sup>, where matrix **S**<sup>#</sup> = (**S**<sup>+</sup>**S**)<sup>–1</sup>**S**<sup>+</sup> is recognized to be the pseudoinverse (i.e., the well-known Moore-Penrose pseudoinverse) of the SFO matrix **S**.

d. Fourth Step: Restoration of the Trend

Having obtained the estimate **b̂**<sup>D</sup> of the unknown signal **b**<sup>D</sup> and knowing the mean value **m**<sub>b</sub>, we can obtain the optimum RSF estimate (20) simply by adding the prescribed mean value **m**<sub>b</sub> (referred to as the non-zero trend) to the reconstructed image frame as **b̂** = **m**<sub>b</sub> + **b̂**<sup>D</sup>.

**3.1.2 (ii) Partitioning process of the computational tasks**

One of the challenging problems of HW/SW co-design is to perform an efficient HW/SW partitioning of the computational tasks. The aim of the partitioning problem is to find which computational tasks can be implemented in an efficient hardware architecture, looking for the best trade-offs among the different solutions. The solution to the problem requires, first, the definition of a partitioning model that meets all the specification requirements (i.e., functionality, goals and constraints).

Note that from the formal SW-level co-design point of view, such DEDR techniques (20), (21), (22) can be considered as a properly ordered sequence of vector-matrix multiplication procedures that one can next perform in an efficient high-performance computational fashion following the proposed bit-level high-speed VLSI co-processor architecture. In particular, for implementing the fixed-point DEDR RSF and RASF algorithms, in this partitioning stage we consider developing a high-speed VLSI co-processor for the computationally complex matrix-vector SP operation, in aggregation with a powerful FPGA reconfigurable architecture via the HW/SW co-design technique. The operator **S**<sup>+</sup> operating on **u**<sup>D</sup> is mapped onto this co-processor, while the rest of the reconstructive SP operations are employed in SW with a 32-bit embedded processor (MicroBlaze).

This novel VLSI-FPGA platform represents a new paradigm for real-time processing of newer RS applications. Fig. 1 illustrates the proposed VLSI-FPGA architecture for the implementation of the RSF/RASF algorithms.

Once the partitioning stage has been defined, the selected reconstructive SP sub-task is to be mapped into the corresponding high-speed VLSI co-processor. In the HW design, a precision of 32 bits is used for performing all fixed-point operations, in particular, 9 integer bits and 23 fractional bits for the implementation of the co-processor. Such precision guarantees numerical computational errors of less than 10<sup>–5</sup> with reference to the MATLAB Fixed-Point Toolbox (Matlab, 2011).

**3.1.3 Aggregation of parallel computing techniques**

This sub-section focuses on how to improve the performance of the complex RS algorithms with the aggregation of parallel computing and mapping techniques onto HW-level massively parallel processor arrays (MPPAs).
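Before turning to the parallelization itself, the fixed-point format adopted in the HW design above (32 bits: 9 integer bits and 23 fractional bits) can be illustrated with a rough Python sketch. The helper names are hypothetical and the real design operates in hardware, but the sketch shows why a single Q9.23 quantization stays well below the stated 10<sup>–5</sup> error bound:

```python
def to_q9_23(x: float) -> int:
    """Quantize a real value to Q9.23 fixed point: 32 bits in total,
    9 integer bits (sign included) and 23 fractional bits."""
    q = int(round(x * (1 << 23)))
    # saturate to the representable signed 32-bit range
    return max(-(1 << 31), min((1 << 31) - 1, q))

def from_q9_23(q: int) -> float:
    """Recover the real value represented by a Q9.23 integer."""
    return q / (1 << 23)

def mac_q9_23(acc: int, a: int, b: int) -> int:
    """Fixed-point multiply-accumulate: the product of two Q9.23 values
    carries 46 fractional bits, so it is shifted back to 23 before adding."""
    return acc + ((a * b) >> 23)

x = 3.14159265
err = abs(from_q9_23(to_q9_23(x)) - x)
print(f"quantization error: {err:.2e}")  # bounded by 2**-24, far below 1e-5
```

A single quantization is bounded by half an ulp (2<sup>–24</sup> ≈ 6×10<sup>–8</sup>); errors accumulated over chained MACs remain within the 10<sup>–5</sup> figure quoted for the co-processor.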


In addition, to achieve such maximum possible parallelism in an algorithm, the so-called data dependencies in the computations must be analyzed (Moldovan & Fortes, 1986), (Kung, 1988). Formally, these dependencies are to be expressed via the corresponding dependence graph (DG). Following (Kung, 1988), we define the dependence graph **G** = [**P**, **E**] as a composite set where **P** represents the nodes and **E** represents the arcs or edges, in which each *e* ∈ **E** connects *p*<sub>1</sub>, *p*<sub>2</sub> ∈ **P** and is represented as *e*: *p*<sub>1</sub> → *p*<sub>2</sub>. Next, the data dependencies analysis of the matrix-vector multiplication algorithms should be performed, aimed at their efficient parallelization.
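As an illustrative sketch (not from the original text), the DG of a small matrix-vector product can be written out explicitly as the node set **P** and edge set **E**, with each edge *e*: *p*<sub>1</sub> → *p*<sub>2</sub> recording a partial-sum dependence between two index points:

```python
n = m = 4

# Nodes of the DG: one per index point (j, i) of y[j] += a[j][i] * x[i].
P = [(j, i) for j in range(n) for i in range(m)]

# Edges e: p1 -> p2. The partial sum y_j flows along the i axis
# (accumulation), so node (j, i) depends on node (j, i - 1).
E = [((j, i - 1), (j, i)) for j in range(n) for i in range(1, m)]

# Every edge must connect two nodes of P.
assert all(p1 in P and p2 in P for p1, p2 in E)
print(len(P), len(E))  # 16 nodes, 12 accumulation edges
```

Enumerating the edges this way is exactly the dependencies analysis the text calls for: a mapping onto a processor array is valid only if every edge is assigned a positive time delay.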

For example, the matrix-vector multiplication of an *n*×*m* matrix **A** with a vector **x** of dimension *m*, given by **y** = **Ax**, can be algorithmically computed as *y<sub>j</sub>* = Σ<sup>*m*</sup><sub>*i*=1</sub> *a<sub>ji</sub>x<sub>i</sub>*, for *j* = 1, …, *n*, where **y** and *a<sub>ji</sub>* represent the *n*-dimensional (*n*-*D*) output vector and the corresponding element of **A**, respectively. The first SW-level transformation is the so-called single assignment algorithm (Kung, 1988), (Castillo Atoche et al., 2010b) that performs the computing of the matrix-vector product. Such a single assignment algorithm corresponds to a loop unrolling method, in which the primary benefit of loop unrolling is to perform more computations per iteration. Unrolling also reduces the overall number of branches significantly and gives the processor more instructions between branches (i.e., it increases the size of basic blocks).

In (24), **Π** is a (1×*N*)-*D* vector (composed of the first row of **T**) which (in the segmenting terms) determines the time scheduling, and the (*N*–1)×*N* sub-matrix in (24), composed of the rest of the rows of **T**, determines the space processor specified by the so-called projection vector **d** (Kung, 1988). Next, such segmentation (24) yields the regular PA of (*N*–1)-*D* specified by the mapping

**K** = **TΔ**, (25)

where **K** is composed of the new revised schedule vector (represented by the first row of the PA) and the inter-processor communications (represented by the rest of the rows of the PA), and the matrix **Δ** specifies the data dependencies of the parallel representation of the algorithm.

For a more detailed explanation of this theory, see (Kung, 1988), (Castillo Atoche et al., 2010b). In this study, the following specifications are adopted for mapping the matrix-vector algorithm onto PAs: the mapping transformations with the schedule vectors **Π** = [1 1] and **Π** = [1 2], together with the projection vector **d** = [1 0], as illustrated in Fig. 2.

High-Speed VLSI Architecture Based on Massively Parallel Processor Arrays for Real-Time Remote Sensing Applications

Fig. 2. High-Speed MPPA approach for the reconstructive matrix-vector SP operation (showing, for *n* = *m* = 4, the matrix-vector DG with its hyper-planes and data-skewed inputs, the bit-level multiply-accumulate DG, the mapping transformations, and the resulting matrix-vector processor array (PA) with its array of bit-level PEs)
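The mapping (25) can be made concrete with a minimal numeric sketch. This follows the usual linear-mapping conventions of (Kung, 1988); the dependence matrix shown is an assumption for the 2-D matrix-vector DG (partial sums along one axis, propagated inputs along the other), not a matrix taken verbatim from the text:

```python
def matmul(A, B):
    """Tiny helper: multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Mapping matrix T: the schedule vector Pi = [1 1] as its first row, and a
# space row consistent with the projection vector d = [1 0] (PEs are
# indexed by the second coordinate of each index point).
T = [[1, 1],
     [0, 1]]

# Columns of Delta are the data dependencies of the 2-D matrix-vector DG:
# (1, 0) for the y partial sums, (0, 1) for the propagated x inputs.
Delta = [[1, 0],
         [0, 1]]

K = matmul(T, Delta)
print(K)  # first row: revised schedule; second row: inter-PE communications

# A valid schedule must assign every dependence a strictly positive delay.
assert all(delay > 0 for delay in K[0])
```

The first row of **K** confirms that both dependencies advance by one time step under this schedule, while the second row gives the hop each dependence takes between neighbouring PEs.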

Next, we examine the computation-related optimizations, followed by the memory optimizations. Typically, when working with nests of loops, we are working with multidimensional arrays, and computing in multidimensional arrays can lead to non-unit-stride memory access. Many of the optimizations can be performed on loop nests to improve the memory access patterns. The second SW-level transformation consists of transforming the matrix-vector single assignment algorithm into a locally recursive algorithm representation without global data dependencies (i.e., into a recursive form). At this stage, nested-loop optimizations are employed in order to avoid large routing resources, which would otherwise translate into a large amount of buffers in the final processor array architecture. The variable being broadcast in the single assignment algorithm is removed by passing the variable through each of the neighbour processing elements (PEs) in the DG representation.
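The broadcast-removal step can be sketched as follows (illustrative Python, not the authors' implementation): in the single-assignment form every row computation reads x[i] globally, while in the locally recursive form each x[i] is handed from one neighbouring PE to the next:

```python
def matvec_broadcast(A, x):
    """Single-assignment form: x[i] is globally broadcast to every j."""
    n, m = len(A), len(x)
    y = [0] * n
    for j in range(n):
        for i in range(m):
            y[j] += A[j][i] * x[i]  # global read of x[i]
    return y

def matvec_locally_recursive(A, x):
    """Locally recursive form: each x[i] travels through the PEs; PE j
    only reads the copy handed over by its neighbour PE j - 1."""
    n, m = len(A), len(x)
    y = [0] * n
    for i in range(m):
        xi = x[i]                   # x[i] enters the array at PE 0
        for j in range(n):
            y[j] += A[j][i] * xi    # local use of the propagated copy ...
            # ... then xi is passed unchanged to neighbour PE j + 1
    return y

A = [[1, 2], [3, 4]]
x = [5, 6]
assert matvec_broadcast(A, x) == matvec_locally_recursive(A, x) == [17, 39]
```

Both forms compute the same product; the point of the transformation is that the second needs only nearest-neighbour links in the PA, avoiding the global routing and buffering that a broadcast would require.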

Additionally, loop interchange techniques for rearranging a loop nest are applied. For performance, the interchange of inner and outer loops is performed to pull the computations into the center loop, where the unrolling is implemented.
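A minimal sketch of these two transformations together (illustrative function names, assuming a row-major layout of **A**): after loop interchange the innermost loop walks each row of **A** with unit stride, and that loop is then unrolled by a factor of 4:

```python
def matvec_reference(A, x):
    """i-outer traversal: consecutive accesses walk down a column of A,
    which is a non-unit stride for a row-major layout."""
    n, m = len(A), len(x)
    y = [0] * n
    for i in range(m):
        for j in range(n):
            y[j] += A[j][i] * x[i]
    return y

def matvec_interchanged_unrolled(A, x):
    """After loop interchange the i-loop is innermost (unit stride over
    each row of A) and is unrolled by 4: four MACs per iteration, fewer
    branches, larger basic blocks."""
    n, m = len(A), len(x)
    assert m % 4 == 0, "sketch assumes the unroll factor divides m"
    y = [0] * n
    for j in range(n):
        row, acc = A[j], 0
        for i in range(0, m, 4):
            acc += (row[i] * x[i] + row[i + 1] * x[i + 1]
                    + row[i + 2] * x[i + 2] + row[i + 3] * x[i + 3])
        y[j] = acc
    return y

A = [[1, 2, 3, 4], [5, 6, 7, 8]]
x = [1, 1, 1, 1]
assert matvec_interchanged_unrolled(A, x) == matvec_reference(A, x) == [10, 26]
```

The two routines return identical results; only the traversal order and the iteration granularity change, which is precisely what the interchange-plus-unrolling transformation exploits.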
