**10. FPGA implementation**

Quick time-to-market, low cost and high performance are typically the three goals that digital system designers wish to achieve when developing new products. Although each goal taken individually is achievable, the set of three is generally beyond the capabilities of traditional design and implementation approaches (Villasenor et al., 1995; Villasenor & Mangione-Smith, 1997; Barr, 1998; Ritter & Molitor, 2000; Chrysafis & Ortega, 2000; Lafruit et al., 2000; Russel & Wayne, 2001; Ebrahimi et al., 2002; Nibouche et al., 2000, 2001a, 2001b, 2001c, 2001d, 2002, 2003; Katona et al., 2006; Angelopoulou et al., 2008; Lande et al., 2010). Versatile hardware such as general purpose processors (GPPs), for example, can perform a wide range of operations and tasks, but fails to reach the speed of more specialised hardware. In contrast, application-specific hardware, such as Application Specific Integrated Circuits (ASICs), can perform a restricted set of operations or tasks more quickly, but at the cost of generality. Hence, reconfigurable computing, generally in the form of Field Programmable Gate Arrays (FPGAs), appears to be the promised land for hardware designers. This old/new paradigm combines the flexibility of software with the performance of hardware, which leads to a good trade-off between speed and generality. Unlike custom hardware in the form of ASICs, which cannot be reused for a problem even slightly different from the one it was designed for, FPGA-based configurable hardware allows modifications at almost any stage of the design process. In fact, configurable hardware is, by its very nature, easily upgraded to suit changes to an initial design. Used in a desktop, reconfigurable hardware can be tailored to accelerate applications that require a speed superior to that offered by general purpose processors.
The hardware here needs to adapt itself to continual changes in response to end users' needs. Obviously, the reconfigurable capabilities of such hardware will not eliminate the need for the general-purpose microprocessors running in today's Personal Computers (PCs). In fact, *"FPGAs will never replace microprocessors for general-purpose computing tasks"*, as stated by Villasenor and Mangione-Smith (Villasenor & Mangione-Smith, 1997).

The idea of reconfigurable computing was first introduced in the late 1960s at the University of California at Los Angeles (UCLA) (Villasenor & Mangione-Smith, 1997; Barr, 1998). However, the real emergence of this new paradigm for hardware computation was driven by the commercialisation of the first SRAM-based FPGA by Xilinx Corporation in 1986 (Russel & Wayne, 2001). The first configurable devices from both Xilinx Corporation and Altera Corporation, typically built on a fine-grained structure, allowed a system speed in the range of 2 MHz to 5 MHz and a chip area of fewer than 100 logic blocks (Russel & Wayne, 2001). The efforts of academics and industry since then have brought to light new developments but also new challenges. In fact, the reconfigurable hardware field

[Figure: block diagram with the blocks Image Source, Nth Level wavelet Decomposition, Watermark, 1st Level wavelet Decomposition, Fusion Process, Inverse Nth Level wavelet Decomposition and Watermarked Image]

Fig. 23. Wavelet-based watermarking system

has matured dramatically, either through developments in microelectronic technology, which led to the emergence of a new range of devices providing more than a million system gates (e.g. the Xilinx Virtex family), or through the continual emergence of a wide range of FPGA-based systems.

In general, FPGA devices are organised as 2D arrays of configurable logic blocks or logic elements. The parallel nature of FPGA devices makes them very good targets for applications that require parallel processing, such as image and video processing. In such applications, FPGA devices are used either as co-processors or as accelerators (real time applications). It is not the aim of this section to survey the field of wavelet-based FPGA implementation, but rather to highlight some implementations of the DWT for applications in image/video processing (in line with section 9).

Due to its high computational complexity, real time video compression has always been a very challenging topic for digital system designers, and the implementation of such systems on FPGAs is no exception. In probably one of the earliest works in the field, Villasenor et al. (Villasenor et al., 1995) investigated wavelet transform based video compression algorithms for use in low-power wireless communications. Using this work as a basis, the same authors further described two implementations using a single FPGA (Schoner et al., 1995). In the first approach, the proposed video compression scheme is directed towards low-complexity implementations using a single in-system reprogrammable FPGA. Optimising the algorithm to fit the system results in an efficient implementation; however, the system is limited to a single compression algorithm. In the second approach, to allow more flexibility, the FPGA chip is combined with an external special purpose Video Signal Processor (VSP). The FPGA/VSP combination allows four common compression algorithms to be implemented and executed in real time. Both design schemes were implemented on a Xilinx FPGA. The first design runs at 20 **f**rames **p**er **s**econd (fps) when processing 256x256 frames with a spatial precision of 8 bits. It includes a wavelet transform, a simplified quantiser and a run-length encoder. The second scheme is capable of implementing a DCT, a 2-D FIR filter, a Vector Quantisation (VQ) scheme and the wavelet transform using a single generic equation. Its performance depends on the algorithm: 13.3 fps for a 7x7 mask 2-D filter, 55 fps for an 8x8 block DCT, 7.4 fps for a 4x4 VQ (at 1/2 bit per pixel) and 35.7 fps for a single wavelet stage.
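To put the first design's figures in perspective, the raw input rate it has to sustain can be worked out directly from the numbers quoted above. The short sketch below does only that arithmetic; the function name is hypothetical, and reading the stated 8-bit precision as 8 bits per pixel is an assumption.

```python
def raw_pixel_rate(width, height, fps):
    """Number of pixels per second entering the compressor."""
    return width * height * fps


# Figures quoted in the text for the first (single-FPGA) design.
pixels_per_second = raw_pixel_rate(256, 256, 20)

# Assumption: one 8-bit sample per pixel (greyscale input).
bits_per_second = pixels_per_second * 8

print(pixels_per_second)  # 1310720 pixels/s
print(bits_per_second)    # 10485760 bits/s, i.e. roughly 10.5 Mbit/s uncompressed
```

The frame sizes used for the second scheme are not stated, so the same calculation cannot be repeated for its frame rates.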

Partitioning images prior to computation is a well known technique in the field of image processing. It has been widely used in DCT-based image compression schemes. In the last decade, this technique has been adopted in the new wavelet-based JPEG 2000 compression standard (Ebrahimi et al., 2002). In (Ritter & Molitor, 2000), a biorthogonal Cohen-Daubechies-Feauveau (CDF) 5/3 wavelet pair followed by an **E**mbedded **Z**erotree **E**ncoding (EZT) technique is used in two compression schemes, one lossy and one lossless. Since the 5/3 pair is an integer-to-integer wavelet, a lifting scheme based architecture is used for the implementation. In the lossless compression scheme, the image is partitioned into a set of 32x32 tiles before processing. The system is then implemented on an FPGA prototyping board, where it achieved an operating speed of 20 MHz. In the second scheme, in order to avoid an excessive increase in internal memory, a rearrangement of the filtered and decimated outputs is proposed (interlocked external memory access). Because of its integer-to-integer nature, as well as its adoption in the JPEG 2000 standard, the biorthogonal 5/3 wavelet is the focus of many studies. Since the wavelet transform

The Wavelet Transform for Image Processing Applications 419

computation schedule on FPGAs, *Journal of signal processing systems,* Vol. 51, pp. 3-21

Barr, M. (1998). A Reconfigurable Computing Primer, *Multimedia Systems Design,* pp. 44-47

Bradley, J.; Brislawn, C. & Hopper, T. (1993). *The FBI Wavelet/Scalar Quantization Standard for Gray-scale Fingerprint Image Compression,* Tech. Report LA-UR-93-1659, Los Alamos Nat'l Lab, Los Alamos

Brislawn, C. M. (April 2010). Group Lifting Structures for Multirate Filter Banks II: Linear Phase Filter Banks, *IEEE Transactions on Signal Processing,* Vol. 58, No. 4, pp. 2078-2087, ISSN 1053-587X

Burt, P. J. & Adelson, A. E. (1983). The Laplacian pyramid as a compact image code, *IEEE Transactions on Communications,* Vol. 31, No. 4, (Apr 1983), pp. 532-540, ISSN 0090-6778

Burrus, C. S.; Gopinath, R. A. & Guo, H. (1998). *Introduction to Wavelets and Wavelet Transforms: A primer,* Prentice Hall

Chappelier, V. & Guillemot, C. (2006). Oriented Wavelet Transform for Image Compression and Denoising, *IEEE Transactions on Image Processing,* Vol. 15, No. 10, pp. 2892-2903, ISSN 1057-7149

Chen, G. & Qian, S. (2011). Denoising of Hyperspectral Imagery Using Principal Component Analysis and Wavelet Shrinkage, *Geoscience and Remote Sensing, IEEE Transactions*, Vol. 49, No. 3, pp. 973-980, ISSN 0196-2892

Chrysafis, C. & Ortega, A. (2000). Line based reduced memory wavelet image compression, *IEEE Transactions on Image Processing,* Vol. 9, No. 3, pp. 378-389, 010-1024, ISSN 1057-7149

Cunha, A. L.; Zhou, J. & Do, M. N. (October 2006). The nonsubsampled contourlet transform: Theory, design, and applications, *IEEE Transactions on Image Processing,* Vol. 15, No. 10, pp. 3089-3101, ISSN 1057-7149

Cohen, A.; Daubechies, I. & Feauveau, J. (1992). Biorthogonal bases of compactly supported wavelets, *Communications on Pure and Applied Mathematics,* Vol. 45, No. 5, pp. 485-560

Cody, M. A. (1994). The Wavelet Packet Transform, *Dr. Dobb's Journal,* Vol. 19, Apr. 1994

Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets, *Communications on Pure and Applied Mathematics,* Vol. 41, pp. 909-996

Daubechies, I. (1992). Ten lectures on Wavelets, *SIAM,* Philadelphia

Daubechies, I. (Mar. 1993). Orthonormal bases of compactly supported wavelets II, variations on a theme, *SIAM Journal of Mathematical Analysis,* Vol. 24, No. 2, pp. 499-519

David F. W. (2002). *Wavelet Analysis,* Birkhauser, ISBN-0-8176-3962-4

Do, M. N. & Vetterli, M. (January 2003). The finite ridgelet transform for image representation. *IEEE Transactions on Image Processing,* Vol. 12, No. 1, pp. 6-28, ISSN 1057-7149

Do, M. N. & Vetterli, M. (December 2005). The contourlet transform: An efficient directional multiresolution image representation. *Transactions on Image Processing,* Vol. 14, No. 12, pp. 2091-2106, ISSN 1057-7149

Donoho, D. L. & Johnstone, I. M. (1994). Ideal Spatial Adaptation via Wavelet Shrinkage, *Biometrika,* Vol. 81, No. 3, pp. 425-455, Online ISSN 1464-3510 - Print ISSN 0006-3444

algorithms are inherently multi-level and require complex computation schedules in hardware, a comparison of different computation schedule algorithms is presented in (Angelopoulou et al., 2008). The most widely used schedule algorithms, such as the row-column based algorithm (Mallat, 1989), the line based algorithm (Chrysafis & Ortega, 2000) and the block based algorithm (Lafruit et al., 2000), are implemented on FPGA using the lifting scheme and a 2D DWT architecture. The 2D DWT FPGA implementation is fully parameterised. Based on the lifting scheme, Lande et al. (Lande et al., 2010) introduce a robust invisible watermarking method for use with still images. The scheme is incorporated in the JPEG 2000 lossless algorithm, which features integer-to-integer biorthogonal 5/3 CDF wavelet filters. The proposed algorithm targets the consumer electronics market. The objectives of the proposed FPGA implementation of this wavelet-based watermarking scheme include low power usage, real time performance, robustness and ease of integration.
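The integer-to-integer 5/3 lifting steps and the row-column schedule mentioned above can be illustrated in software. The following is a minimal sketch, not the architecture of any of the cited designs: it assumes even-length signals, uses the predict/update form of the reversible 5/3 filter with symmetric boundary extension, and all function names and the sub-band naming are hypothetical. Only additions and shifts are needed, which is what makes the scheme attractive for FPGA logic.

```python
def lift53_forward(x):
    """One level of the reversible 5/3 lifting transform on a list of ints."""
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("expected an even-length signal with at least 2 samples")
    half = n // 2
    even, odd = x[0::2], x[1::2]
    # Predict step: detail = odd sample minus the floor-average of its even neighbours.
    d = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]  # symmetric extension
        d.append(odd[i] - ((even[i] + right) >> 1))
    # Update step: approximation = even sample plus a rounded quarter of nearby details.
    s = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]  # symmetric extension
        s.append(even[i] + ((left + d[i] + 2) >> 2))
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward exactly by running the lifting steps in reverse order."""
    half = len(s)
    even = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        even.append(s[i] - ((left + d[i] + 2) >> 2))
    odd = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        odd.append(d[i] + ((even[i] + right) >> 1))
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x


def dwt53_2d(image):
    """Row-column schedule: filter every row, then every column of each half."""
    rows = [lift53_forward(list(r)) for r in image]
    low = [s for s, _ in rows]
    high = [d for _, d in rows]

    def filter_columns(block):
        cols = [lift53_forward(list(c)) for c in zip(*block)]
        lo = [list(r) for r in zip(*[s for s, _ in cols])]
        hi = [list(r) for r in zip(*[d for _, d in cols])]
        return lo, hi

    ll, lh = filter_columns(low)   # sub-band names follow one common convention
    hl, hh = filter_columns(high)
    return ll, lh, hl, hh


signal = [10, 12, 14, 13, 9, 7, 8, 20]
approx, detail = lift53_forward(signal)
restored = lift53_inverse(approx, detail)  # equals signal: the transform is lossless
```

Because each lifting step only adds an integer computed from the other half of the samples, the inverse subtracts the same integer and the round trip is exact. The line based and block based schedules compute the same coefficients and differ in the order in which samples are visited and buffered.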

Denoising still images and video sequences is another favoured application area of the wavelet transform (see section 9). Katona et al. (Katona et al., 2006) propose a real time wavelet-based video denoising system and its implementation on FPGA. The method adopts a parallel approach to implement an advanced wavelet domain noise filtering algorithm, which uses a non-decimated wavelet transform. The approach relies on the "à trous" wavelet algorithm and the Daubechies minimum phase wavelet (Daub4). The proposed implementation is decentralised and distributed over two FPGAs. As a proof of concept, digitised television signals are adopted as real time video sources.
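The non-decimated ("à trous") decomposition can be sketched in one dimension as follows. This is an illustrative analysis stage only, not the cited two-FPGA design: the standard 4-tap Daubechies coefficients are assumed here as a stand-in for the filter the authors call Daub4, periodic boundary handling is an assumption, and the function names are hypothetical. At level j the filter taps are spread 2^j samples apart (the "holes"), and no sample is discarded, so every output has the length of the input.

```python
import math

# Assumption: the standard 4-tap Daubechies low-pass filter.
SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
H = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
# High-pass filter derived from H by the alternating-flip rule.
G = [(-1) ** k * H[len(H) - 1 - k] for k in range(len(H))]


def circular_filter(signal, taps, step):
    """Filter with taps spaced `step` samples apart, wrapping around at the ends."""
    n = len(signal)
    return [
        sum(t * signal[(i + k * step) % n] for k, t in enumerate(taps))
        for i in range(n)
    ]


def a_trous(signal, levels):
    """Return the final approximation and one detail band per level."""
    approx = list(signal)
    details = []
    for j in range(levels):
        step = 2 ** j  # 2**j - 1 zeros are implicitly inserted between taps
        details.append(circular_filter(approx, G, step))
        approx = circular_filter(approx, H, step)
    return approx, details


coarse, bands = a_trous([5.0] * 16, levels=3)
# Every band keeps all 16 samples; a constant input produces zero detail.
```

Since each output sample depends only on a fixed set of input samples, all positions can be computed independently, which is the kind of parallelism an FPGA implementation can exploit. A denoising stage would then shrink the detail bands before reconstruction.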
