
## **Meet the editors**

Gustavo A. Ruiz was born in Burgos, Spain, in 1962. He received the M.Sc. degree in physics in 1985 from the University of Navarra, Spain, and the Ph.D. in physical science in 1989 from the University of Cantabria, Santander, Spain. Since 1985, he has been with the Department of Electronics and Computers at the University of Cantabria, where he is currently an Associate Professor. His current research interests are mainly focused on VLSI architectures for signal processing and high-speed arithmetic circuits.

Juan A. Michell was born in Cáceres, Spain, in 1952. He received the M.S. degree and the Ph.D. in physical sciences from the University of Cantabria, Spain, in 1974 and 1977, respectively. Since 1974 he has been with the Department of Electronics and Computers at the University of Cantabria, where he was appointed Professor in Electronics in 1991. His current research interests are VLSI architectures and integrated circuit design for digital signal processing applications.

## **Contents**

**Preface**

**Section 1 Real-Time Audio Applications**

Chapter 1 **Dynamic Reconfigurable on the Lifting Steps Wavelet Packet Processor with Frame-Based Psychoacoustic Optimized Time-Frequency Tiling for Real-Time Audio Applications**
Alexey Petrovsky, Maxim Rodionov and Alexander Petrovsky

Chapter 2 **Low Computational Robust F0 Estimation of Speech Based on TV-CAR Analysis**
Keiichi Funaki and Takehito Higa

**Section 2 Optical Signal Processing**

Chapter 3 **Optical Signal Processing: Data Exchange**
Jian Wang and Alan E. Willner

Chapter 4 **All-Optical Quaternary Logic Based Information Processing: Challenges and Opportunities**
Jitendra Nath Roy and Tanay Chattopadhyay

**Section 3 Image and Video Processing**

Chapter 5 **Video Encoder Implementation on Tilera's TILEPro64™ Multicore Processor**
José Parera-Bermúdez, Javier Casajús-Quirós and Igor Arambasic

Chapter 6 **Low Complexity Interpolation Filters for Motion Estimation and Application to the H.264 Encoders**
Georgios Georgis, George Lentaris and Dionysios Reisis

Chapter 7 **A Real-Time Video Encoding Scheme Based on the Contourlet Transform**
Stamos Katsigiannis, Georgios Papaioannou and Dimitris Maroulis

Chapter 8 **Algorithms for Efficient Computation of Convolution**
Karas Pavel and Svoboda David

**Section 4 Advanced Architectures and Implementations**

Chapter 9 **Self-Organizing Architectures for Digital Signal Processing**
Daniele Peri and Salvatore Gaglio

Chapter 10 **A Digital Signal Processing Architecture for Soft-Output MIMO Lattice Reduction Aided Detection**
Alan T. Murray and Steven R. Weller

Chapter 11 **Progress of Doppler Ultrasound System Design and Architecture**
Baba Tatsuro

Chapter 12 **FPGA Based Serial and Single-Clock Cycle Pipelined Fast Fourier Transforms in a Radio Detection of Cosmic Rays**
Zbigniew Szadkowski

## **Preface**

In the twenty-first century, digital signal processing (DSP) is one of the most promising and powerful emerging technologies, and it has caused revolutionary changes in a huge variety of applications, including information technology, communications, consumer electronics, audio/video and speech processing, medicine, geophysics and security. In an increasingly demanding market, designers must apply innovative methods and architectures to meet stringent system requirements and performance constraints.

The development of complex signal processing systems is a multidisciplinary task that involves the design of architectures matched to the algorithms intended to run on the particular system. Since recent (and foreseen) DSP algorithms have become more complex, their implementation in hardware has ramifications in all areas of design, including architecture, software, circuits, and even modification of the original algorithm. The increasing demand for high-performance, low-power systems has required researchers to devise innovative design methodologies and architectures that can achieve these goals. To develop efficient and cost-effective DSP systems, experts in each of these areas are continuously in demand for the intended application domains.

Motivated by this flurry of activity in DSP in both industry and academia, this book presents some of the recent advances in basic implementations of DSP tasks, covering hardware/software solutions for application-specific circuits and programmable DSP devices. Areas covered include architectures for basic operations and elementary functions, parallel processing and pipelining, application-specific array processors, and programmable digital signal processors. This book is of particular interest to electronic engineering and computer science students and researchers, and will also benefit practitioners in digital signal processor circuit design.

With this purpose in mind, the book is divided into four related sections according to subject matter. The first section contains two practical chapters on real-time audio applications. The second section is devoted to the optical signal processing field, where new alternative approaches and the challenges to achieving superior performance are presented. The third section is dedicated to the design and implementation of useful applications in the image and video compression field, including the video coding standard H.264/AVC and real-time solutions. The last section concludes the book by outlining some practical advanced architectures and implementations for DSP, focusing on applications to real-world problems.

Finally, we would like to extend our gratitude and appreciation to all the authors who have made this humble book possible by contributing their invaluable research. Special thanks go to the InTech publishing process managers and to Ms. Natalia Reinić for the editorial assistance provided, for promoting research and innovation, and for making it freely available to the community through this open access platform.

**Gustavo A. Ruiz and Juan A. Michell**
ruizrg@unican.es, michellj@unican.es
Department of Electronics and Computers
University of Cantabria
Spain

**Section 1**

**Real-Time Audio Applications**

**Chapter 1**

## **Dynamic Reconfigurable on the Lifting Steps Wavelet Packet Processor with Frame-Based Psychoacoustic Optimized Time-Frequency Tiling for Real-Time Audio Applications**

Alexey Petrovsky, Maxim Rodionov and Alexander Petrovsky

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51604

© 2013 Petrovsky et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

### **1. Introduction**

The discrete wavelet packet transform (DWPT), as a generalization of the standard wavelet transform, provides more flexible choices for the time-frequency (time-scale) representation of signals [1] in many applications, such as the design of cost-effective real-time multimedia systems and high-quality audio transmission and storage. In parallel to the definition of the ISO/MPEG standards, several audio coding algorithms have been proposed that use the DWPT, in particular the adaptive wavelet packet transform, as the tool to decompose the signal [2],[3]. In practice, the DWPT is often implemented using a tree-structured filter bank [2],[3],[4]. The DWPT is a set of transformations that admits any type of tree-structured filter bank, each providing a different time-frequency tiling map. Many architectures have been proposed for computing the discrete wavelet transform in the past; however, this is not the case for the DWPT, and there are very few papers on the development of specific architectures for it. In [5], a programmable DWPT processor is designed using a two-buffer memory system and a single multiplier-accumulator (MAC) to calculate the different subbands. The method of [6] exploits the in-place nature of the DWPT algorithm and, to increase the throughput, uses a single processing element consisting of parallel multipliers and adders for the low-pass and high-pass filters (the wavelet butterflies), with their count determined by the number of filter taps. A folded pipelined architecture is also proposed in [7] to speed up the throughput; it consists of MACs communicating through memory banks to compute each level of the total decomposition.

Applying the lifting scheme [8] to the construction of the wavelet filter bank significantly reduces the number of arithmetic operations necessary to compute the transform. A folded parallel architecture for the lifting-based DWPT was presented in [9]; it consists of a group of MACs operating in parallel on data prestored in a memory bank. In [10], an architecture is proposed that uses a direct implementation of a lifting-based wavelet filter to perform one level of the DWPT at a time. The main drawback of these existing architectures is that they all use memory to store the intermediate coefficients and involve intense memory access during the computation. A folded architecture based on the recursive pyramid algorithm for computing the lifting-based multilevel DWPT is presented in [11]; however, its scheduling and control complexity is high, which introduces large numbers of switches, multiplexers and control signals, and the architecture is not regular and needs to be modified for different numbers of DWPT levels. A folded architecture for lifting-based wavelet filters is proposed in [12] to compute the wavelet butterflies in different groups simultaneously at each decomposition level. According to the comparison results, this architecture is more efficient than the previously proposed architectures in terms of memory access, hardware regularity and simplicity, and throughput. It should be noted, however, that this processor architecture is effective only for computing the full-tree DWPT; it provides no mechanism for managing a DWPT with best-tree searching.


Algorithm transformation techniques employed in high-speed DSP system design are presented in [13]. All of the above-mentioned techniques are applied during the processor design phase, and their implementation is time-invariant; this class of signal processing techniques is therefore referred to as static techniques. Recently, dynamic techniques at both the circuit and algorithmic levels have been proposed [14]. These techniques are based on the principle that the input signal is usually non-stationary and, hence, that it is better (from a coding perspective) to adapt the algorithm and architecture to the input signal. Such systems are referred to as reconfigurable signal processing systems [15],[16]. The key goal of these techniques is to improve algorithm performance by exploiting variability in the data and the channel.

Our approach is to use dynamic algorithm transformation (DAT) to design an application-specific reconfigurable lifting-based DWPT pipeline processor, in particular for real-time audio signal processing. The principle behind DAT techniques is to define parameters of the input audio signal (subband entropy) and of the output encoded sequences (subband rate) for the given embedded processor architecture. Adaptive wavelet analysis for audio signal processing is particularly interesting if psychoacoustic information is considered at the DWPT decomposition scale. Due to the lack of selectivity of wavelet filter banks, the psychoacoustic information is computed in the wavelet domain.

#### **2. Flexible tree-structured signal expansion based on DWPT**

The DWPT algorithm is a generalization of the discrete wavelet transform that can be represented as a filter bank with a tree structure [3] (see figure 1). Within a given node number $n$ of the tree at any level $l$ ($n = 0, \ldots, 2^l - 1$, $l \in \mathbb{Z}$), the input $x_{l,n,k}$ (where $k$ indexes the signal samples) is separated into low-frequency (LF) $x_{l+1,2n,k}$ and high-frequency (HF) $x_{l+1,2n+1,k}$ components using a pair of wavelet filters $h(z)$ and $g(z)$ with finite impulse response (FIR), after which each subband signal is down-sampled by a factor of two. The function block that implements this separation of the input signal is called a dual-channel analysis filter bank.

**Figure 1.** DWPT tree structure (left) and dual-channel filter bank, the wavelet butterfly (right)


Thus, a specific node $(l, n)$ corresponds to the frequency range $(n \cdot 2^{-l}, (n+1) \cdot 2^{-l})$, normalized to the Nyquist frequency $f_N$. At each level of decomposition the frequency resolution doubles, while the time resolution is halved. The DWPT is a complete decomposition of the signal into low and high frequencies. Varying the resolution in the frequency and time domains allows a more detailed decomposition, for example at the lower frequencies, which increases the frequency resolution there at the expense of time resolution; this is the feature exploited by the adaptive DWPT. The advantage of the DWPT is the ability to select the decomposition tree flexibly (see figure 2), based on the nature of the signal. The choice of tree structure can be made from known features of the signal, or carried out dynamically, "arranged" for the frame currently being processed [2].

**Figure 2.** DWPT tree structure examples and the corresponding magnitude responses of the filter bank
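To make the tree construction concrete, the following minimal numpy sketch (an illustration, not the authors' processor) builds a full DWPT of a given depth by recursively applying the dual-channel analysis filter bank; the Haar pair is used only to keep the sketch self-contained, and any FIR pair $h(z)$, $g(z)$ such as db4 can be substituted.

```python
import numpy as np

def analysis_split(x, h, g):
    """One tree node: filter with the low-pass h and high-pass g, then
    down-sample each subband by two (x_{l,n} -> x_{l+1,2n}, x_{l+1,2n+1})."""
    return np.convolve(x, h)[::2], np.convolve(x, g)[::2]

def full_dwpt(x, h, g, levels):
    """Full (non-adaptive) DWPT: split every node at every level.
    Returns a dict mapping node (l, n) to its subband coefficients."""
    nodes = {(0, 0): np.asarray(x, dtype=float)}
    for l in range(levels):
        for n in range(2 ** l):
            lf, hf = analysis_split(nodes[(l, n)], h, g)
            nodes[(l + 1, 2 * n)] = lf
            nodes[(l + 1, 2 * n + 1)] = hf
    return nodes

# Haar pair, chosen only to keep the sketch self-contained; any FIR
# wavelet pair h(z), g(z) (e.g. db4) can be substituted.
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)
tree = full_dwpt(np.random.randn(1024), h, g, levels=3)
# Node (l, n) covers the normalised band (n * 2**-l, (n + 1) * 2**-l) * f_N.
```

An adaptive DWPT, as described in the next section, simply stops splitting a node when a cost criterion says the children are not worth it.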

#### **3. Dynamic transformation of DWPT decomposition**

We present an adaptive DWPT tree derived via DAT. The principle behind DAT is to define parameters of the input signal (subband entropy) and of the output sequences (subband rate) for the given embedded processor architecture. In other words, the DAT technique constructs a minimum-cost subband decomposition of the DWPT by maximizing the minimum masking threshold (which is limited by the perceptual entropy, $PE$) in every subband for the given embedded processor architecture and temporal resolution. To achieve this, we assume that the tree structure of the DWPT decomposition is adapted, as closely as possible, to the critical bands ($CB\text{-}WPD$: $(l, n) \in E_{CB}$), as shown in [14]. For a DWPT tree structure $E_i$, the information density $H_{E_i}$ of the tree is estimated as

$$H_{E_i} = \sum_{\forall (l,n) \in E_i} \sum_k w_{E_i}(k) \cdot \ln \big( w_{E_i}(k) \big), \tag{1}$$


where

$$w_{E_i}(k) = \frac{|x_{l,n,k}|}{\sum_{\forall (l,n) \in E_i} \sum_k |x_{l,n,k}|}, \tag{2}$$

here $x_{l,n,k}$ are the wavelet coefficients, $l$ is the decomposition level, $n$ is the node number within the decomposition level, and $k$ is the index of the current wavelet coefficient of node $(l, n)$. $H_{E_i}$ is estimated from the wavelet coefficients of the terminated nodes (the grey nodes in figure 3).

The decision to grow the DWPT tree based on the given $H$, i.e. to allow further decomposition of the WP tree, can be expressed as:

$$H_{E_i} < H_{E_{i-1}}. \tag{3}$$

If (3) is true, we continue the subband splitting process in the DWPT tree; otherwise, the suboptimal decomposition for the given signal frame has been found.
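As a small illustration, the sketch below computes the weights of Eq. (2) over the leaf-node coefficients and evaluates the entropy sum literally as printed in Eq. (1); some formulations negate this sum, so the sign convention is an assumption here.

```python
import numpy as np

def information_density(leaf_coeffs):
    """H_{E_i} over the terminated (leaf) nodes of tree E_i, per Eqs. (1)-(2).
    `leaf_coeffs` is a list of 1-D wavelet-coefficient arrays, one per leaf."""
    mags = np.concatenate([np.abs(np.asarray(c)) for c in leaf_coeffs])
    w = mags / mags.sum()                 # Eq. (2): weights normalised over the tree
    w = w[w > 0]                          # zero weights contribute nothing to the sum
    return float(np.sum(w * np.log(w)))   # Eq. (1) as printed; sign convention may differ

def keep_growing(H_curr, H_prev):
    """Eq. (3): keep splitting while the information density keeps decreasing."""
    return H_curr < H_prev
```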

The subband splitting process is managed based on the estimated values of $PE$ in the parent and child nodes of the current DWPT tree structure. The $PE$ estimation is described in [17],[18],[19] and expressed as

$$PE_{l,n} = \sum_{k=0}^{K_{l,n}-1} \log_2 \big( 2\, \mathrm{nint}(SMR_{l,n,k}) + 1 \big), \tag{4}$$


where $SMR_{l,n,k}$ is the ratio between the absolute value of the wavelet coefficient $x_{l,n,k}$ in a subband of tree $E_i$ (node $(l, n)$) and the corresponding masking threshold $T_{l,n}$, which is linearly spread among the $K_{l,n}$ coefficients $x_{l,n,k}$, $k = 0, \ldots, K_{l,n}-1$, of node $(l, n)$. A large magnitude of $SMR_{l,n,k}$ marks node $(l, n)$ as significant for the $PE$ formation.

Each allowed parent node $(l, n)$ is split into two child nodes $(l+1, 2n)$ and $(l+1, 2n+1)$ if and only if the sum of $PE_{l+1,2n}$ and $PE_{l+1,2n+1}$ in the child nodes is less than $PE_{l,n}$ in the current node, which can be written as:

$$PE_{l,n} > PE_{l+1,2n} + PE_{l+1,2n+1}. \tag{5}$$
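A hedged sketch of the splitting rule in (4)-(5) follows; the masking thresholds $T_{l,n}$ are assumed to be supplied by a psychoacoustic model, which is outside the scope of the sketch.

```python
import numpy as np

def perceptual_entropy(coeffs, threshold):
    """PE_{l,n} per Eq. (4). `threshold` is the masking threshold T_{l,n},
    spread linearly over the K_{l,n} coefficients of node (l, n); how it
    is obtained (the psychoacoustic model) is outside this sketch."""
    smr = np.abs(np.asarray(coeffs)) / threshold        # SMR_{l,n,k}
    return float(np.sum(np.log2(2.0 * np.rint(smr) + 1.0)))

def should_split(parent, lf_child, hf_child, t_parent, t_lf, t_hf):
    """Eq. (5): split node (l, n) into (l+1, 2n) and (l+1, 2n+1) only if
    the children's total PE is lower than the parent's."""
    return perceptual_entropy(parent, t_parent) > (
        perceptual_entropy(lf_child, t_lf) + perceptual_entropy(hf_child, t_hf))
```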


Schematically, the DWPT tree-growing process is shown in figure 3. An example of a dynamic DWPT tree structure growing level by level based on $H$, together with the corresponding time-frequency tiling map, is shown in figure 4.

**Figure 3.** DWPT tree growing process

**Figure 4.** DWPT tree structure creation and corresponding time-frequency tiling map

Applying the information density $H$, the perceptual entropy $PE$, the limited WP tree structure $CB\text{-}WPD$ and the maximum allowed computational resource together in the DWPT growing procedure allows us to find a suboptimal solution for analyzing the input signal on the given hardware architecture.

#### **4. DWPT implementation based on the lifting scheme**

#### **4.1. Factoring wavelet filters into lifting steps**

In the tree-based scheme of the DWPT, each node of the tree consists of a two-channel filter bank. Each node can be broken down into a finite sequence of simple filtering steps, called lifting steps or ladder structures. In [20], a method is proposed for moving a two-channel filter bank from an implementation based on FIR filters to an architecture based on the lifting scheme. The decomposition is essentially a factorization of the polyphase matrix of the wavelet filters into elementary matrices. As discussed in [20], the lifting scheme consists of three phases: the first step splits the data into two subsets, even and odd; the second step computes the wavelet (high-pass) coefficients as the failure to predict the odd set based on the even set; finally, the third step updates the even set using the wavelet coefficients to compute the scaling-function (low-pass) coefficients. This method roughly halves the number of multiplications and additions. In terms of the $z$-transform, the transition to a filter bank implementation based on the lifting scheme can be viewed in two steps.
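For concreteness, the three phases can be illustrated with the simplest possible case, the Haar wavelet, in a few lines of Python; this is only a toy example of the split/predict/update pattern, not the db4 factorization used later in the chapter.

```python
import numpy as np

def haar_lifting_analysis(x):
    """The three lifting phases from [20], shown for the Haar wavelet:
    split into even/odd, predict the odd set from the even set (the
    prediction error becomes the high-pass output), then update the even
    set to obtain the low-pass output. db4 needs more predict/update pairs."""
    even, odd = x[0::2].astype(float), x[1::2].astype(float)
    d = odd - even          # predict: high-pass = failure to predict odd from even
    s = even + d / 2.0      # update: low-pass (scaling) coefficients
    return s, d

def haar_lifting_synthesis(s, d):
    """Inverse lifting: undo the steps in reverse order with opposite signs."""
    even = s - d / 2.0
    odd = d + even
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x

x = np.random.randn(16)
s, d = haar_lifting_analysis(x)
assert np.allclose(haar_lifting_synthesis(s, d), x)   # perfect reconstruction
```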

The first step is to move to a polyphase implementation of the filtering algorithm [20]. The process of calculating the LF and HF components of the signal $x_{l,n,k}$ in any node of the tree can be written as:

$$\begin{bmatrix} X_{l+1,2n}(z) & X_{l+1,2n+1}(z) \end{bmatrix} = \begin{bmatrix} X_{l,n}^{e}(z) & z^{-1} X_{l,n}^{o}(z) \end{bmatrix} \cdot \tilde{\mathbf{P}}, \tag{6}$$

where $X_{l+1,2n}(z)$ and $X_{l+1,2n+1}(z)$ are the $z$-representations of the low- and high-frequency components, $X_{l,n}^{e}(z)$ and $X_{l,n}^{o}(z)$ are the $z$-representations of the sequences consisting, respectively, of the even and odd samples of the input sequence $x_{l,n,k}$, and $\tilde{\mathbf{P}}$ is a polyphase matrix that can be written as


$$\tilde{\mathbf{P}} = \begin{bmatrix} \tilde{h}_e(z) & \tilde{g}_e(z) \\ \tilde{h}_o(z) & \tilde{g}_o(z) \end{bmatrix}. \tag{7}$$

The elements $\tilde{h}_e(z)$, $\tilde{h}_o(z)$ and $\tilde{g}_e(z)$, $\tilde{g}_o(z)$ of the polyphase matrix in (7) are related to the original coefficients of the low-pass $\tilde{h}(z)$ and high-pass $\tilde{g}(z)$ wavelet filters as follows:

$$\tilde{h}(z) = \tilde{h}_e(z^2) + z^{-1} \tilde{h}_o(z^2), \tag{8}$$

$$\tilde{g}(z) = \tilde{g}_e(z^2) + z^{-1} \tilde{g}_o(z^2). \tag{9}$$

This approach does not by itself reduce the computational cost, but in a hardware implementation it allows the operating frequency to be halved relative to the input data rate thanks to parallel computation (see figure 5).

**Figure 5.** The transition to the polyphase implementation of the analysis filter bank
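The equivalence that makes this halving possible, namely that filtering followed by down-sampling equals polyphase filtering of the even and odd input streams per (6), (8) and (9), can be checked numerically with a short sketch (illustrative filter taps, full-convolution boundary handling):

```python
import numpy as np

def direct_subband(h, x):
    """Filter then down-sample by two: the left side of the identity."""
    return np.convolve(h, x)[::2]

def polyphase_subband(h, x):
    """Polyphase form per Eqs. (6), (8): run the even/odd tap subfilters
    h_e, h_o on the even/odd input streams at half the input rate."""
    he, ho = h[0::2], h[1::2]            # h(z) = h_e(z^2) + z^-1 h_o(z^2)
    xe, xo = x[0::2], x[1::2]            # even / odd input samples
    a = np.convolve(he, xe)
    b = np.convolve(ho, xo)
    out = np.zeros(max(a.size, b.size + 1))
    out[:a.size] += a
    out[1:b.size + 1] += b               # the z^-1 delay on the odd branch
    return out

h = np.array([0.23, 0.71, 0.63, -0.03])  # illustrative FIR taps, not db4
x = np.random.randn(64)
d, p = direct_subband(h, x), polyphase_subband(h, x)
n = min(d.size, p.size)
assert np.allclose(d[:n], p[:n])
```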


|       | $s_1(z)$ | $s_2(z)$ | $s_3(z)$ | $t_1(z)$ | $t_2(z)$ |
|-------|----------|----------|----------|----------|----------|
| $b_0$ | 3.1029   | -5.1995  | 0.3141   | 0.0763   | -3.1769  |
| $b_1$ | 0        | 1.6625   | 0        | -0.2920  | -0.0379  |
| $u$   | 0        | -1       | -3       | 1        | 3        |

**Table 1.** The parameters of the lifting scheme for db4 (8 taps)

The second step is the factorization of the polyphase matrix into simpler triangular matrices. The result, in general, is that the original matrix $\tilde{\mathbf{P}}$ can be expressed as

$$\tilde{\mathbf{P}} = \prod_{i=1}^{I/2} \begin{bmatrix} 1 & \tilde{s}_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \tilde{t}_i(z) & 1 \end{bmatrix} \begin{bmatrix} c_1 & 0 \\ 0 & c_2 \end{bmatrix}, \tag{10}$$


where $I$ is the number of elementary triangular matrices derived from the factorization of the polyphase matrix; $\tilde{s}_i(z)$ and $\tilde{t}_i(z)$ are low-order polynomials; and $c_1$, $c_2$ are real coefficients. In general, the polynomials $\tilde{s}_i(z)$ and $\tilde{t}_i(z)$ can be represented as $(b_0 + b_1 z^{-1}) z^{u}$, where $b_0$, $b_1$ are constants and $u$ is an integer exponent. For example, the $b_0$, $b_1$ and $u$ parameters of the lifting scheme for the db4 mother wavelet function are presented in table 1; $K_1$ and $K_2$ are equal to -0.1202 and -8.3192, respectively. For the fixed-point DWPT implementation, an arithmetic with an arbitrary number of integer and fractional bits is used, as proposed in [21],[22]. The advantage of this number representation is that it can be realized using conventional integer arithmetic resources.
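As a structural illustration only, the sketch below applies a cascade of lifting polynomials of the form $(b_0 + b_1 z^{-1}) z^{u}$ with the Table 1 parameters to the even/odd streams, following the matrix order of the factorization; circular shifts stand in for proper border handling, and the exact sign and delay conventions of the authors' implementation are not fully recoverable from the text, so bit-exact agreement with a direct db4 filter bank is not claimed.

```python
import numpy as np

# Table 1 parameters for db4: each lifting polynomial is (b0 + b1 z^-1) z^u.
# The order of application follows the factorization given later in Eq. (16).
DB4_STEPS = [
    ("s", 3.1029,  0.0,     0),   # s1(z)
    ("t", 0.0763, -0.2920,  1),   # t1(z)
    ("s", -5.1995, 1.6625, -1),   # s2(z)
    ("t", -3.1769, -0.0379, 3),   # t2(z)
    ("s", 0.3141,  0.0,    -3),   # s3(z)
]
C1, C2 = -0.1202, -8.3192         # final scaling (K1, K2 in the text)

def lifting_poly(v, b0, b1, u):
    """Apply (b0 + b1 z^-1) z^u to a sequence; circular shifts replace
    proper border handling, so this is only a structural illustration."""
    return b0 * np.roll(v, -u) + b1 * np.roll(v, -u + 1)

def db4_lifting_analysis(x):
    """One analysis step per Eq. (10): s-steps update the odd channel,
    t-steps update the even channel, then both channels are scaled.
    The input length must be even."""
    e, o = x[0::2].copy(), x[1::2].copy()
    for kind, b0, b1, u in DB4_STEPS:
        if kind == "s":
            o = o + lifting_poly(e, b0, b1, u)   # upper-triangular factor
        else:
            e = e + lifting_poly(o, b0, b1, u)   # lower-triangular factor
    return C1 * e, C2 * o                        # LF and HF outputs
```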

The scaling coefficients $X_{l+1,2n}(z)$ and wavelet coefficients $X_{l+1,2n+1}(z)$ produced by the two-channel analysis filter bank from the input signal $X_{l,n}(z)$ in the $z$ domain can, according to (6) and (10), be written as follows:

$$\begin{bmatrix} X_{l+1,2n}(z) & X_{l+1,2n+1}(z) \end{bmatrix} = \begin{bmatrix} X_{l,n,e}(z) & X_{l,n,o}(z) \end{bmatrix} \times \prod_{i=1}^{I/2} \begin{bmatrix} 1 & \tilde{s}_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \tilde{t}_i(z) & 1 \end{bmatrix} \begin{bmatrix} c_1 & 0 \\ 0 & c_2 \end{bmatrix}, \tag{11}$$

where $X_{l,n,e}(z)$ and $X_{l,n,o}(z)$ are the $z$-representations of the two sequences consisting of the even and odd samples of the input signal $x_{l,n,k}$. The block diagram for the direct implementation of the two-channel analysis filter bank based on the lifting scheme is shown in figure 6.

**Figure 6.** Block diagram of the two-channel analysis filter bank based on the lifting scheme

The inverse decomposition of a two-channel filter bank in the same terms can be expressed as


$$\begin{bmatrix} \hat{X}_{l,n}^{\,e}(z) & \hat{X}_{l,n}^{\,o}(z) \end{bmatrix} = \begin{bmatrix} X_{l+1,2n}(z) & X_{l+1,2n+1}(z) \end{bmatrix} \begin{bmatrix} 1/c_1 & 0 \\ 0 & 1/c_2 \end{bmatrix} \prod_{i=I/2}^{1} \begin{bmatrix} 1 & 0 \\ -\tilde{t}_i(z) & 1 \end{bmatrix} \begin{bmatrix} 1 & -\tilde{s}_i(z) \\ 0 & 1 \end{bmatrix}, \tag{12}$$

and the corresponding block diagram of the two-channel synthesis filter bank based on the lifting scheme is shown in figure 7.

**Figure 7.** Block diagram of the two-channel synthesis filter bank based on the lifting scheme


According to the block diagram (see figure 7), the synthesis procedure is implemented as follows: first, the input coefficients $X_{l+1,2n}(z)$ and $X_{l+1,2n+1}(z)$ of each channel are multiplied by the coefficients $1/c_1$ and $1/c_2$, respectively; second, the two-channel synthesis filter bank implements the inverse of the analysis algorithm's operations. This implementation uses the same polynomials $\tilde{s}_i(z)$ and $\tilde{t}_i(z)$ from (10), with opposite signs. The reconstructed signal $X_{l,n}(z)$ is obtained from the computed sequences $\hat{X}_{l,n}^{\,e}(z)$ and $\hat{X}_{l,n}^{\,o}(z)$.

Together, the analysis and synthesis filter bank implementations based on the lifting scheme require the same number of operations (additions and multiplications) as an analysis-only implementation based on regular FIR filters.

#### **4.2. Algorithm implementation based on fixed-point variable-format arithmetic**

A number of target application requirements (real-time operation, higher throughput, and others) make it necessary to use fixed-point arithmetic to perform the specified computation. When a two-channel filter bank based on lifting structures is implemented in integer arithmetic, a number of difficulties arise because the coefficients of the polynomials $s_i(z)$ and $t_i(z)$ can take both fractional and large integer values (a well-known negative effect of polyphase matrix factorization). This leads to an increase in the number of arithmetic units and in the word length of the internal registers. Therefore, in this paper the DWPT algorithm is implemented in fixed-point arithmetic using the approach of [21],[22], in which the format of the numbers involved in the intermediate computation is variable. This method assumes that the number of bits allocated to the integer and fractional parts of the numbers differs between the different nodes of the algorithm. In accordance with this approach, any number represented in two's complement fixed-point format is given in the form:

$$a = m_a \cdot 2^{exp_a}, \quad \text{where} \quad m_a = (-1) \cdot s + \sum_{i=0}^{wl-2} a_i \cdot 2^{i-wl+1}. \tag{13}$$

Here, $m_a$ is the value of the number represented in two's complement code, interpreted as a fraction in the range $[-1, 1)$; $exp_a$ is the order of the scaling factor $2^{exp_a}$; $a_i$ is the value of the $i$-th bit of the number, equal to 0 or 1; $s$ is the sign bit; and $wl$ is the word length. Thus, for the intermediate data in the different nodes of the algorithm, the value $exp_a$ is determined. Depending on the $exp_a$ value, the bits are redistributed between the fractional and integer parts of the number at different points of the algorithm (see figure 8).

**Figure 8.** Data format in fixed-point arithmetic with variable word length

For the format given in (13), the operations of addition and multiplication of $a$ and $b$ ($exp_b \geq exp_a$) are defined as

$$c = a + b = m_c \cdot 2^{exp_c} = \left( m_a \cdot 2^{exp_a - exp_b} + m_b \right) \cdot 2^{exp_b}, \tag{14}$$

$$c = a \cdot b = m_c \cdot 2^{exp_c} = m_a \cdot m_b \cdot 2^{exp_a + exp_b}. \tag{15}$$

Figure 9 schematically illustrates the operations described above in (14) and (15).
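A minimal Python model of this variable-format arithmetic is given below; the word length is chosen arbitrarily for illustration, and the functions mirror the mantissa alignment of (14) and the mantissa product of (15) using integer shifts.

```python
# Each number is a pair (m, exp), per Eq. (13): m is a wl-bit two's-complement
# mantissa interpreted as a fraction in [-1, 1), and the value is
# m * 2**(exp - (wl - 1)).
WL = 16                      # word length; illustrative choice
FRAC = WL - 1                # mantissa fraction bits

def to_fx(value, exp):
    """Quantise a float into the (mantissa, exp) format of Eq. (13)."""
    return int(round(value / 2.0 ** exp * 2 ** FRAC)), exp

def to_float(m, exp):
    return m * 2.0 ** (exp - FRAC)

def fx_add(a, b):
    """Eq. (14): align the smaller exponent to the larger one with an
    arithmetic right shift of its mantissa, then add the mantissas."""
    (ma, ea), (mb, eb) = a, b
    if eb < ea:                       # ensure exp_b >= exp_a as in the text
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    return (ma >> (eb - ea)) + mb, eb

def fx_mul(a, b):
    """Eq. (15): multiply mantissas, add exponents, renormalise to wl bits."""
    (ma, ea), (mb, eb) = a, b
    return (ma * mb) >> FRAC, ea + eb

x, y = to_fx(0.40, 2), to_fx(-1.25, 3)
print(to_float(*fx_add(x, y)), to_float(*fx_mul(x, y)))  # approx -0.85 and -0.5
```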

In this paper, a generic set of processing elements is proposed to implement the analysis and synthesis banks on the lifting structures using the variable arithmetic format (see table 2). In this table, $x^e_k$, $x^o_k$ are the input values and $y^e_k$, $y^o_k$ are the output values, respectively, in the upper and lower channels of the analysis (synthesis) filter bank at the $k$-th time instant. The parameters $s_0$, $s_1$, $s_2$ define the arithmetic shift values; they are computed according to (14) and (15) for each node of the algorithm.


**Figure 9.** Performing operations of a) addition and b) multiplication


**Table 2.** A set of processing elements for the lifting-structure-based algorithm using the variable arithmetic format

| Symbol | Computational operations | Symbol | Computational operations |
|---|---|---|---|
| S1 | ye_k = xe_k; yo_k = (xo_k + ((xe_k · mb0) ≫ s1)) ≪ s0 | S1⁻¹ | ye_k = xe_k; yo_k = (xo_k − ((xe_k · mb0) ≫ s1)) ≪ s0 |
| S2 | ye_k = xe_k; yo_k = (xo_k + ((xe_k · mb0) ≫ s1) + ((xe_{k−1} · mb1) ≫ s2)) ≪ s0 | S2⁻¹ | ye_k = xe_k; yo_k = (xo_k − ((xe_k · mb0) ≫ s1) − ((xe_{k−1} · mb1) ≫ s2)) ≪ s0 |
| T1 | yo_k = xo_k; ye_k = (xe_k + ((xo_k · mb0) ≫ s1)) ≪ s0 | T1⁻¹ | yo_k = xo_k; ye_k = (xe_k − ((xo_k · mb0) ≫ s1)) ≪ s0 |
| T2 | yo_k = xo_k; ye_k = (xe_k + ((xo_k · mb0) ≫ s1) + ((xo_{k−1} · mb1) ≫ s2)) ≪ s0 | T2⁻¹ | yo_k = xo_k; ye_k = (xe_k − ((xo_k · mb0) ≫ s1) − ((xo_{k−1} · mb1) ≫ s2)) ≪ s0 |

To explain the practical aspects of using the proposed arithmetic, an example of a two-channel analysis bank realization for the mother wavelet function db4 [23] is considered below. As a result of the polyphase matrix factorization for the given wavelet basis, the following expression was obtained in the MATLAB environment:

$$\tilde{\mathbf{P}} = \begin{bmatrix} 1 & b_1^0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \left(b_2^0 + b_2^1 z^{-1}\right) z & 1 \end{bmatrix} \begin{bmatrix} 1 & \left(b_3^0 + b_3^1 z^{-1}\right) z^{-1} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \left(b_4^0 + b_4^1 z^{-1}\right) z^{3} & 1 \end{bmatrix} \begin{bmatrix} 1 & b_5^0 z^{-3} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} c_1 & 0 \\ 0 & c_2 \end{bmatrix} \tag{16}$$

The coefficients *b<sub>m</sub><sup>n</sup>*, together with their parameters *mb* and *expb* calculated in accordance with (13), are presented in table 3.


**Table 3.** The lifting structure parameters calculated for the wavelet filters db4 (8 taps)

| Type | *u* | *b<sub>i</sub><sup>0</sup>* value | *mb* | *expb* | *b<sub>i</sub><sup>1</sup>* value | *mb* | *expb* |
|---|---|---|---|---|---|---|---|
| s1(z) | 0 | -3.1029 | -0.7757 | 2 | 0 | 0 | - |
| t2(z) | 1 | -0.0763 | -0.6104 | -3 | 0.2920 | 0.5840 | -1 |
| s3(z) | -1 | 5.1995 | 0.6499 | 3 | -1.6625 | -0.8313 | 1 |
| t4(z) | 3 | 3.1769 | 0.7942 | 2 | 0.0379 | 0.6064 | -4 |
| s5(z) | -3 | 0.3141 | 0.6282 | -1 | 0 | 0 | - |

Figure 10 shows a block diagram of the implementation of the two-channel analysis filter bank for this example. In this scheme, apart from the computing elements *S*1, *S*2, *T*2 (see table 2), delay registers (elements *z*<sup>-*l*</sup>, *l* ∈ Z) are inserted in the upper channel of the bank to satisfy the causality condition.


**Figure 10.** Block diagram of two-channel filter bank based on the lifting structures for db4 (see table 2)

In figure 11a the first step of the analysis bank realization is considered in more detail, and in figure 11b the last step of the synthesis bank realization is shown.

**Figure 11.** First step of the analysis (a) and the last step of the synthesis (b) bank implementation

As can be seen from figures 10 and 11, the computing units of the analysis and synthesis procedures differ, in terms of implementation, only in the signs of the constant multiplying coefficients and in the positions and directions of the arithmetic shifts.

Based on the materials described above, a concrete realization of a two-channel bank can be represented as a vector of parameters containing a set of multiplier constants, shift parameters and some additional information about the delay elements in the intermediate nodes of the algorithm.

#### **4.3. Accuracy analysis of the algorithm for fixed-point variable format**

To analyse the proposed approach, a MATLAB function library was written that simulates the fixed-point computations in a filter bank with a specified tree structure. Figure 12 shows the estimated variance of the signal reconstruction error as a function of the internal register word length after passing through a two-channel analysis/synthesis filter bank (wavelet filters db8 were used in this example). The figure also shows the results of the same experiment using the FIR filters underlying the DWPT algorithm. It can be noted that the FIR implementation gives better results for the same register word length, but requires twice as many calculations. Thus, to reach an error variance level of -70 dB, the lifting-based implementation requires 16 bits, approximately two bits more than the FIR-based realization. This drawback, however, is compensated by a significant reduction in arithmetic operations compared to the direct implementation, so we conclude that the proposed approach is more efficient for hardware implementation than a bank based on FIR filters.


**Figure 12.** The variance of the reconstruction signal error as a function of register bit capacity in analysis/synthesis two-channel filter bank systems (wavelet filters db8): the solid line indicates the results for the proposed algorithm on the lifting structures, the dotted line an integer implementation of the same system using a FIR filter

Below we consider another experiment, demonstrating the energy localization properties of our fixed-point DWPT algorithm realization. A polyharmonic signal was generated and passed through a five-level fast wavelet transform decomposition tree (the tree is split only in the low-frequency components). The db2 family of wavelet functions was chosen as an example. The whole range of wavelet coefficient amplitudes was divided by 40 thresholds (these values are plotted on the X-axis of figure 13). Each threshold has been mapped to a vector of the obtained analysis wavelet coefficients, on condition that the coefficients are greater than this threshold; otherwise the coefficient values were replaced by zeros (i.e. in each vector the values insignificant relative to the given threshold were discarded). For all vectors, the reconstruction procedure was performed by the synthesis filter bank. Figures 13a and 13b show, for the floating-point and fixed-point implementations respectively: the solid line, the ratio of reconstructed to original signal energy (in percent) depending on the threshold value; the dotted line, the percentage of "discarded" wavelet coefficients depending on the chosen threshold.


Based on these results, we note almost complete agreement of the floating-point model with the proposed fixed-point approach. Thus, the fixed-point variable-format DWPT algorithm implementation preserves the energy localization inherent to the wavelet packets.

**Figure 13.** Energy estimates of the signal reconstruction, depending on the threshold of significant wavelet coefficients, for the floating-point (a) and the proposed fixed-point (b) DWPT algorithm implementations

#### **5. DWPT pipeline processor with dynamic reconfigurable architecture**

#### **5.1. DAT based reconfigurable signal processing system**

The structure of the reconfigurable DSP system for signal analysis based on the DAT approach consists of a specific microprocessor oriented to signal processing (the DSP microprocessor) and the DWPT processor itself with its reconfigurable architecture. The DSP microprocessor performs several tasks: processing the wavelet coefficients *Xl*,*n*,*k* in the subbands (*l*, *n*) corresponding to the current DWPT tree structure *Ei*; estimating *HEi* and *PEl*,*n*; and obtaining the reconfiguration vector *rl*,*n*, (*l*, *n*)∈*Ei*, for the DWPT processor. The DWPT processor is realized on a pipeline architecture with dynamic reconfiguration for implementing the adaptive DWPT. The length of the pipeline is determined by the limited DWPT tree structure (*CB*-*WPD*). The strong dependence of the process on the DWPT tree structure leads to the necessity of introducing an easily reconfigurable parallel-pipeline structure with computation resource *C*. Thus, the DSP system for audio processing based on the DAT approach is organized as shown in figure 14.

**Figure 14.** DAT-based reconfigurable signal processing system


The pipeline architecture is applied for effective implementation of the DWPT algorithm. We suggest a pipeline architecture for constructing the lifting-based DWPT processor. This architecture is a sequential connection of homogeneous blocks (a buffer/switch unit (BSU) and a processing unit (PU)) implementing a two-channel filter bank, which allows parallel calculation of the DWPT with an arbitrary tree structure. The maximum number of decomposition levels that can be realized is 8; it is associated with the depth of *CB*-*WPD*. The basic decomposition step of the DWPT is expressed as a PU, which acts as a two-channel filter bank based on the lifting scheme. Decoding the reconfiguration vector *rl*,*n*, memory address generation, PU enabling, data exchange control and pipeline synchronization are performed by the control units (CU). All these functions are carried out in parallel in the CU at each DWPT processor stage. The pipeline DWPT processor stages are synchronized according to the DAT techniques.

#### **5.2. DWPT lifting based pipeline Processing Unit (PU)**

The block diagram of the PU is shown in figure 15. In the PU, the input sequence *xl*,*n*,*k* is split into even *Xe* and odd *Xo* samples before the processing starts, according to the lifting scheme. The structure of the PU uses the following abbreviations (see figure 15): *wl* is the bit capacity; *I* is the number of elementary steps of the lifting scheme; *VPE* is a vector whose elements are the parameter sets of the corresponding elementary steps of the lifting scheme; *VBUF* is a vector whose elements specify the number of delays in the upper and lower channels after the corresponding elementary step. These elements correspond to FIFO registers that, on the one hand, are the delay elements *z*<sup>-1</sup> of the algorithm and, on the other hand, make a pipelined realization of the architecture possible, increasing throughput. The coefficients *c*1 and *c*2 are applied to the result of the lifting scheme as described in (11). The estimated hardware resources required for PU implementation are shown in table 4.


**Figure 15.** Block diagram of the PU


**Table 4.** Estimation of hardware resources for PU implementation (*N* is the number of filter taps)

| Resource type | Utilized |
|---|---|
| Multipliers (*wl*×*wl*) | *N*+2 |
| Adders (*wl*-bit capacity) | *N* |
| Registers (*wl*-bit capacity) | *N*+1 |
| Multiplexers 2-in-1 (*wl*-bit capacity) | 1 |

#### **5.3. Buffer/Switch Unit (BSU)**

The BSU realizes a double buffering scheme known as "ping-pong" to provide parallel access to the data, storing results and fetching source data from/for the PU. An additional channel outputs the result data. The two output sample streams *xl*+1,2*n*,*k* and *xl*+1,2*n*+1,*k* from the *l*-th PU are stored in the BSU while, simultaneously, the (*l*+1)-th PU can fetch the samples for the next processing stage. A unified block diagram of the BSU is represented in figure 16. Each BSU in the parallel-pipeline architecture addresses a different memory size that depends on the DWPT decomposition level.

The memory amount *MV* (taking into account the double-buffering requirement) and the number of processing units *L* can be expressed as

$$M_V = 2 \cdot \sum_{j=1}^{J} \frac{K}{2^{l_j}}, \qquad L = \max_{j=1..J} l_j \tag{17}$$

where *J* is the number of all nodes of *CB*-*WPD*, *l<sub>j</sub>* is the decomposition level of node *j*, and *K* is the initial frame length of the input signal.


**Figure 16.** Unified block diagram of the BSU


Figure 17 shows an example of the tree decomposition given by the set of nodes {(0,0), (1,0), (1,1), (2,0), (2,1), (2,2), (2,3), (3,0), (3,1)}. It also schematically illustrates the principle of distributing the blocks of memory for this tree structure.

**Figure 17.** The parallel pipeline architecture for three-level DWPT tree structure

#### **5.4. Rapid prototyping algorithm of pipeline DWPT processor**

The prototype of the DWPT processor can be specified by parameters describing the structure of the two-channel filter bank and by the vector that defines the limit tree decomposition.

The method of rapid prototyping can be described by the following sequence of actions.

**1.** Calculating the lifting structure of the dual filter bank based on the original wavelet basis functions.

**2.** Translating the mathematical model to fixed-point arithmetic subject to the accuracy requirements and the limitations of hardware resources (register and computing-unit bit widths).

**3.** Forming a parameters vector to configure a DWPT processor prototype.

**4.** Estimating the cost of the hardware prototype implementation.

**5.** Estimating the computation characteristics of the DWPT processor prototype.

**6.** Generating the output files of the synthesized VHDL description of the DWPT processor.


#### **5.5. FPGA based hardware implementation of the pipeline DWPT processor**

To estimate performance and resource utilization, the present architecture has been implemented on a Xilinx FPGA XC3S2000. The realized pipeline DWPT processor has the following features. The number of decomposition levels is limited to eight. The mother wavelet function db8 (16 taps), transformed into nine lifting steps, is used. The input and output data have a 16-bit word length; the internal computing capacity is 18 bits. The present implementation has no FIFO stages in the PU, which minimizes hardware resources. The processed frame size can be selected in the range from 128 to 1024 samples. Each BSU contains a pair of 1024×16-bit block RAMs used to realize the double buffering scheme. The PU hardware resource utilization is shown in table 5 and the complete processor implementation resources are presented in table 6.


**Table 5.** Estimation of hardware resources for the FPGA-based PU implementation

| Resource type | Utilized |
|---|---|
| 4 input look-up tables | 788 |
| flip-flops | 226 |
| MULT18x18s | 18 |


**Table 6.** Hardware resource estimations for the WP processor implementation on XC3S2000

| Resource type | Utilized, pcs. | Percentage, % |
|---|---|---|
| 4 input look-up tables | 31356 | 76 |
| flip-flops | 3037 | 7 |
| RAMB16 | 16 | 40 |
| MULT18x18s | 40 | 100 |

Figure 18 shows the prototype board of the dynamic reconfigurable pipeline DWPT processor.

The implemented design performance is 8 MSPS. Hence, if the sample rate of the input audio signal is 44100 Hz, the time cost of computing the wavelet coefficients is about 0.6% of the total time resource. For example, processing a 512-sample frame (~11.6 ms) takes approximately 0.064 ms on the presented lifting-based pipeline DWPT processor. The remaining time is distributed among the dynamic DWPT tree decomposition algorithms, wavelet coefficient post-processing and transfer operations.

**Figure 18.** Prototype board of the dynamic reconfigurable pipeline DWPT processor (DSP: TMS320C6713; FPGA: Xilinx XC3S2000; SDRAM; FLASH memory; ADC/DAC)


#### **5.6. DAT based dynamic reconfigurable architecture algorithm**

Suppose that for some audio input frame there is a space of tree structures *E*, which is processed by a stream-flow or parallel reconfigurable processor (*m*, *rl*,*n*), where *m* is the number of processor stages and *rl*,*n* is the processor reconfiguration parameter vector of the structure corresponding to the DWPT tree decomposition (*l*, *n*)∈*Ei*. The limit corresponds to the tree *CB*-*WPD*: (*l*, *n*)∈*ECB*. Next, on the basis of the growing algorithm described in section 3, the DWPT tree structures are formed, for example *E*1, *E*2, *E*3, for which the restrictions *ECB* are checked, as well as the calculated information density *HEi*. Based on that, if it turns out that *HE*3 < *HE*2 < *HE*1, the structure *E*3 corresponds to the required time-frequency resolution for processing the frame. The reconfigurable DWPT processor is determined by the current vector of reconfiguration parameters:

$$r_{l,n} = \left\{ \left(\alpha_1, \beta_0, \beta_1\right),\ \left(\alpha_2, \beta_0, \beta_1, \beta_2, \beta_3\right),\ \ldots,\ \left(\alpha_l, \beta_0, \ldots, \beta_n\right) \right\} \tag{18}$$

where *αl* and *βn* take the values 0 or 1.

The parameters *αl* determine the transition to a new decomposition level *l* of the DWPT tree, i.e. they enable signal processing in the next processor stage *m*:

$$\alpha_l = \begin{cases} 1, & \text{if } H_{E_i} < H_{E_{i-1}} \text{ and } E_i \notin E_{CB} \\ 0, & \text{otherwise} \end{cases} \tag{19}$$

In turn, the group of parameters *βn* covers the *n* nodes at level *l*:

$$\beta_n = \begin{cases} 1, & \text{if } PE_{l,n} > PE_{l+1,2n} + PE_{l+1,2n+1} \\ 0, & \text{otherwise} \end{cases} \tag{20}$$


Thus, the transition of signal processing according to the DWPT tree structure *Ei* on the processor architecture *Em*,*i* to processing on the architecture *Em*+1,*i*+1, in accordance with the tree structure *Ei*+1, is given by the vector of reconfiguration parameters *rl*,*n*, (*l*, *n*)∈*Ei*+1:

$$E_{m+1,i+1} = r_{l,n} \cdot E_{m,i}. \tag{21}$$

From the basic principles of psychoacoustics it follows that human perception of acoustic information is quite inert, from 5 ms to 300 ms; forward and backward masking lasts approximately 20 ms. With an input audio signal frame length of 5 ms and a processing delay determined by a single stage of the parallel pipeline processor, we can assume that the delay in processing the input signal over (*l* - 2) levels of the processor (the maximum value is *l* = 8 for *CB*-*WPD*) is much smaller than the temporal instability of the signal perceived by a human. This makes it possible to organize multi-frame processing on the basis of parallel pipeline processors, with reconfiguration of the DWPT processor structure determined by the variability of the current signal frame, i.e. the frame for which the cost function *HEi* is calculated.

**Figure 19.** The timing diagram of control signal changes for three consecutive frames of the audio signal

The profile over time of the parameters *αl* and *βn*, i.e. of the processor transformation vector *rl*,*n* in accordance with the tree structure (*l*, *n*)∈*Ei*, is shown in figure 19 for three consecutive frames of the audio signal. The DWPT tree structures drawn with dotted lines in figure 20 for the respective frames determine the options for their future growth in accordance with the obtained values of the perceptual entropy *PEl*,*n* at each node of the tree; a value of the information density *HEm*,*i* of the resulting DWPT decomposition tree, for example, indicates when further growth of the tree structure becomes ineffective.

**Figure 20.** DWPT tree structures for figure 19


Thus, the DWPT tree structures *Ei*, described by the nodes (*l*, *n*), as well as the corresponding DWPT processor reconfiguration vectors, are obtained according to the dynamic reconfiguration algorithm shown in figure 21 and can be written as follows:

for the 1st frame: *E*1 = {(1,0); (1,1), (2,0); (2,1); (2,2); (2,3)} and the vector is *r*1 = (1,1,1), (1,1,0,1,0), (0);

for the 2nd frame: *E*2 = {(1,0); (1,1), (2,0); (2,1)} and the vector is *r*2 = (1,1,0), (1,0,1,×,×), (0);

for the 3rd frame: *E*3 = {(1,0); (1,1), (2,2); (2,3)} and the vector is *r*3 = (1,0,1), (×,×,0,1), (0).

This algorithm of dynamic reconfiguration allows a suboptimal solution for the DWPT analysis to be obtained. The advantages of the above algorithm can be summarized as follows: the pruning method is top-down, and DWPT pruning can be viewed as a split process, i.e. the DWPT tree is constructed temporally for each signal frame, which is an ideal decision for real-time processing implemented in reconfigurable hardware.

The processing of the first nine frames in the pipeline DWPT processor is shown in figure 22, where *j* is the number of the frame loaded into the DWPT processor at the current time for processing according to the DWPT tree structure *Em*,*i* of the current frame *i*. The computation process at each stage of the pipeline DWPT processor is shown schematically with cubes, where a cube means the processing of a frame on the corresponding DWPT processor stage. The cube "Master" means that on this stage the current frame is used for creation of the actual DWPT tree structure. The cube "Slave" means that on this stage the current frame is processed according to the actual DWPT tree structure. The cube "Master (suboptimal decomposition)" means that



The complete input signal analysis in the DWPT processor is demonstrated in figure 23. The input signal is segmented into frames with minimal overlapping and analysed; the frame length in time is equal to 22.3 ms. At the same time, for the frames to be processed under the current tree structure *Em*,*i* on the *m* stages of the DWPT processor, a DSP processor monitors the execution of the following procedures: the masking threshold calculation algorithm as described in appendix A of [24] (procedure 1), the perceptual entropy *PEl*,*n* assessment based on (4)-(5) (procedure 2), and the estimation of the entropy *HE* of the DWPT tree structure according to (1)-(3) (procedure 3). The time schedule for the DSP processor (a 250 MHz, 32-bit floating-point DSP microprocessor) running the procedures listed above is shown in figure 24. The run times of procedure 1 and procedure 2 are shown in figures 25 and 26, respectively; as mentioned, the computational time is not constant but depends on the number of stages *m* of the DWPT processor involved in processing the input frames.

**Figure 24.** Time schedule of operation in the DSP processor

**Figure 25.** Run time of procedure 1, depending on the number of stages *m*

**Figure 26.** Run time of procedure 2, depending on the number of stages *m*

The output signal synthesis in the DWPT processor is demonstrated in figure 27. The monitoring system loads the input frame *i* into the appropriate level *m* of the DWPT processor. Moving frame *i* to the next stage of the processor is executed when the monitoring system takes the next frame *i*+1, which needs to enter that stage of the DWPT processor. To coordinate the work performed at each stage of the processor, it is necessary to introduce a delay that is a multiple of the processing time of one frame at one stage; the most rhythmic operation is provided by the parallel pipeline structure of the DWPT processor.

**Figure 27.** The output signal synthesis in DWPT processor

#### **6. Conclusion**


In this paper the dynamic reconfigurable lifting-based adaptive DWPT processor was presented. The lifting scheme halves the number of multiplications and summations and increases the processing speed. Applying the DAT-based approach as the design technique for time-varying DWPT decomposition allows us to construct a DWPT analysis dynamically adapted to the input signal. The reconfigurable system offers several advantages over competing alternatives: it is faster and smaller than general-purpose hardware solutions; it has a lower development cost than dedicated hardware solutions; dynamic reconfiguration supports multiple algorithms within a single application; and the multi-purpose architecture generates volume demand for a single hardware design. The proposed techniques optimize system performance and, in addition, provide a convenient framework within which on-going research in the areas of non-uniform filter banks applied to speech/audio coding algorithms and of reconfigurable architectures can be synergistically combined to enable the design of reconfigurable high-performance DSP systems.


Thus, the proposed dynamic reconfigurable DWPT processor with frame-based psychoacoustically optimized time-frequency tiling is successfully applicable to several applications, such as a monophonic full-duplex audio coding system [18] and scalable audio coding based on hybrid signal decomposition, where the transient part of the signal is modelled by a psychoacoustically motivated frame-based adaptive DWPT in a matching pursuit algorithm [24]. The advantages of this DWPT processor are best seen by considering the DWPT growing as a splitting process, i.e. the DWPT tree constructed temporally for each signal frame presents an ideal decision for real-time processing implemented in reconfigurable hardware.

#### **Acknowledgements**

This work was supported in part by the Belarusian republican fund for fundamental research under grants T04-217 and T08MC-040.

#### **Author details**

Alexey Petrovsky\*, Maxim Rodionov and Alexander Petrovsky\*

\*Address all correspondence to: {petrovsky,post-rodmax,palex}@bsuir.by

Department of Computer Engineering, Belarusian State University of Informatics and Radioelectronics, Minsk, Belarus

#### **References**


[1] Mallat, S. A theory of multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, July (1989), 11(7), 674-693.

[2] Coifman, R., & Wickerhauser, M. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, March (1992), 38, 712-718.

[3] Wickerhauser, M. V. Adapted wavelet analysis from theory to software. A K Peters, Wellesley, Massachusetts, (1994).

[4] Cohen, I., Raz, S., & Malah, D. Orthonormal shift-invariant adaptive wavelet packet decomposition and representation. Signal Processing, 57(3), March (1997). doi:S0165-1684(97)00007-8, 251-270.

[5] Wu, X., Li, Y., & Chen, H. Programmable wavelet packet transform processor. IEE Electronics Letters, (1999). doi:el:19990330, 35(6), 449-450.

[6] Trenas, M. A., Lopez, J., Sanchez, M., Arguello, F., & Zapata, E. L. Architecture for wavelet packet transform with best tree searching. Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP'00, (2000). doi:ASAP.2000.862399, 289-298.

[7] Trenas, M. A., Lopez, J., & Zapata, E. L. A configurable architecture for the wavelet packet transform. Journal of Signal Processing Systems, (2002). doi:A:1020221003822, 32(3), 255-273.

[8] Sweldens, W. The lifting scheme: a construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2), (1997), 511-546.

[9] Arguello, F., Lopez, J., Trenas, M. A., & Zapata, E. L. Architecture for wavelet packet transform based on lifting steps. Parallel Computing, (2002). doi:S0167-8191(02)00101-1, 28(7-8), 1023-1037.

[10] Aroutchelvame, S. M., & Raahemifar, K. Architecture of wavelet packet transform for 1-D signal. Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, CCECE'05, May (2005). doi:CCECE.2005.1557216, 1304-1307.

[11] Paya, G., Peiro, M. M., Ballester, F., Herrero, V., & Mora, F. Lifting folded pipelined discrete wavelet packet transform architecture. In: VLSI Circuits and Systems (Lopez, J. F., Montiel-Nelson, J. A., Pavlidis, D., eds.), Proceedings of the SPIE, (2003), 5117, 321-328.

[12] Wang, C., & Gan, W. S. Efficient VLSI architecture for lifting-based discrete wavelet packet transform. IEEE Transactions on Circuits and Systems II: Express Briefs, May (2007). doi:TCSII.2007.892410, 54(5), 422-426.

[13] Parhi, K. K. Algorithm transformation techniques for concurrent processors. Proceedings of the IEEE, December (1989), 77(12), 1879-1895.

[14] Petrovsky, Al., & Petrovsky, A. Dynamic algorithm transforms for reconfigurable real-time audio coding processor. Proceedings of the International Conference on Parallel Computing in Electrical Engineering, PARELEC'02, Warsaw, Poland, September 22-25, (2002). IEEE Computer Society Press, Los Alamitos, California. doi:PCEE.2002.1115317, 422-424.

[15] Ackenhusen, J. G. Real-time signal processing: design and implementation of signal processing systems. Prentice Hall, NJ, (1999).

[16] Villasenor, J., & Hutchings, B. The flexibility of configurable computing. IEEE Signal Processing Magazine, September (1998), 15(5), 67-84.


**Chapter 2**

**Low Computational Robust F0 Estimation of Speech**

The *F* 0 estimation determines a performance of speech processing such as speech coding, tonal speech recognition, speaker recognition, and speech enhancement. *F* <sup>0</sup> estimation named "YIN" has been proposed [1] and it is being prevalently used around the world due to its high performance and open-source policy. Speech processing is commonly applied in realistic noisy environments; hence, the performance is degraded seriously. It is well known that YIN does not perform well for noisy speech although it does perform best for clean speech. Accordingly, more robust *F* 0 estimation algorithm is desired and the robust *F* <sup>0</sup> esti‐ mation is long lasting problem in speech processing. We have already proposed robust *F* <sup>0</sup> estimation algorithm based on time-varying complex speech analysis for analytic speech sig‐ nal [2][3]. Analytic signal is a complex-valued signal in which its real part is speech signal and its imaginary part is Hilbert transform of the real part. Since the analytic signal provides the spectrum only on positive frequencies, the signals can be decimated by a factor of two with no degradation. As a result, the complex analysis offers attractive features, for exam‐ ple, more accurate spectral estimation in low frequencies. In [2] and [3], complex LPC resid‐ ual is used to calculate the criterion of weighted autocorrelation function (AUTOC) with a reciprocal of Average Magnitude Difference Function (AMDF) [6]. The complex residual is calculated from analytic speech signal by means of time-varying complex AR (TV-CAR) speech analysis method [4][5]. In [2], MMSE-based TV-CAR speech analysis [4] is intro‐ duced and in [3], ELS-based TV-CAR speech analysis [5] is introduced to calculate complex LPC residual signal. It has been reported in [2] that the method can estimate more accurate *F* 0 for IRS (Intermediate Reference System) filtered speech corrupted by white Gauss noise. Moreover, it has been reported in [3] that the ELS-based complex speech analysis can per‐ form better even for additive pink noise. Furthermore, in order to investigate the effective‐

> © 2013 Funaki and Higa; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. © 2013 Funaki and Higa; licensee InTech. This is a paper distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

**Based on TV-CAR Analysis**

Keiichi Funaki and Takehito Higa

http://dx.doi.org/10.5772/51694

**1. Introduction**

Additional information is available at the end of the chapter


### **Low Computational Robust F0 Estimation of Speech Based on TV-CAR Analysis**

Keiichi Funaki and Takehito Higa

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51694

> © 2013 Funaki and Higa; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **1. Introduction**


*F*0 estimation determines the performance of speech processing applications such as speech coding, tonal speech recognition, speaker recognition, and speech enhancement. The *F*0 estimator named "YIN" has been proposed [1] and is prevalently used around the world due to its high performance and open-source policy. Speech processing is commonly applied in realistic noisy environments, where performance degrades seriously; it is well known that YIN does not perform well for noisy speech, although it performs best for clean speech. Accordingly, a more robust *F*0 estimation algorithm is desired, and robust *F*0 estimation is a long-standing problem in speech processing. We have already proposed a robust *F*0 estimation algorithm based on time-varying complex speech analysis for the analytic speech signal [2][3]. An analytic signal is a complex-valued signal whose real part is the speech signal and whose imaginary part is the Hilbert transform of the real part. Since the analytic signal provides the spectrum only on positive frequencies, the signal can be decimated by a factor of two with no degradation. As a result, the complex analysis offers attractive features, for example, more accurate spectral estimation in the low frequencies. In [2] and [3], the complex LPC residual is used to calculate the criterion of a weighted autocorrelation function (AUTOC) with a reciprocal of the Average Magnitude Difference Function (AMDF) [6]. The complex residual is calculated from the analytic speech signal by means of the time-varying complex AR (TV-CAR) speech analysis method [4][5]. In [2], MMSE-based TV-CAR speech analysis [4] is introduced, and in [3], ELS-based TV-CAR speech analysis [5] is introduced to calculate the complex LPC residual signal. It has been reported in [2] that the method can estimate *F*0 more accurately for IRS (Intermediate Reference System) filtered speech corrupted by white Gauss noise. Moreover, it has been reported in [3] that the ELS-based complex speech analysis can perform better even for additive pink noise. Furthermore, in order to investigate the effectiveness of the time-varying analysis, the performance was compared per frame with respect to the degree of voiced nature [7]. The experiments using IRS filtered speech corrupted by white Gauss noise or pink noise demonstrate that ELS-based robust time-varying complex speech analysis performs better for stationary voiced speech, while ELS-based time-invariant speech analysis performs better for ordinary voiced frames. However, the computational cost becomes larger by introducing the time-varying analysis. In this paper, in order to reduce the computational cost, a pre-selection is introduced. The pre-selection is performed by peak picking of the speech spectrum based on the TV-CAR analysis [8]. The evaluation is carried out using the Keele Pitch Database [9]. The remainder of the chapter is organized as follows. In Section 2, TV-CAR speech analysis is explained: the analytic signal and the Time-Varying Complex AR (TV-CAR) model are introduced, and two kinds of TV-CAR parameter estimation algorithms from an analytic signal, viz., the MMSE and ELS methods, are explained. In Section 3, the *F*0 estimation algorithm is explained in detail, covering the sample-based pre-selection and the frame-based final-selection. In Section 4, experimental results are presented, confirming the effectiveness of the proposed method.

#### **2. TV-CAR speech analysis**

In this section, the ELS-based robust TV-CAR speech analysis method is explained. First, the analytic signal and the TV-CAR model are introduced; the analytic signal is the output of the TV-CAR model. In 2.5, the benefit of the robust TV-CAR analysis is illustrated by showing spectra estimated from natural speech.

#### **2.1. Analytic speech signal**

The target signal of the time-varying complex AR (TV-CAR) all-pole model is an analytic signal, i.e., a complex-valued signal defined as follows.

$$\mathbf{y}^c \left( t \right) = \frac{\mathbf{y} \left( 2t \right) + \mathbf{j} \cdot \mathbf{y}\_H \left( 2t \right)}{\sqrt{2}} \tag{1}$$

where $y^c(t)$, $y(t)$ and $y_H(t)$ denote the analytic signal at time $t$, the observed signal at time $t$, and the Hilbert-transformed version of the observed signal, respectively. Note that superscript $c$ denotes a complex value in this paper. Since analytic signals provide spectra only over the range $(0, \pi)$, they can be decimated by a factor of two; the $2t$ expresses this decimation. The factor $1/\sqrt{2}$ is applied in order to match the power of the analytic signal to that of the observed one.
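As a concrete illustration, the construction in Eq.(1) can be sketched in a few lines of Python. This is a minimal sketch assuming NumPy and SciPy are available (`scipy.signal.hilbert` returns $y(t) + j\,y_H(t)$); it is not the authors' implementation.

```python
import numpy as np
from scipy.signal import hilbert

def analytic_decimated(y):
    """Decimated analytic signal of Eq.(1):
    y_c(t) = (y(2t) + j * y_H(2t)) / sqrt(2).

    hilbert() returns y(t) + j*H{y}(t); taking every second sample
    performs the decimation by two, and 1/sqrt(2) matches the power
    of the analytic signal to that of the observed one."""
    y_a = hilbert(np.asarray(y, dtype=float))
    return y_a[::2] / np.sqrt(2.0)
```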

#### **2.2. Time-varying complex AR (TV-CAR) model**

The conventional LPC model is defined by


$$Y_{LPC}\left(z^{-1}\right) = \frac{1}{1 + \sum_{i=1}^{I} a_i z^{-i}} \tag{2}$$

where $a_i$ and $I$ are the $i$-th order LPC coefficient and the LPC order, respectively. Since the conventional LPC model cannot express a time-varying spectrum, LPC analysis cannot extract time-varying spectral features from the speech signal. In order to represent time-varying features, the TV-CAR model employs a complex basis expansion:

$$a_i^c\left(t\right) = \sum_{l=0}^{L-1} g_{i,l}^c f_l^c\left(t\right) \tag{3}$$

where $a_i^c(t)$, $I$, $L$, $g_{i,l}^c$ and $f_l^c(t)$ are taken to be the $i$-th complex AR coefficient at time $t$, the AR order, the finite order of the complex basis expansion, a complex parameter, and a complex-valued basis function, respectively. By substituting Eq.(3) into Eq.(2), one obtains the following transfer function; Eq.(4) is the TV-CAR model.

$$Y_{TVCAR}\left(z^{-1}\right) = \frac{1}{1 + \sum_{i=1}^{I}\sum_{l=0}^{L-1} g_{i,l}^c f_l^c\left(t\right) z^{-i}} \tag{4}$$

The input-output relation is defined as


$$y^c\left(t\right) = -\sum_{i=1}^{I}\sum_{l=0}^{L-1} g_{i,l}^c f_l^c\left(t\right)\, y^c\left(t - i\right) + u^c\left(t\right) \tag{5}$$

where $u^c(t)$ and $y^c(t)$ are the complex-valued input and the analytic speech signal of Eq.(1), respectively. In the TV-CAR model, each complex AR coefficient is modeled by a finite number of arbitrary complex basis functions, such as a Fourier basis or a wavelet basis. Note that Eq.(3) parameterizes AR coefficient trajectories that change continuously as a function of time, so the time-varying analysis can estimate a continuously time-varying speech spectrum. In addition, as mentioned above, the complex-valued analysis facilitates accurate spectral estimation in the low frequencies; as a result, this feature allows for more accurate $F_0$ estimation once the formant structure is removed by the inverse filtering. Eq.(5) can be represented in vector-matrix notation as

$$
\begin{aligned}
\bar{y}_f &= -\bar{\Phi}_f \bar{\theta} + \bar{u}_f \\
\bar{\theta}^T &= \left[\bar{g}_0^T, \bar{g}_1^T, \cdots, \bar{g}_l^T, \cdots, \bar{g}_{L-1}^T\right], \qquad
\bar{g}_l^T = \left[g_{1,l}^c, g_{2,l}^c, \cdots, g_{i,l}^c, \cdots, g_{I,l}^c\right] \\
\bar{y}_f^T &= \left[y^c(I), y^c(I+1), y^c(I+2), \cdots, y^c(N-1)\right] \\
\bar{u}_f^T &= \left[u^c(I), u^c(I+1), u^c(I+2), \cdots, u^c(N-1)\right] \\
\bar{\Phi}_f &= \left[\bar{D}_0^f, \bar{D}_1^f, \cdots, \bar{D}_l^f, \cdots, \bar{D}_{L-1}^f\right], \qquad
\bar{D}_l^f = \left[\bar{d}_{1,l}^f, \cdots, \bar{d}_{i,l}^f, \cdots, \bar{d}_{I,l}^f\right] \\
\bar{d}_{i,l}^f &= \left[y^c(I-i) f_l^c(I),\; y^c(I+1-i) f_l^c(I+1),\; \cdots,\; y^c(N-1-i) f_l^c(N-1)\right]^T
\end{aligned}
\tag{6}
$$


where $N$ is the analysis interval, $\bar{y}_f$ is an $(N-I, 1)$ column vector whose elements are the analytic speech signal, $\bar{\theta}$ is an $(L \cdot I, 1)$ column vector whose elements are the complex parameters, and $\bar{\Phi}_f$ is an $(N-I, L \cdot I)$ matrix whose elements are the analytic speech signal weighted by the complex basis. Superscript $T$ denotes transposition.
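The structure of Eq.(6) translates directly into code. The sketch below builds the regressor matrix $\bar{\Phi}_f$ for an arbitrary basis; the function name `build_phi` and the representation of the basis as a list of Python callables are illustrative choices, not part of the original method.

```python
import numpy as np

def build_phi(y_c, I, basis):
    """Regressor matrix of Eq.(6): column d_{i,l} holds y_c(t - i) * f_l(t)
    for t = I .. N-1, stacked over l = 0..L-1 and i = 1..I."""
    N = len(y_c)
    t = np.arange(I, N)                          # analysis interval
    cols = []
    for f in basis:                              # l = 0 .. L-1
        w = np.array([f(ti) for ti in t])        # f_l(t) over the interval
        for i in range(1, I + 1):                # i = 1 .. I
            cols.append(y_c[t - i] * w)
    return np.column_stack(cols)                 # shape (N - I, L * I)
```

For example, `basis = [lambda t: 1.0, lambda t: t]` corresponds to the first-order polynomial basis (1, t) used in the experiments later in this chapter.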

#### **2.3. MMSE-based algorithm [4]**

There are several algorithms that estimate the TV-CAR model parameters from a complex-valued signal, such as MMSE, WLS (Weighted Least Squares), M-estimation, GLS (Generalized Least Squares), and ELS (Extended Least Squares). The MMSE algorithm is the basic one and is used for the initial estimation of the ELS. Before explaining the ELS, the MMSE algorithm is explained.

The MSE criterion is defined by

$$\bar{r}_f = \left[ r^c(I), r^c(I+1), \cdots, r^c(N-1) \right]^T = \bar{y}_f + \bar{\Phi}_f \hat{\theta} \tag{7}$$

$$r^c\left(t\right) = y^c\left(t\right) + \sum_{i=1}^{I}\sum_{l=0}^{L-1} \hat{g}_{i,l}^c f_l^c\left(t\right)\, y^c\left(t - i\right) \tag{8}$$

$$E = \bar{r}_f^H \bar{r}_f = \left(\bar{y}_f + \bar{\Phi}_f\hat{\theta}\right)^H \left(\bar{y}_f + \bar{\Phi}_f\hat{\theta}\right) \tag{9}$$

where $\hat{g}_{i,l}^c$ is the estimated complex parameter, $r^c(t)$ is the equation error, or complex AR residual, and $E$ is the mean squared error (MSE) of the equation error. To obtain the optimal complex AR coefficients, we minimize the MSE criterion. Minimizing the MSE criterion of Eq.(9) with respect to the complex parameters leads to the following MMSE algorithm.

$$\left(\bar{\Phi}_f^H \bar{\Phi}_f\right)\hat{\theta} = -\bar{\Phi}_f^H \bar{y}_f \tag{10}$$

Superscript $H$ denotes Hermitian transposition. After solving the linear equation of Eq.(10), the complex AR parameters $a_i^c(t)$ at time $t$ are obtained by evaluating Eq.(3) with the estimated complex parameters $\hat{g}_{i,l}^c$.
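In code, Eq.(10) is a standard least-squares problem. Below is a minimal sketch reusing the hypothetical `build_phi` above; `np.linalg.lstsq` solves the normal equations implicitly.

```python
import numpy as np

def mmse_tvcar(y_c, I, basis):
    """MMSE estimate of Eq.(10): (Phi^H Phi) theta = -Phi^H y_f."""
    phi = build_phi(y_c, I, basis)
    y_f = y_c[I:]                                 # target vector of Eq.(6)
    theta, *_ = np.linalg.lstsq(phi, -y_f, rcond=None)
    return theta

def equation_error(y_c, I, basis, theta):
    """Complex AR residual of Eq.(8)."""
    return y_c[I:] + build_phi(y_c, I, basis) @ theta
```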

#### **2.4. ELS-based algorithm [5]**


Figure 1 shows the block diagram of the ELS estimation. If the equation error of Eq.(8) is white Gaussian, the MMSE estimation is optimal; however, this is rarely the case, and the MMSE estimation therefore suffers from bias. In the ELS method, an AR filter is adopted to whiten the equation error as follows (Figure 1(2)).

$$r^{c}\left(t\right) = -\sum\_{k=1}^{K} b\_{k}^{c} r^{c}\left(t - k\right) + e^{c}\left(t\right) \tag{11}$$

where $b_k^c$ is the $k$-th parameter of the AR filter of order $K$, and $e^c(t)$ is the zero-mean white Gaussian equation error at time $t$. The inverse filter of Eq.(11) is called a whitening filter. The TV-CAR model can be represented using Eq.(5) and Eq.(11) as follows.

$$y^c\left(t\right) = -\sum_{i=1}^{I}\sum_{l=0}^{L-1} g_{i,l}^c f_l^c\left(t\right)\, y^c\left(t - i\right) - \sum_{k=1}^{K} b_k^c r^c\left(t - k\right) + e^c\left(t\right) \tag{12}$$

Eq.(12) is the ELS model shown in Figure 1(3). In the ELS algorithm the parameters are estimated so as to minimize the MSE of the whitened equation error, whereas in the MMSE algorithm of Figure 1(1) they are estimated so as to minimize the MSE of the equation error itself.

Eq.(12) can be expressed by the following vector-matrix notation.

$$\bar{y}_f = -\bar{\Phi}_f \bar{\theta} - \bar{R}_f \bar{b} + \bar{e}_f = -\left(\bar{\Phi}_f \;\; \bar{R}_f\right)\begin{pmatrix}\bar{\theta}\\ \bar{b}\end{pmatrix} + \bar{e}_f \tag{13}$$

where

$$
\bar{R}_f = \begin{pmatrix}
r^c(I-1) & r^c(I-2) & \cdots & r^c(I-K) \\
r^c(I) & r^c(I-1) & \cdots & r^c(I+1-K) \\
\vdots & \vdots & \ddots & \vdots \\
r^c(N-2) & r^c(N-3) & \cdots & r^c(N-1-K)
\end{pmatrix} \tag{14}
$$

$$\bar{b} = \left[b_1^c, b_2^c, \cdots, b_K^c\right]^T$$

$$\bar{e}_f = \left[e^c(I), e^c(I+1), e^c(I+2), \cdots, e^c(N-1)\right]^T$$


By minimizing the MSE for Eq.(13), one can get the following equation.

$$
\begin{pmatrix}
\bar{\Phi}_f^H\bar{\Phi}_f & \bar{\Phi}_f^H\bar{R}_f \\
\bar{R}_f^H\bar{\Phi}_f & \bar{R}_f^H\bar{R}_f
\end{pmatrix}
\begin{pmatrix}\hat{\theta}\\ \hat{b}\end{pmatrix}
= -\begin{pmatrix}\bar{\Phi}_f^H\bar{y}_f\\ \bar{R}_f^H\bar{y}_f\end{pmatrix} \tag{15}
$$

By applying the well-known matrix inversion lemma to Eq.(15), one can obtain the following equations.

$$\left(\overline{\Phi}\_f^H \overline{\Phi}\_f\right)\hat{\theta}\_{\text{bias}} = \overline{\Phi}\_f^H \overline{R}\_f \hat{b} \tag{16}$$

$$
\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}\_0 - \hat{\boldsymbol{\theta}}\_{\text{bias}} \tag{17}
$$

The MMSE-estimated parameter $\hat{\theta}_0$ contains a biased element $\hat{\theta}_{bias}$; the unbiased estimate is calculated as $\hat{\theta} = \hat{\theta}_0 - \hat{\theta}_{bias}$. The ELS algorithm is equivalent to the GLS (Generalized Least Squares) algorithm and is a more sophisticated algorithm. Since the equation error $r^c(t)$ cannot be observed, an iterative algorithm is required in which A($z$) and B($z$) are estimated in turn. The iteration procedure is as follows.

1. The initial estimate $\hat{\theta}_0$ is obtained by MMSE (Eq.(10)).
2. The equation error is calculated by Eq.(8).
3. $\hat{b}$ is estimated so as to minimize Eq.(18) using $r^c(t)$.
4. The bias parameter $\hat{\theta}_{bias}$ is calculated by Eq.(16).
5. The unbiased parameter $\hat{\theta}$ is calculated by Eq.(17).
6. Go to 2.



$$\frac{1}{2\pi j} \oint_{|z|=1} \left| R\left( z \right) B\left( z \right) \right|^2 \frac{dz}{z} \;\to\; \min \tag{18}$$

In Eq.(18), $R(z)$ is the $z$-transform of $r^c(t)$ and $B(z)$ is the transfer function of the whitening filter. Steps 2 to 5 are iterated a predetermined number of times; the ELS algorithm thus estimates the two AR filters, A($z$) and B($z$), iteratively. Since the ELS algorithm can estimate a speech spectrum that is unbiased and less affected by additive noise, more accurate $F_0$ and formant frequencies, and hence more accurate $F_0$ trajectories, can be estimated than with the MMSE estimation.

**Figure 1.** Block diagrams of MMSE and ELS estimation.
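The iteration can be sketched as follows. This is an illustrative reading of steps 1 to 6, in which the whitening filter of Eq.(11) is fitted by ordinary least squares as a stand-in for the minimization of Eq.(18), and the residual matrix $\bar{R}_f$ of Eq.(14) is zero-padded at the start of the interval; neither simplification is claimed to match the authors' implementation.

```python
import numpy as np

def els_tvcar(y_c, I, basis, K=2, n_iter=3):
    """ELS estimation of the TV-CAR parameters, steps 1-6 (sketch)."""
    phi = build_phi(y_c, I, basis)
    y_f = y_c[I:]
    theta0, *_ = np.linalg.lstsq(phi, -y_f, rcond=None)   # step 1: MMSE, Eq.(10)
    theta = theta0
    for _ in range(n_iter):
        r = y_f + phi @ theta                             # step 2: Eq.(8)
        # step 3: whitening filter r(t) = -sum_k b_k r(t-k) + e(t), fitted by LS
        R = np.column_stack([r[K - k: len(r) - k] for k in range(1, K + 1)])
        b, *_ = np.linalg.lstsq(R, -r[K:], rcond=None)
        # step 4: bias term via Eq.(16), with R_f zero-padded at the start
        Rf = np.column_stack([np.concatenate((np.zeros(k, dtype=complex), r[:-k]))
                              for k in range(1, K + 1)])
        theta_bias, *_ = np.linalg.lstsq(phi, Rf @ b, rcond=None)
        theta = theta0 - theta_bias                       # step 5: Eq.(17)
    return theta                                          # step 6: iterate
```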


#### **2.5. Benefit of robust TV-CAR speech analysis**

In this paragraph, we explain the benefit of the robust TV-CAR speech analysis by showing the estimated speech spectrum, and we explain its effectiveness for $F_0$ estimation of speech. Figure 2 shows an example of the speech spectra estimated by complex analysis for the analytic signal and by conventional LPC analysis for the speech signal, for the natural Japanese vowel /o/.

**Figure 2.** Estimated Spectra of vowel /o/ with complex and conventional LPC analysis.

In Figure 2, the left side shows the estimated spectra: the upper panel is for real-valued LPC analysis and the lower panel for complex-valued LPC analysis. The blue line is the spectrum estimated by LPC analysis and the green line is the estimated DFT spectrum. The right side shows the poles of the estimated AR filter. Figure 3 shows the estimated running spectra for the clean natural speech /arayu/ and for the same speech corrupted by white Gaussian noise (10[dB]). In Figure 3, (1) is the speech waveform, while (2), (3), (4), (5) and (6) are the spectra estimated by MMSE-based time-invariant real-valued AR speech analysis, MMSE-based time-invariant complex-valued AR speech analysis (L=1), MMSE-based time-varying complex AR (TV-CAR) speech analysis (L=2), ELS-based time-invariant complex-valued AR speech analysis (L=1), and ELS-based time-varying complex AR (TV-CAR) speech analysis (L=2), respectively. The analysis order $I$ is 14 for the real analysis and 7 for the complex analysis. The basis function is the first-order polynomial function (1, t). One can observe that the complex analysis estimates a more accurate spectrum in the low frequencies, whereas the accuracy degrades in the high frequencies. Since the speech spectrum carries much of its energy in the low frequencies, the high spectral estimation accuracy in the low frequencies is expected to improve the $F_0$ estimation performance. Furthermore, the ELS analysis estimates a more accurate spectrum than MMSE, so the ELS analysis makes it possible to estimate a more accurate $F_0$. Time-varying analysis can estimate the time-varying spectrum of speech; it is expected to estimate $F_0$ more accurately since $F_0$ varies within the analysis interval.

**Figure 3.** Estimated spectrum for noise corrupted speech /arayu/ (10[dB]).


### **3. F0 Estimation method**

The proposed method employs a two-stage search for F0. In the first stage (pre-selection), F0 and F1 are estimated by sample-based F0 contour estimation [8]. In the second stage (final-selection), F0 is estimated by frame-based F0 estimation [3] within a limited range based on the pre-estimated F0 and F1. The two-stage estimation reduces the computation with little degradation. In 3.1, the pre-selection algorithm is explained; in 3.2, the final-selection algorithm is explained.

#### **3.1. Sample-based pre-selection**

F0 and F1 are estimated as the lowest two peak frequencies, viz., the glottal and first formant frequencies, by peak-picking on the estimated time-varying speech spectrum. The procedure of F0 and F1 contour estimation is shown in Figure 4:

**1.** The set of complex-valued parameters $\hat{g}_{i,l}^c$ is estimated by the ELS algorithm for each analysis frame.

**2.** By using Eq.(3) and Eq.(4) with the estimated parameters $\hat{g}_{i,l}^c$, the speech power spectrum for each sample $t$ is calculated, and the two peaks of the estimated spectrum are found by peak-picking.

**Figure 4.** Flow of F0 and F1 contour estimation

The peak-picking is carried out from low frequency to high frequency, as shown in Figure 5. The two estimated peaks correspond to the glottal formant (F0) and the first formant (F1). The formant frequencies are estimated by solving the equation of the reciprocal of Eq.(4).

**Figure 5.** Peak Picking
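A compact sketch of the per-sample peak-picking: evaluate the TV-CAR power spectrum from the AR coefficients $a_i^c(t)$ and take the two lowest-frequency local maxima. The FFT-based spectrum evaluation and the simple local-maximum test are illustrative simplifications (the chapter obtains the formant frequencies by solving for the roots of the reciprocal of Eq.(4)), and at least two peaks are assumed to exist.

```python
import numpy as np

def pre_select(a_t, fs, n_fft=1024):
    """Return (F0, F1): the two lowest peaks of 1/|A(e^{jw})|^2 at one sample.

    a_t : complex AR coefficients a_i(t), i = 1..I, from Eq.(3)
    fs  : sampling rate of the decimated analytic signal"""
    A = np.fft.fft(np.concatenate(([1.0 + 0j], a_t)), n_fft)
    spec = 1.0 / (np.abs(A[: n_fft // 2]) ** 2 + 1e-12)   # power spectrum
    peaks = [k for k in range(1, n_fft // 2 - 1)
             if spec[k - 1] < spec[k] > spec[k + 1]]      # low to high frequency
    f = [k * fs / n_fft for k in peaks[:2]]
    return f[0], f[1]
```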


#### **3.2. Frame-based final-selection**

In frame-based F0 estimation, the autocorrelation or the AMDF is commonly used. In this paragraph, the autocorrelation and the AMDF are explained, and then the adopted weighted autocorrelation is explained.

The autocorrelation function (AUTOC) is defined by

$$f\left(\tau\right) = \frac{1}{N} \sum_{t=0}^{N-1} x(t)\,x(t + \tau) \tag{19}$$

where x(t) is the target signal, such as the speech signal or the LPC residual, N is the frame length, and τ denotes the delay. F0 is selected as the peak frequency of Eq.(19) within a certain F0 range.

The AMDF is defined as follows.

$$p\left(\tau\right) = \frac{1}{N} \sum\_{t=0}^{N-1} \left| \mathbf{x}\left(t\right) - \mathbf{x}\left(t + \tau\right) \right| \tag{20}$$

F0 is selected as the notch frequency of Eq.(20) within a certain F0 range. In the Shimamura method [6], the AUTOC is weighted by the reciprocal of the AMDF, as shown in Eq.(21). Since the weighting suppresses spurious peaks, the method can estimate F0 more accurately than AUTOC or AMDF alone. The value of m is set to 1 in order to avoid a value of 0 in the denominator.

$$G(\tau) = \frac{f(\tau)}{p(\tau) + m} \tag{21}$$


where f(τ) and p(τ) are the AUTOC of Eq.(19) and the AMDF of Eq.(20), respectively. In the frame-based method, the Shimamura criterion of Eq.(21) is applied to the complex AR residual extracted by the ELS-based TV-CAR speech analysis. The time-varying complex parameters are estimated, and the complex AR residual is calculated with the estimated complex parameters via Eq.(17). Note that pre-emphasis is applied for speech analysis such as real-valued AR or TV-CAR speech analysis, whereas the inverse filtering is applied to the non-pre-emphasized speech signal so as not to eliminate the F0 spectrum from the residual signal. The real part of the AUTOC is used to calculate the AUTOC for the complex-valued signal. F0 is estimated within the range corresponding to 50-400[Hz]. In order to reduce the computational amount, the range is shortened by setting the upper value as follows.

$$\min \left( F\_0^S + \left( F\_1^S - F\_0^S \right) \delta / 100, 400 \right) \tag{22}$$

where $F_0^s$ and $F_1^s$ are the F0 and F1 estimated by the sample-based pre-selection. Setting the upper bound below F1 not only reduces the computational cost but also reduces the estimation error.
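Putting Eqs.(19)-(22) together, the final selection reduces to a lag search with the weighted criterion. The sketch below truncates the sums at the frame boundary, which is an assumption rather than the chapter's exact windowing.

```python
import numpy as np

def final_select_f0(residual, fs, f0_pre, f1_pre, delta=25.0, m=1.0):
    """Frame-based final selection: maximize G(tau) of Eq.(21)
    over the shortened F0 range of Eq.(22)."""
    x = np.real(residual)                  # real part of the complex AR residual
    N = len(x)
    f_hi = min(f0_pre + (f1_pre - f0_pre) * delta / 100.0, 400.0)   # Eq.(22)
    best_tau, best_g = int(fs / f_hi), -np.inf
    for tau in range(int(fs / f_hi), min(int(fs / 50.0), N - 1) + 1):
        f_ac = np.mean(x[: N - tau] * x[tau:])           # AUTOC, Eq.(19)
        p_am = np.mean(np.abs(x[: N - tau] - x[tau:]))   # AMDF,  Eq.(20)
        g = f_ac / (p_am + m)                            # Eq.(21)
        if g > best_g:
            best_tau, best_g = tau, g
    return fs / best_tau
```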

#### **4. Experiments**

The speech signals used in the experiments are 5 long sentences uttered by 5 male speakers and 5 long sentences uttered by 5 female speakers from the Keele pitch database [9]. The speech signals are filtered by an IRS filter [10]; the IRS filter is a band-pass FIR filter whose frequency response corresponds to that of the analog part of the transmitter of telephone equipment. Its frequency response is shown in Figure 6. In order to evaluate the proposed method on speech data processed by speech coding, the IRS filter has to be introduced, as in [2]. The experimental conditions are summarized in Table 1. The frame length is 25.6[msec] and the frame shift length is 10[msec]. The analysis orders are 14 and 7 for the real-valued and complex-valued analyses, respectively. The basis expansion order L is set to 1 (time-invariant) or 2 (time-varying) in the experiments, with the first-order polynomial function adopted as the basis function. White Gauss noise or pink noise [11] is adopted as the additive noise, at levels of 30, 20, 10, 5, 0, and -5 [dB]. In order to extract a more accurate F0, 3-point Lagrange's interpolation is adopted. The commonly used criterion for F0 estimation, the Gross Pitch Error (GPE), is adopted for the objective evaluation. The F0 estimation error is defined as


$$e_p(n) = F_e(n) - F_t(n) \tag{23}$$

where $F_t(n)$ is the true F0 value and $F_e(n)$ is the estimated one. The true values are derived from the pitch files in the Keele database. In Eq.(23), if $|e_p(n)| \geq F_t(n) \times THR/100$, the estimate is regarded as an ERROR, and GPE is the probability of error frames. Otherwise, the estimate is regarded as a SUCCESS, and FPE is the standard deviation of the error. Figures 7, 8, 9 and 10 show the experimental results with THR set to 10[%]. Figures 7 and 9 show the results for male speech; Figures 8 and 10 show the results for female speech. In the figures, (1) shows the GPEs or FPEs for additive white Gauss noise and (2) shows those for additive pink noise. PROPOSED denotes the GPEs or FPEs of the proposed method with δ set to 25. SP denotes the Shimamura method [6], viz., the Shimamura criterion applied to the speech signal. The other lines show the GPEs or FPEs for the analysis methods listed in Table 2. In all figures, the X-axis is the noise level (30, 20, 10, 5, 0, −5[dB]) and the Y-axis is GPE[%] or FPE[Hz].
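The two scores can be computed from the frame-wise F0 tracks as follows; this is a minimal sketch assuming both tracks cover the same voiced frames.

```python
import numpy as np

def gpe_fpe(f_true, f_est, thr=10.0):
    """GPE / FPE over voiced frames, following Eq.(23) and the THR rule."""
    f_true = np.asarray(f_true, dtype=float)
    f_est = np.asarray(f_est, dtype=float)
    e = f_est - f_true                                  # Eq.(23)
    gross = np.abs(e) >= f_true * thr / 100.0           # ERROR frames
    gpe = 100.0 * np.mean(gross)                        # GPE in percent
    fpe = float(np.std(e[~gross])) if np.any(~gross) else 0.0   # FPE in Hz
    return gpe, fpe
```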

Figures 7 and 8 demonstrate that the proposed method performs slightly better than the full-search method (TVC\_E) for male speech, while it performs equivalently to the full-search method (TVC\_E) for female speech. Figures 9 and 10 show that the proposed method does not perform as well in terms of FPE, whereas the Shimamura method performs better in terms of FPE.


| Condition | Setting |
|---|---|
| **Speech data** | Keele Pitch database [9]: 5 long sentences by male speakers and 5 by female speakers; sampling 10 kHz / 16 bit |
| **IRS filter** | 64-th order FIR [10] |
| **Analysis window** | window length 25.6[ms], shift length 10.0[ms] |
| **Pre-emphasis** | $1 - z^{-1}$ |
| **Target signal** | complex AR residual |
| **Complex-valued AR** | $I = 7$, $L = 2$ (time-varying) |
| ***F0* search range** | 50[Hz] up to Eq.(22) |
| **Criterion** | AUTOC/AMDF [6] |
| **Interpolation** | 3-point Lagrange's |
| **Noise** | (1) white Gauss noise, (2) pink noise [11] |
| **Noise level** | 30, 20, 10, 5, 0, −5[dB] |

**Table 1.** Experimental Conditions
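For reference, the noise conditions in Table 1 amount to scaling a noise record to a prescribed SNR before adding it to the speech. Below is a minimal sketch; the trimming of the noise record to the speech length is an assumption, not a stated detail of the chapter.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix noise into speech at a given SNR in dB (e.g., 30 down to -5)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    gain = np.sqrt(np.mean(speech ** 2) /
                   (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```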



**Figure 6.** Frequency response of IRS filter

| Line | Style | Real or Complex | Non or TV | MMSE or ELS |
|---|---|---|---|---|
| **LPC** | red dotted | Real | Non | MMSE |
| **TVR** | blue dotted | Real | TV | MMSE |
| **LPC\_E** | magenta dotted | Real | Non | ELS |
| **TVR\_E** | green dotted | Real | TV | ELS |
| **CLPC** | red solid | Complex | Non | MMSE |
| **TVC** | blue solid | Complex | TV | MMSE |
| **CLPC\_E** | magenta solid | Complex | Non | ELS |
| **TVC\_E** | green solid | Complex | TV | ELS |

**Table 2.** Analysis methods

**Figure 7.** Experimental Results for Male speech

**Figure 8.** Experimental Results for Female speech

**Figure 9.** Experimental Results for Male speech

**Figure 10.** Experimental Results for Female speech


#### **5. Conclusions**

This paper proposed a fast and robust fundamental frequency estimation algorithm based on robust TV-CAR speech analysis. The method provides a two-stage search procedure: pre-selection and final-selection. In the pre-selection, F0 and F1 are estimated by time-varying F0 contour estimation. In the final-selection, F0 is estimated only over the shortened range based on the pre-selected F0 and F1. The proposed method performs better for male speech in terms of GPE, with reduced computation.

#### **Acknowledgements**

This work was supported by a Grant-in-Aid for Scientific Research (C), Research Project Number 20500158.

#### **Author details**

Keiichi Funaki1\* and Takehito Higa2

\*Address all correspondence to: funaki@cc.u-ryukyu.ac.jp

1 Computing & Networking Center, University of the Ryukyus, Okinawa, Japan

2 Graduate School of Engineering and Science, University of the Ryukyus, Okinawa, Japan

#### **References**

[1] A. de Cheveigné and H. Kawahara (2002). YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 111(4), 1917-1930.

[2] K. Funaki, et al. (2007). Robust F0 Estimation Based on Complex LPC Analysis for IRS Filtered Noisy Speech. IEICE Trans. on Fundamentals, Aug., E90-A(8).

[3] K. Funaki (2008). F0 estimation based on robust ELS complex speech analysis. Proc. EUSIPCO-2008, Lausanne, Switzerland, Aug.

[4] K. Funaki, Y. Miyanaga and K. Tochinai (1998). On a time-varying complex speech analysis. Proc. EUSIPCO-98, Rhodes, Greece, Sep.

[5] K. Funaki (2001). A time-varying complex AR speech analysis based on GLS and ELS method. Proc. EUROSPEECH2001, Aalborg, Denmark, Sep.

[6] T. Shimamura and H. Kobayashi (2001). Weighted Autocorrelation for Pitch Extraction of Noisy Speech. IEEE Trans. Speech and Audio Processing, 9(7), 727-730.

[7] K. Funaki (2010). On Evaluation of the F0 Estimation Based on Time-Varying Complex Speech Analysis. Makuhari, Japan, Sep.

[8] K. Funaki (2011). F0 Contour Estimation Using ELS-based Robust Time-Varying Complex Speech Analysis. IEEE DSP/SPE Workshop, Sedona, AZ, USA, Jan.

[9] Keele Pitch Database, University of Liverpool. http://www.liv.ac.uk/Psychology/hmp/projects/pitch.html

[10] ITU-T Recommendation G.191 (2000). Software tools for speech and audio coding standardization. Nov.

[11] NOISEX-92. http://spib.rice.edu/spib/selectnoise.html

**Section 2**

**Optical Signal Processing**

**Chapter 3**

**Optical Signal Processing: Data Exchange**


Jian Wang and Alan E. Willner

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/52205

#### **1. Introduction**

Optical signal processing is considered an attractive technique for fast signal manipulation in the optical domain, since it avoids cumbersome optical-electrical-optical (OEO) conversions [1]. Driven by the rapid increase of traffic rates, network capacity and complexity, advanced optical networks raise the significance of data traffic grooming and require different optical signal processing functions at network nodes to achieve enhanced network efficiency and flexibility. Typical optical signal processing operations include wavelength conversion, logic gates, format conversion, delay for buffering, regeneration, add/drop, (de)multiplexing, multicasting, etc. [2-14]. One may note that most of these functions work in a similar fashion of unidirectional information transfer. For example, wavelength conversion copies information from one wavelength and transfers it onto another wavelength [2]. To achieve superior network performance, bidirectional information swapping, named data exchange, would be expected to provide enhanced flexibility of optical signal processing compared to unidirectional information transfer [15].

Generally speaking, as an important concept for efficiently utilizing network resources and improving network performance, data exchange refers to information swapping between different wavelengths, time-slots, polarizations, or other degrees of freedom. In the wavelength domain (e.g., a wavelength-division multiplexed (WDM) network), data exchange, also known as wavelength exchange or wavelength interchange, requires swapping the data on one wavelength with the data on another wavelength. Extensions of data exchange include data swapping between different time-slots in the time domain (e.g., an optical time-division multiplexed (OTDM) network), between different polarization states in the polarization domain (e.g., a polarization-multiplexed (pol-muxed) network), and between different "twisted" light beams carrying different orbital angular momentum (OAM) values in the phase front domain (e.g., an OAM-multiplexed network). Moreover, the recently increasing interest in advanced modulation formats [16, 17] requires data exchange to be available for different modulation formats, such as on-off keying (OOK), differential phase-shift keying (DPSK), differential quadrature phase-shift keying (DQPSK), pol-muxed, etc.

© 2013 Wang and Willner; licensee InTech. This is a paper distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The emergence of nonlinear optics has triggered increased interest and paved a potential way to develop optical signal processing in high-speed optical networks [18, 19]. Optical nonlinearities (e.g., χ(2) and χ(3)), including difference-frequency generation (DFG) [20, 21], cascaded sum- and difference-frequency generation (cSFG/DFG) [22-26], degenerate/non-degenerate four-wave mixing (FWM) [27-47], and Kerr-induced nonlinear polarization rotation [48-50], are potentially suitable candidates to enable data exchange. In some cases, simple linear optics may also provide an alternative approach to facilitating data exchange [51, 52]. To keep pace with the rapid development of high-speed, large-capacity optical communications with emerging multiplexing/demultiplexing techniques and advanced modulation formats, as shown in Fig. 1, a laudable goal would be to achieve robust data exchange in different degrees of freedom (wavelength, time, polarization, phase front), for different modulation formats (OOK, DPSK, DQPSK, pol-muxed), and at different granularities (entire data, groups of bits, tributary channels).

**Figure 1.** Schematic illustration of robust data exchange.

In this chapter, we aim to provide a comprehensive review of research works towards robust data exchange using various optical nonlinearities [22-50, 53] and simple linear optics [51, 52]. Several kinds of optical nonlinearities are employed: (1) cSFG/DFG in a periodically poled lithium niobate (PPLN) waveguide; (2) non-degenerate FWM in a highly nonlinear fiber (HNLF); (3) bidirectional degenerate FWM in an HNLF; (4) Kerr-induced nonlinear polarization rotation in an HNLF; (5) conversion/dispersion-based tunable delays. We start with a conceptual description of data exchange followed by state-of-the-art results.

#### **2. Concept of data exchange**

Robust data exchange in the wavelength, time, polarization and phase front domains might be valuable for superior network performance. As an example, a desirable goal of data exchange would be to efficiently utilize nonlinearities in the wavelength domain, such that the data between two different wavelengths can be "exchanged", i.e., swapped, using nonlinear processes in a single device [54]. Figure 2(a) illustrates the basic concept of data exchange in the wavelength domain (wavelength exchange/interchange), which is a wavelength-domain data manipulation enabling the swapping of data between two different wavelengths. One straightforward way, as shown in Fig. 2(b), is to use two separate wavelength converters (WCs), with one performing the wavelength conversion from signal A (Sig. A) to signal B (Sig. B), and the other from signal B to signal A. Towards single-device operation, one simple way of performing data exchange in the wavelength domain is to exploit the combined signal depletion and wavelength conversion effects in a nonlinear device such as a piece of HNLF or a PPLN waveguide [55-58]. Non-degenerate FWM (χ(3)) in an HNLF [29-45] and cascaded second-order nonlinearities (χ(2) : χ(2)) in a PPLN waveguide [22-26] are potential choices to realize such data exchange. As shown in Fig. 2(c), due to the signal depletion and wavelength conversion effects, the data carried by signal A is consumed and converted to the wavelength of signal B, and vice versa. This enables single-device-based data exchange in the wavelength domain. Similar concepts of data exchange in the time, polarization and phase front domains are also available, enabled by various optical nonlinearities or linear optics.

**Figure 2.** (a) Concept of data exchange in the wavelength domain. (b) Data exchange by two separate wavelength converters (WCs). (c) An example of data exchange by signal depletion and wavelength conversion in a single nonlinear device.

#### **3. Recent advances for robust data exchange**

#### **3.1. Data exchange using cSFG/DFG in a single PPLN waveguide [22-26]**

As depicted in Fig. 2(c), data exchange based on the signal depletion and wavelength conversion of cSFG/DFG involves two signals and two pumps, which can be described by the coupled-mode equations. To better understand the single-PPLN-based data exchange, under the slowly varying amplitude approximation, we can derive the following analytical solutions for the complex amplitudes of signal A ($A_{SA}(L)$) and signal B ($A_{SB}(L)$) after data exchange [22]

$$\begin{aligned} A_{SA}(L) &= A_{SA}(0) + \frac{\omega_{SA}\omega_{SF}\kappa_1}{M^2} A_{P1}^*(0)\left[\kappa_1 A_{P1}(0)A_{SA}(0) + \kappa_2 A_{P2}(0)A_{SB}(0)\right]\left[\cos(ML) - 1\right] \ \text{(a)}\\ A_{SB}(L) &= A_{SB}(0) + \frac{\omega_{SB}\omega_{SF}\kappa_2}{M^2} A_{P2}^*(0)\left[\kappa_1 A_{P1}(0)A_{SA}(0) + \kappa_2 A_{P2}(0)A_{SB}(0)\right]\left[\cos(ML) - 1\right] \ \text{(b)} \end{aligned} \tag{1}$$

where $M = \sqrt{\omega_{SA}\omega_{SF}\kappa_1^2 P_{P1}(0) + \omega_{SB}\omega_{SF}\kappa_2^2 P_{P2}(0)}$. $A_{SA}(0)$, $A_{SB}(0)$, $A_{P1}(0)$ and $A_{P2}(0)$ are the input complex amplitudes of signal A, signal B, pump 1 and pump 2, respectively; $P_{P1}(0)$ and $P_{P2}(0)$ are the input powers of pump 1 and pump 2; $\kappa_1$ ($\kappa_2$) is the coupling coefficient of the second-order nonlinear interaction involving signal A (signal B) and pump 1 (pump 2); $\omega_{SA}$, $\omega_{SB}$ and $\omega_{SF}$ are the angular frequencies of signal A, signal B and the sum-frequency (SF) wave, respectively; and $L$ is the waveguide length.

By ignoring the initial pump phases and setting the same power for the two input pumps, we can further simplify Eqs. (1a)(1b) as follows

$$\begin{aligned} A_{SA}(L) &= \frac{\cos(ML) + 1}{2} A_{SA}(0) + \frac{\cos(ML) - 1}{2} A_{SB}(0) \ \text{(a)}\\ A_{SB}(L) &= \frac{\cos(ML) - 1}{2} A_{SA}(0) + \frac{\cos(ML) + 1}{2} A_{SB}(0) \ \text{(b)} \end{aligned} \tag{2}$$

When the following relationship is satisfied,

$$ML = (2N+1)\pi, \quad N = 0, 1, 2, 3, \cdots \tag{3}$$

we can obtain

$$\begin{aligned} A_{SA}(L) &= -A_{SB}(0) \ \text{(a)}\\ A_{SB}(L) &= -A_{SA}(0) \ \text{(b)} \end{aligned} \tag{4}$$

From Eq. (4) it can be clearly seen that data exchange between signal A and signal B is achieved under the exchange condition governed by Eq. (3). In particular, beyond data exchange for OOK signals, the complex-amplitude relationship in Eq. (4) also implies the modulation-format-transparency characteristic of PPLN-based data exchange.
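As a quick numerical check, the following Python sketch (illustrative, not part of the original work) applies the simplified transfer relations of Eqs. (2a)(2b) and verifies that the exchange condition of Eq. (3) yields the complete swap of Eq. (4), while other interaction lengths leave residual crosstalk:

```python
import numpy as np

def ppln_exchange(a_sa0: complex, a_sb0: complex, ML: float):
    """Apply the simplified transfer relations of Eqs. (2a)(2b) to the
    input complex amplitudes of signal A and signal B."""
    c_plus = (np.cos(ML) + 1) / 2
    c_minus = (np.cos(ML) - 1) / 2
    return (c_plus * a_sa0 + c_minus * a_sb0,
            c_minus * a_sa0 + c_plus * a_sb0)

# Two arbitrary complex symbols (amplitude and phase) on signals A and B.
a_sa0 = 1.0 * np.exp(1j * 0.3)
a_sb0 = 0.7 * np.exp(1j * 1.2)

# At the exchange condition ML = (2N+1)*pi of Eq. (3), Eq. (4) predicts a
# complete swap with a common pi phase shift.
a_sa_L, a_sb_L = ppln_exchange(a_sa0, a_sb0, ML=np.pi)
assert np.allclose([a_sa_L, a_sb_L], [-a_sb0, -a_sa0])

# Away from the exchange condition the swap is only partial.
print(ppln_exchange(a_sa0, a_sb0, ML=np.pi / 2))
```

Because the relation acts on complex amplitudes rather than on intensities, the same check passes for any input phases, which is the modulation-format-transparency property noted above.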

Following a similar principle of PPLN-based data exchange using the signal depletion and wavelength conversion of cSFG/DFG, we can further perform robust data exchange functions, including time- and channel-selective data exchange between WDM channels [23, 24] and low-speed tributary channel exchange of high-speed OTDM signals [25, 26].


The conceptual diagram of the proposed single-PPLN-based time- and channel-selective data exchange between WDM channels is illustrated in Fig. 3 [23, 24]. Multiple WDM channels (S1-S4) and two synchronized gated pumps (PA, PB) are coupled into a PPLN waveguide, in which the cSFG/DFG processes take place. The wavelength selectivity of the quasi-phase matching (QPM) condition allows selection of the channels for data exchange by proper choice of the two pump wavelengths. For proper QPM of both cSFG/DFG processes, the two pump wavelengths are nearly symmetric to the two exchanged data wavelengths with respect to the QPM wavelength. For instance, as illustrated in Fig. 3, within the gated pump pulse duration, PB mixes with S1 to produce an SF wave through the sum-frequency generation (SFG) process. Meanwhile, the SF wave interacts with PA to generate a new idler at the wavelength of S2 by the subsequent difference-frequency generation (DFG) process. During such nonlinear interactions, S1 can be depleted and converted to S2 by means of proper control of the pump powers. Similarly, PA and S2 participate in the SFG process to create an SF wave, which simultaneously interacts with PB to yield an idler at the wavelength of S1 via the DFG process. Thus, S2 can also be consumed, with its data copied onto S1. Consequently, it is expected to implement optical data exchange between S1 and S2 without using additional spectrum or touching other channels. Note that time- and channel-selective data exchange in specific time-slots (groups of bits) and between selected WDM channels can be accomplished by appropriately choosing the gated pump pulse duration and adjusting the pump wavelengths.

**Figure 3.** Concept and principle of single-PPLN-based time- and channel-selective data exchange between WDM channels.

We first demonstrate the data exchange between two 10-Gbit/s signals. Two gated pumps with a duty cycle of 1/127 and a pulse duration of ~3.2 ns are employed. The average power of each signal and peak power of each pump coupled into the PPLN waveguide are about 4 mW and 1 W, respectively. Figure 4 displays the observed temporal wave‐ forms and eye diagrams of data exchange. The time-slots between the two straight lines correspond to the gated pump pulse duration, in which data exchange occurs. When S1 and the two pumps are on while S2 is off, the data of S1 within the gated pump pulse duration is depleted (a2) and converted to the wavelength of S2 (b3). Similarly, we can al‐ so observe the depletion of S2 (b2) and the conversion from S2 to S1 (a3) by switching S1 off and S2 on. In the case of simultaneously turning on the two signals and the two pumps, it is found that groups of bits data exchange between the two signals (S1 to S2: (b4), S2 to S1: (a4)) within the gated pump pulse duration is successfully achieved.
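The groups-of-bits behaviour can be mimicked with a toy model. In the illustrative Python sketch below, the 32-bit gate width follows from the ~3.2 ns pulse at 10 Gbit/s stated above; the gate position is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
gate_bits = 32                    # ~3.2 ns pump pulse = 32 bit slots at 10 Gbit/s
period_bits = 127 * gate_bits     # 1/127 duty cycle -> one gating period

s1 = rng.integers(0, 2, period_bits)   # bit stream on S1
s2 = rng.integers(0, 2, period_bits)   # bit stream on S2

gate = np.zeros(period_bits, dtype=bool)
gate[1000:1000 + gate_bits] = True     # pump pulse position (arbitrary choice)

# Inside the gate, signal depletion + wavelength conversion swap the data;
# outside the gate, both channels pass through untouched.
s1_out = np.where(gate, s2, s1)
s2_out = np.where(gate, s1, s2)

assert np.array_equal(s1_out[gate], s2[gate])    # 32-bit group exchanged
assert np.array_equal(s2_out[gate], s1[gate])
assert np.array_equal(s1_out[~gate], s1[~gate])  # rest of the stream untouched
```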


**Figure 4.** Measured (a1-a4)(b1-b4) temporal waveforms and (a5)(a6)(b5)(b6) eye diagrams of 10-Gbit/s groups-of-bits data exchange.

We further demonstrate the single-PPLN-based channel-selective data exchange for multiple WDM channels at 40 Gbit/s. Four WDM channels (S1: 1535.5 nm, S2: 1539.4 nm, S3: 1543.3 nm, S4: 1547.2 nm) are employed in the experiment. It is possible to perform a channel-selective data exchange by simply tuning the wavelengths of the two pumps. Figure 5 displays the measured typical eye diagrams and bit-error rate (BER) performance for channel-selective data exchange between WDM channels. The power penalty of the 40-Gbit/s channel-selective data exchange is estimated to be less than 4 dB at a BER of 10⁻⁹.
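The channel selectivity follows from the QPM symmetry rule quoted earlier: each pump is placed, in frequency, symmetric to its target channel about the QPM point. A hedged sketch follows; the QPM wavelength is an assumed, illustrative value, since the text does not give it:

```python
C = 299_792_458.0  # speed of light (m/s)

def thz(lam_nm: float) -> float:
    """Optical frequency in THz for a vacuum wavelength in nm."""
    return C / (lam_nm * 1e-9) / 1e12

def nm(f_thz: float) -> float:
    return C / (f_thz * 1e12) * 1e9

# Channel wavelengths from the 40-Gbit/s experiment; the QPM wavelength
# below is an assumed value chosen only for illustration.
channels = {"S1": 1535.5, "S2": 1539.4, "S3": 1543.3, "S4": 1547.2}
f_qpm = thz(1541.35)

def pump_for(signal_nm: float) -> float:
    """Pump placed symmetric to the signal about the QPM frequency,
    so their sum-frequency satisfies the QPM condition."""
    return nm(2 * f_qpm - thz(signal_nm))

# Exchanging S1 and S2: one pump pairs with each selected channel.
print(f"pump for S1: {pump_for(channels['S1']):.2f} nm")
print(f"pump for S2: {pump_for(channels['S2']):.2f} nm")
# Retuning the pumps to pair with, e.g., S3 and S4 selects those channels instead.
```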

**Figure 5.** Measured eye diagrams and BER performance of 40-Gbit/s time- (groups of bits) and channel-selective data exchange between four WDM channels.

Figure 6 illustrates the concept and principle of single-PPLN-based tributary channel exchange between two WDM high-speed OTDM signals [25, 26]. A PPLN waveguide is employed as the nonlinear device to perform the tributary channel exchange. Two WDM high-speed signals (S1, S2), each consisting of many low-speed time-division multiplexed tributary channels (e.g., 16 10-Gbit/s tributary channels for a 160-Gbit/s signal), together with two synchronized subrate clock (e.g., 10 GHz) pumps, are launched into the PPLN waveguide for the tributary channel exchange. The wavelengths of the two signals and two pumps are properly arranged to be symmetric (S1&P1, S2&P2) with respect to the QPM wavelength of the PPLN. Inside the PPLN waveguide, the two signals and two pumps participate in the cSFG/DFG nonlinear interactions, in which the photons of S1 (S2) and P1 (P2) are annihilated to produce photons of the SF wave, which are simultaneously consumed to generate photons of S2 (S1) and P2 (P1). Due to the signal depletion and wavelength conversion effects, with proper adjustment of the pump powers, S1 can be depleted and converted to S2. Similarly, S2 can be extinguished to generate S1. As a result, data exchange between the two signals (S1, S2) can be implemented. In particular, by exploiting two synchronized subrate (e.g., 10 GHz) clock pumps time aligned to one of the tributary channels of two WDM high-speed OTDM signals (e.g., 160 Gbit/s), it is possible to achieve tributary channel exchange (e.g., at 10 Gbit/s) between the two OTDM signals. As shown in the example of Fig. 6, tributary channel i (Ch. i) of the two WDM high-speed OTDM signals is exchanged by using the signal depletion and wavelength conversion effects of cSFG/DFG in a PPLN waveguide.

**Figure 6.** Concept and principle of single-PPLN-based tributary channel exchange between two WDM high-speed OTDM signals.

Figure 7 displays the eye diagrams, measured by an optical sampling scope, for the tributary channel exchange (Ch. 1). Two 10-GHz clock pumps are time aligned to tributary Ch. 1 of the two 160-Gbit/s signals. When the two pumps and S1 are present while S2 is absent, Ch. 1 of S1 is depleted and converted to Ch. 1 of S2 with proper adjustment of the pump powers and polarization states, due to the signal depletion and wavelength conversion effects. Similarly, when the two pumps and S2 are turned on while S1 is off, Ch. 1 of S2 is extinguished with its data information copied onto Ch. 1 of S1. In the presence of the two 10-GHz pumps and both 160-Gbit/s signals, Ch. 1 of S2 is exchanged to Ch. 1 of S1; meanwhile, Ch. 1 of S1 is swapped to Ch. 1 of S2, resulting in the implementation of 10-Gbit/s tributary channel exchange between two 160-Gbit/s signals. Moreover, it is convenient to further perform the 10-Gbit/s tributary exchange for all 16 tributary channels of the two 160-Gbit/s signals simply by time shifting the 10-GHz clock pumps to be aligned with the corresponding tributary channel of interest.

**Figure 7.** Measured eye diagrams for the tributary channel exchange (Ch. 1).

Figure 8 depicts the power penalties at a BER of 10⁻⁹ for tributary exchange between two 160-Gbit/s signals for all 16 tributary channels. During the tributary channel exchange, the average power penalty and fluctuation over the 16 tributary channels are around 3.7 and 1.1 dB for S1 (S2 to S1) and 3.9 and 1.1 dB for S2 (S1 to S2).

**Figure 8.** Power penalties of tributary exchange for 16 tributary channels. (a)(c) Signal 1. (b)(d) Signal 2.

#### **3.2. Modulation-format-transparent data exchange using non-degenerate FWM in an HNLF [38-41]**


In addition to cSFG/DFG (χ(2) : χ(2)) in a PPLN waveguide [22-26], the signal depletion and wavelength conversion of non-degenerate FWM (χ(3)) in an HNLF can also enable data exchange [29-41]. As shown in Fig. 9(a), when signal 1 (S1, at λS1) and two continuous-wave (CW) pumps (P1 at λP1, P2 at λP2) are sent through the HNLF with S1 and P1 set symmetrically with respect to the zero-dispersion wavelength (ZDW) of the HNLF, S1 and P1 photons are consumed to produce photons of signal 2 (S2, at λS2) and P2 during the non-degenerate FWM process. Thus the depletion of S1 is expected, with its data information transparently copied onto a newly generated S2. Similarly, as shown in Fig. 9(b), the depletion of S2 accompanied by the generation of S1 can be achieved when S2 and the two pumps are launched into the HNLF. As shown in Fig. 9(c), in the presence of two signals and two pumps at the input of the HNLF, with S1 (S2) and P1 (P2) symmetric relative to the ZDW, S1 (S2) can be extinguished and converted to S2 (S1), resulting in the implementation of data exchange between S1 and S2.

**Figure 9.** Concept and principle of non-degenerate FWM-based signal depletion and data exchange. (a) S1 depletion. (b) S2 depletion. (c) S1 & S2 data exchange.
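The wavelength bookkeeping behind Fig. 9 is plain photon energy conservation. A small sketch with illustrative frequencies (the actual values are not given at this point in the text):

```python
# Photon bookkeeping for the non-degenerate FWM of Fig. 9: one S1 photon and
# one P1 photon are consumed, creating photons at S2 and P2, so
# f_S1 + f_P1 = f_S2 + f_P2. All frequencies below are illustrative.
f_zdw = 193.4                          # zero-dispersion frequency (THz)
f_s1, f_p1 = f_zdw - 0.4, f_zdw + 0.4  # S1 and P1 symmetric about the ZDW
f_p2 = f_zdw + 0.6                     # second pump

f_s2 = f_s1 + f_p1 - f_p2              # where the exchanged data lands
print(f"S2 generated at {f_s2:.1f} THz")
assert abs((f_s2 + f_p2) / 2 - f_zdw) < 1e-9   # S2 and P2 also straddle the ZDW
```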

For non-degenerate FWM-based data exchange, pump phase modulation is adopted in the experiment to suppress the stimulated Brillouin scattering (SBS) effect so that the pump power can be efficiently utilized. Previous works on non-degenerate FWM-based data exchange have been reported for OOK signals [29-37], which are not affected by the phase modulation of the two pumps. Shown in Fig. 10 is an example of data exchange (i.e., wavelength exchange) for 10-Gbit/s non-return-to-zero (NRZ) signals [32]. In order to perform phase-transparent data exchange for DPSK/DQPSK signals, it is desired that non-degenerate FWM-based data exchange has the characteristic of modulation-format transparency.

**Figure 10.** Results of data exchange (wavelength exchange) for 10-Gbit/s NRZ signals [32].

Under the non-depletion approximation, we can derive the analytical solutions for the non-degenerate FWM involving two signals and two pumps, written as [39]


$$\begin{aligned} A_{SA}' &= \left[ A_{SA0}\left(\cos(gz) - \frac{ik\sin(gz)}{2g}\right) + A_{SB0}\,\frac{2i\gamma}{g} A_{P10}^* A_{P20}\sin(gz) \right] e^{i\left[2\gamma(P_{10}+P_{20})+\frac{k}{2}\right]z} \ \text{(a)}\\ A_{SB}' &= \left[ A_{SA0}\,\frac{2i\gamma}{g} A_{P10} A_{P20}^*\sin(gz) + A_{SB0}\left(\cos(gz) + \frac{ik\sin(gz)}{2g}\right) \right] e^{i\left[2\gamma(P_{10}+P_{20})-\frac{k}{2}\right]z} \ \text{(b)} \end{aligned} \tag{5}$$

where $g = \sqrt{4\gamma^2 P_{10}P_{20} + k^2/4}$ and $k = \Delta\beta + \gamma(P_{10} - P_{20})$ are constants related to the pump powers ($P_{10}$, $P_{20}$), the nonlinear coefficient ($\gamma$), and the phase mismatch ($\Delta\beta$). $A_{SA0}$, $A_{SB0}$, $A_{P10}$ and $A_{P20}$ are the complex amplitudes of the input signals (SA, SB) and pumps (P1, P2), containing both amplitude and phase information. $A_{SA}'$ and $A_{SB}'$ are the complex amplitudes of the output signals (SA, SB) after the data exchange. Under the exchange condition of phase matching ($k = 0$) and $gz = (N + 1/2)\pi$ ($N = 0, 1, 2, \ldots$), enabled by proper adjustment of the pump powers, we can further simplify Eqs. (5a)(5b) as follows

$$\begin{aligned} A_{SA}' &= \pm A_{SB0}\,\frac{2i\gamma}{g} A_{P10}^* A_{P20}\, e^{i2\gamma(P_{10}+P_{20})z} \ \text{(a)}\\ A_{SB}' &= \pm A_{SA0}\,\frac{2i\gamma}{g} A_{P10} A_{P20}^*\, e^{i2\gamma(P_{10}+P_{20})z} \ \text{(b)} \end{aligned} \tag{6}$$

Note that Eqs. (6a)(6b) indicate a linear relationship of complex amplitude between the output and input signals ($A_{SA}' \propto A_{SB0}$, $A_{SB}' \propto A_{SA0}$), implying the implementation of phase-transparent optical data exchange. We can further obtain the corresponding phase relationships $\phi_{SA}' = \phi_{SB} + \phi_{P2} - \phi_{P1}$ and $\phi_{SB}' = \phi_{SA} + \phi_{P1} - \phi_{P2}$. Remarkably, the pump phase transfer ($\phi_{P1} - \phi_{P2} \neq 0$) to the exchanged signals does not impact OOK data exchange but could cause severe degradation of DPSK/DQPSK data exchange. Fortunately, according to the deduced phase relationships, it is possible to cancel the pump phase transfer by applying precisely identical phase modulation to the two pumps (i.e., $\phi_{P1} = \phi_{P2}$), which makes it applicable to implement the data exchange of DPSK/DQPSK signals [38-41].

A 1-km piece of HNLF is adopted in the experiment, with a nonlinear coefficient of 9.1 W⁻¹·km⁻¹, a ZDW of ~1552 nm, and a fiber loss of 0.45 dB/km. To suppress SBS, identical phase modulation is applied to the two pumps using a single phase modulator (PM) driven by a 10-Gbit/s pseudo-random binary sequence (PRBS). According to Eqs. (6a)(6b), the precisely identical phase modulation of the two pumps is canceled in the output signals after data exchange.
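Putting Eqs. (5)(6) and the quoted fiber parameters together, the sketch below (illustrative Python; it assumes the phase-matched case k = 0, equal pump powers, and the lowest-order exchange condition gz = π/2, none of which are stated as the exact experimental settings) estimates the required pump power and checks both the amplitude swap and the cancellation of identical pump phase modulation:

```python
import numpy as np

# Values quoted in the text: gamma = 9.1 /(W km), L = 1 km.
gamma, z = 9.1, 1.0

# Phase-matched case (k = 0): g = 2*gamma*sqrt(P10*P20). With equal pumps,
# the exchange condition g*z = pi/2 fixes the required pump power.
P = np.pi / (4 * gamma * z)          # ~86 mW per pump
g = 2 * gamma * np.sqrt(P * P)

def fwm_exchange(a_sa0, a_sb0, phi_p1, phi_p2):
    """Eqs. (5a)(5b) with k = 0; returns the output amplitudes A_SA', A_SB'."""
    a_p1 = np.sqrt(P) * np.exp(1j * phi_p1)
    a_p2 = np.sqrt(P) * np.exp(1j * phi_p2)
    common = np.exp(2j * gamma * (P + P) * z)    # e^{i 2 gamma (P10+P20) z}
    a_sa = (a_sa0 * np.cos(g * z)
            + a_sb0 * (2j * gamma / g) * np.conj(a_p1) * a_p2 * np.sin(g * z)) * common
    a_sb = (a_sa0 * (2j * gamma / g) * a_p1 * np.conj(a_p2) * np.sin(g * z)
            + a_sb0 * np.cos(g * z)) * common
    return a_sa, a_sb

a_sa0, a_sb0 = 1.0 * np.exp(1j * 0.4), 0.8 * np.exp(1j * 1.1)
a_sa, a_sb = fwm_exchange(a_sa0, a_sb0, 0.0, 0.0)
assert np.isclose(abs(a_sa), abs(a_sb0)) and np.isclose(abs(a_sb), abs(a_sa0))

# Identical phase modulation on both pumps (phi_p1 == phi_p2) cancels out,
# which is why common-PM SBS suppression leaves DPSK/DQPSK data intact.
a_sa_pm, _ = fwm_exchange(a_sa0, a_sb0, 0.9, 0.9)
assert np.isclose(a_sa_pm, a_sa)
```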

We demonstrate the phase-transparent data exchange between two 100-Gbit/s 2⁷−1 PRBS return-to-zero DQPSK (RZ-DQPSK) signals (S1: signal 1, S2: signal 2) [40, 41]. Figure 11(a) displays the measured temporal waveforms of the demodulated in-phase (Ch. I) and quadrature (Ch. Q) components for the 100-Gbit/s RZ-DQPSK data exchange. It can be clearly observed that the data information carried by the two 100-Gbit/s RZ-DQPSK signals is successfully swapped after the non-degenerate FWM-based data exchange. Figures 11(b) and (c) plot BER curves for the 100-Gbit/s RZ-DQPSK data exchange. A power penalty of less than 1.2 dB at a BER of 10⁻⁹ is obtained for 100-Gbit/s RZ-DQPSK wavelength conversion (WC) with only one signal (S1 or S2) present. A power penalty of less than 5 dB at a BER of 10⁻⁹ is observed for the 100-Gbit/s RZ-DQPSK data exchange. The extra power penalty of data exchange compared to wavelength conversion can be ascribed to the beating effect between the newly converted signal and the original residual signal.

**Figure 11.** Measured (a) demodulated waveforms and (b)(c) BER curves for 100-Gbit/s RZ-DQPSK data exchange.

#### **3.3. Multi-channel data exchange using bidirectional degenerate FWM in an HNLF [42-45]**

The aforementioned signal depletion and wavelength conversion based schemes with two pumps enable two-channel data exchange [22-26, 29-41]. However, their extension to simultaneous multi-channel data exchange might be limited. A laudable goal would be to explore data exchange between multi-channel signals.

Figure 12 illustrates the concept and principle of multi-channel data exchange [42, 43]. Degenerate FWM with a single CW pump is utilized. Four-channel DQPSK signals (S1-S4) are placed symmetrically with respect to the CW pump. Simultaneous data exchange between S1 and S4 as well as between S2 and S3 is expected. In general, such an exchange function is not achievable with unidirectional degenerate FWM in a single HNLF, since the newly converted signals cannot be separated from the original signals. A potential solution is to exploit bidirectional degenerate FWM in a single HNLF assisted by optical filtering. As shown in Fig. 12, for the input four-channel signals (S1-S4), the filtered S1, S2 and CW pump are sent into the HNLF from the left side, yielding S4 and S3 via degenerate FWM. The newly generated S4 and S3 are selected at the right side of the HNLF while the original S1, S2 and CW pump are blocked. Meanwhile, the filtered S3, S4 and CW pump are fed into the HNLF from the right side, producing S2 and S1 by degenerate FWM. The newly converted S2 and S1 are selected at the left side of the HNLF while the original S3, S4 and CW pump are removed. As a consequence, simultaneous four-channel data exchange (S1&S4, S2&S3) can be achieved using bidirectional FWM in a single HNLF assisted by optical filtering. The combined S1-S4 from both sides of the HNLF are the output four-channel signals after data exchange. Note that the in-phase (Ch. I) and quadrature (Ch. Q) components of the DQPSK signals are swapped after data exchange due to the phase-conjugation characteristic of degenerate FWM.

**Figure 12.** Concept and principle of simultaneous multi-channel DQPSK data exchange.
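The S1↔S4 and S2↔S3 pairing is fixed by the frequency-mirror property of degenerate FWM. A short sketch with illustrative channel frequencies (the actual grid values are not given here):

```python
# In degenerate FWM each signal is mirrored about the single CW pump:
# f_idler = 2*f_pump - f_signal. With channels placed symmetrically around
# the pump, S1 maps onto S4 and S2 onto S3 (and vice versa), which is the
# pairing exploited by the bidirectional scheme. Values are illustrative.
f_pump, spacing = 193.4, 0.1   # THz
channels = {"S1": f_pump - 2 * spacing, "S2": f_pump - spacing,
            "S3": f_pump + spacing, "S4": f_pump + 2 * spacing}

for name, f in channels.items():
    f_idler = 2 * f_pump - f
    partner = min(channels, key=lambda s: abs(channels[s] - f_idler))
    print(f"{name} -> {partner}")   # S1->S4, S2->S3, S3->S2, S4->S1
```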

The proposed simultaneous multi-channel data exchange can be incorporated in a reconfigurable network switching element to enhance the efficiency and flexibility of optical networks. We construct a reconfigurable Tbit/s network switching element using double-pass liquid crystal on silicon (LCoS) technology accompanied by bidirectional degenerate FWM in a single HNLF. We demonstrate an LCoS+HNLF-based 2.3-Tbit/s multi-functional grooming switch which performs simultaneous selective add/drop, switchable data exchange, and power equalization for 23-channel 100-Gbit/s RZ-DQPSK signals [44, 45].


ITU-grid-compatible 23-channel (from S1: 1531.12 nm to S23: 1566.31 nm) 100-Gbit/s RZ-DQPSK signals are employed in the experiment. Figure 13 shows the measured spectrum of the input unequalized 23-channel 100-Gbit/s RZ-DQPSK signals, with a power fluctuation of ~9.1 dB. Shown in the insets are typical balanced eyes for the in-phase (Ch. I) and quadrature (Ch. Q) components.

**Figure 13.** Measured spectrum and balanced eyes for input unequalized 23-channel 100-Gbit/s RZ-DQPSK signals.

Shown in Fig. 14 are the measured spectrum and balanced eyes after the grooming switch, with power equalization (<1 dB) for all 23 channels (input unequalization: ~9.1 dB), two-channel add/drop for S6 and S7, and simultaneous six-channel data exchange (S10, S11, S12, S21, S22, S23). The inset of Fig. 14 depicts the spectrum of the dropped S6 and S7. The BER performance is plotted in Fig. 15, and power penalties of less than 5 dB for the six-channel data exchange are observed at a BER of 10⁻⁹.

**Figure 13** 

selected at the right side of HNLF while the original S1, S2 and CW pump are blocked. Meanwhile, the filtered S3, S4 and CW pump are fed into HNLF from the right side, produc‐ ing S2 and S1 by degenerate FWM. The newly converted S2 and S1 are selected at the left side of HNLF while the original S3, S4 and CW pump are removed. As a consequence, si‐ multaneous four-channel data exchange (S1&S4, S2&S3) can be achieved using bidirectional FWM in a single HNLF assisted by optical filtering. The combined S1-S4 from both sides of HNLF are the output four-channel signals after data exchange. Note that the in-phase (Ch. I) and quadrature (Ch. Q) components of DQPSK signals are swapped after data exchange due

**HNLF** 

S3 S4

S3 S4

Pump

**S2 S2** 

**FWM** 

**S3 S3** 

*<sup>P</sup>*

**S4 S4** 

The proposed simultaneous multi-channel data exchange can be incorporated in a reconfig‐ urable network switching element to enhance the efficiency and flexibility of optical net‐ works. We construct a reconfigurable Tbit/s network switching element using double-pass liquid crystal on silicon (LCoS) technology accompanied by bidirectional degenerate FWM in a single HNLF. We demonstrate the LCoS+HNLF-based 2.3-Tbit/s multi-functional grooming switch which performs simultaneous selective add/drop, switchable data ex‐ change, and power equalization, for 23-channel 100-Gbit/s RZ-DQPSK signals [44, 45].

ITU-grid-compatible 23-channel (from S1: 1531.12 nm to S23: 1566.31 nm) 100-Gbit/s RZ-DQPSK signals are employed in the experiment. Figure 13 shows the measured spectrum of the input unequalized 23-channel 100-Gbit/s RZ-DQPSK signals with a power fluctuation of ~9.1 dB. Shown in the insets are typical balanced eyes for the in-phase (Ch. I) and quadra‐

Shown in Fig. 14 is the measured spectrum and balanced eyes after grooming switch with power equalization (<1 dB) for all 23 channels (input unequalization: ~9.1 dB), two-channel add/drop for S6 and S7, and simultaneous six-channel data exchange (S10, S11, S12, S21, S22, S23). The inset of Fig. 14 depicts the spectrum of dropped S6 and S7. The BER performance is plotted in Fig. 15 and power penalties less than 5 dB for six-channel data exchange are


**Figure 13.** Measured spectrum and balanced eyes for input unequalized 23-channel 100-Gbit/s RZ-DQPSK signals.

**Figure 14.** Measured spectrum and balanced eyes after multi-functional grooming switch (S6, S7: add/drop; S10, S11, S12, S21, S22, S23: data exchange; S1-S23: power equalization).

**Figure 15.** BER curves for simultaneous six-channel data exchange (S10, S11, S12, S21, S22, S23).

#### **3.4. Data exchange between two orthogonal polarizations using Kerr-induced nonlinear polarization rotation in an HNLF [48, 49]**

In addition to the data exchange in the wavelength and time domains [22-45], it is also possible to perform data exchange between two orthogonal polarizations in the time and polarization domains [48-50]. We experimentally demonstrate the orthogonal tributary channel exchange between two pol-muxed DPSK OTDM data streams by using the Kerr effect-induced nonlinear birefringence in an HNLF [48, 49].

Figure 16 illustrates the concept and principle of the Kerr effect-based orthogonal tributary channel exchange of a pol-muxed DPSK OTDM signal. The strong subrate clock pump is 45° linearly polarized with respect to the two orthogonal polarizations of the pol-muxed DPSK OTDM signal. With proper pump power control, the pump-induced nonlinear birefringence by the Kerr effect can bring the selected tributary channel (aligned with the subrate clock pump) to a 90° polarization rotation for both of the two orthogonal polarizations of the pol-muxed signal, leading to the orthogonal tributary channel exchange when the pump is present. Other unselected orthogonal tributary channels, with the pump absent, will not experience the nonlinear polarization rotation and hence will be untouched. In addition, simply by shifting the subrate clock pump to be aligned with the tributary channel of interest, it is possible to implement orthogonal tributary channel exchange for all tributary channels of the pol-muxed DPSK OTDM signal.
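To get a rough feel for the required pump power, the sketch below uses the textbook XPM model in which a co-polarized pump contributes 2γPL of nonlinear phase and an orthogonally polarized one contributes (2/3)γPL, so the pump-induced birefringence is (4/3)γPL; the fiber parameters are assumptions, not the values of Refs. [48, 49].

```python
import math

# Back-of-envelope sketch (assumed fiber parameters). A pump at 45 deg induces
# XPM phases of 2*gamma*P*L and (2/3)*gamma*P*L on the signal components
# parallel and orthogonal to it, i.e. a nonlinear birefringence of
# (4/3)*gamma*P*L. A pi phase difference acts like a half-wave plate and
# rotates the pumped tributary by 90 degrees.
gamma = 10e-3    # nonlinear coefficient [1/(W*m)], i.e. 10 /(W*km) -- assumption
length = 500.0   # HNLF length [m] -- assumption

peak_pump_power = 3 * math.pi / (4 * gamma * length)
print(f"peak pump power for 90 deg rotation: {peak_pump_power:.2f} W")  # ~0.47 W
```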


**Figure 16.** Concept and principle of Kerr effect-based orthogonal tributary channel exchange of a pol-muxed DPSK OTDM signal.

Figure 17 displays the eye diagrams measured by an optical sampling scope for the typical orthogonal tributary channel (Ch. 1) exchange of a 160-Gbit/s pol-muxed DPSK OTDM signal. As the 10-GHz clock pump is time aligned to tributary channel 1 (Ch. 1) of the X- and Y-polarized DPSK OTDM signal, in the absence of the Y-polarization, as shown in Fig. 17(b), Ch. 1 of the X-polarization is blocked by an X-polarizer after the HNLF due to the 90° rotation from the X- to the Y-polarization. When the Y-polarization is present but the X-polarization is absent, Ch. 1 of the Y-polarization is inserted into the X-polarization through the 90° rotation from the Y- to the X-polarization, as shown in Fig. 17(c). In the presence of both the X- and Y-polarizations, tributary Ch. 1 of the Y-polarization is changed to the X-polarization, as shown in Fig. 17(d). Meanwhile, the original tributary Ch. 1 of the X-polarization is also changed to the Y-polarization, as shown in Fig. 17(h), resulting in the orthogonal tributary channel exchange of a pol-muxed DPSK OTDM signal.
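The polarizer logic behind Fig. 17(b) and (c) can be mimicked with two-component Jones vectors. In the idealized sketch below the pump-gated rotation is modeled as a perfect 90° rotation, a simplification of the actual Kerr-induced rotation.

```python
import numpy as np

# Idealized Jones-calculus sketch: in the pumped time slot the 90-degree
# rotation swaps the X and Y field components; an X-polarizer then selects
# what exits on the X output port.
rot90 = np.array([[0.0, -1.0], [1.0, 0.0]])    # ideal 90-degree rotation
x_polarizer = np.array([[1.0, 0.0], [0.0, 0.0]])

x_ch1 = np.array([1.0, 0.0])   # Ch. 1 of the X-polarized stream
y_ch1 = np.array([0.0, 1.0])   # Ch. 1 of the Y-polarized stream

# Fig. 17(b): X present, Y absent -> X Ch. 1 is rotated to Y and blocked
print(x_polarizer @ rot90 @ x_ch1)   # [0. 0.]: blocked
# Fig. 17(c): Y present, X absent -> Y Ch. 1 is rotated onto X and passes
print(x_polarizer @ rot90 @ y_ch1)   # [-1. 0.]: inserted on X (up to sign)
```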


**Figure 17.** Eye diagrams of orthogonal tributary channel (Ch. 1) exchange of a 160-Gbit/s pol-muxed DPSK OTDM signal.

Figure 18 plots the power penalties of the orthogonal tributary exchange for 8 tributary channels. Less than 4-dB power penalty at a BER of 10<sup>−9</sup> is obtained for all 8 tributary channels with a fluctuation of <1.5 dB.

**Figure 18.** Power penalties of orthogonal tributary exchange for 8 tributary channels.

#### **3.5. Time-slot exchange using conversion-dispersion-based tunable delays [46, 47]**

The demonstrated data exchange of groups of bits or tributary channels manipulates data in multiple degrees of freedom, such as time- (groups of bits) and channel-selective data exchange between WDM channels [23, 24], tributary channel exchange between two WDM high-speed OTDM signals [25, 26], and orthogonal tributary channel exchange of a pol-muxed OTDM signal [48-50]. Another important traffic grooming function, known as time-slot exchange or time-slot interchange, is to manipulate data only in the time domain to enable contention resolution and increase throughput efficiency in time-based networks. Time-slot exchange, occurring on the bit or packet level, can afford the network enhanced flexibility. For packet-switched networks, exchanging full data packets in the time domain requires optical delays that are tunable.


Figure 19 shows the concept and principle of conversion-dispersion-based time-slot exchange of two separate packets in the time domain [46]. Three clocked pumps (*λ*<sub>P1</sub>, *λ*<sub>P2</sub>, *λ*<sub>P3</sub>) are fed into an HNLF, along with a packetized input signal (*λ*<sub>S</sub>) located near the zero-dispersion wavelength (ZDW) of the HNLF. Degenerate FWM between the clocked pumps and signal generates replicas of the input signal at new converted wavelengths (*λ*<sub>1</sub>, *λ*<sub>2</sub>, *λ*<sub>3</sub>) which contain only the information of the input signal at times when the clocked pumps are on. The three pumps are clocked to convert: (i) only packet A to *λ*<sub>1</sub>, (ii) all information but packets A and B to *λ*<sub>2</sub>, and (iii) only packet B to *λ*<sub>3</sub>. The three converted signals (*λ*<sub>1</sub>, *λ*<sub>2</sub>, *λ*<sub>3</sub>) then pass through a dispersion module, such as dispersion compensation fiber (DCF), and experience a wavelength-dependent delay via inter-channel chromatic dispersion. Due to the conversion-dispersion-based tunable delays, packet A is advanced, while packet B is retarded, relative to the reference (all information but packets A and B), resulting in the swapping of packets A and B in the time domain. After the delays, all three converted signals are converted back to the original signal wavelength using a PPLN waveguide followed by the compensation for intra-channel chromatic dispersion.
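The delay bookkeeping behind this scheme is simply Δτ = D·L·Δλ. The sketch below illustrates it with assumed numbers; the dispersion, length and wavelengths are placeholders, not the experimental values of Refs. [46, 47].

```python
# Conversion-dispersion delay budget (all numbers assumed for illustration).
D = -100.0           # DCF dispersion [ps/(nm*km)] -- assumption
L = 5.0              # DCF length [km] -- assumption
lambda_ref = 1550.0  # reference copy (all data but A and B) [nm] -- assumption
lambda_A = 1546.0    # packet A converted to this wavelength [nm] -- assumption
lambda_B = 1554.0    # packet B converted to this wavelength [nm] -- assumption

def delay_vs_ref(lam_nm):
    # inter-channel chromatic dispersion: group delay relative to the reference
    return D * L * (lam_nm - lambda_ref)   # [ps]

# Opposite-sign delays slide A and B past each other; after conversion back to
# the signal wavelength, the two packets occupy swapped time slots.
print(delay_vs_ref(lambda_A), delay_vs_ref(lambda_B))   # 2000.0 -2000.0 [ps]
```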

**Figure 19.** Concept and principle of time-slot exchange of two separate packets using conversion-dispersion-based tunable delays.

Shown in Figs. 20 and 21 are experimental results of time-slot exchange of 40-Gbit/s optical data packets using conversion-dispersion-based tunable delays. Separate 182-bit packets are converted to separate wavelengths, delayed relative to one another using conversion-dispersion-based tunable delays, and then recombined to achieve new packets with time slots exchanged.

Similarly, time-slot exchange between odd and even data packets is also achievable using conversion-dispersion-based tunable delays [47]. As illustrated in Fig. 22, odd and even data packets are extracted from an input signal via wavelength multicasting with clocked pumps, delayed relative to one another in a dispersion module, and then multiplexed back together using wavelength conversion in a PPLN waveguide followed by dispersion compensation. Shown in Fig. 23 are experimental results of time-slot exchange of 40-Gbit/s odd and even packets. The conversion-dispersion-based tunable delays enable time-slot exchange of variable-length optical packets (182 and 288 bits/packet).


**Figure 20.** Optical spectra (left) and temporal waveforms (right) of converted signals after multicasting using clocked pumps in HNLF.

**Figure 21.** Temporal waveforms of optical signals following delay via dispersion, conversion back to the original signal wavelength and dispersion compensation (left) and optical spectra after PPLN (right).

#### **3.6. Data exchange between "twisted" light beams carrying Orbital Angular Momentum (OAM) [51, 52]**

In optical communications, beyond well-known existing degrees of freedom such as wavelength, time and polarization, other degrees of freedom are being explored to break the "capacity crunch". For example, OAM, which is related to the helical phase front of "twisted" light beams [59-61], can be considered an additional degree of freedom [62, 63], where the multiplexing of data-carrying OAM beams provides yet another dimension in the ever-continuing effort to increase the capacity and spectral efficiency of communication links [63]. When employing OAM beams to carry data information, a desirable function for flexible data processing would be the data exchange between "twisted" OAM beams.

**Figure 22.** Concept and principle of time-slot exchange of odd and even packets.

**Figure 23.** Temporal waveforms of time-slot exchange of 40-Gbit/s odd and even packets with two variable packet lengths. (a)(c) 182 bits/packet. (b)(d) 288 bits/packet.


**Figure 24.** Concept and principle of data exchange between "twisted" OAM beams.


Figure 24 shows the concept and principle of data exchange between OAM beams [51, 52]. Two superposed OAM beams (OAM<sub>ℓ1</sub>, OAM<sub>ℓ2</sub>), each carrying different data information (Signal A, Signal B), shine at a reflective-type spatial light modulator (SLM) loaded with a spiral phase mask with a charge of ℓ<sub>R</sub> = −(ℓ<sub>1</sub> + ℓ<sub>2</sub>). After reflecting off the SLM, this phase mask adds an azimuthal phase term exp(*i*ℓ<sub>R</sub>*θ*) to the two OAM beams and converts them into OAM<sub>−ℓ2</sub> and OAM<sub>−ℓ1</sub>, which are further transformed into OAM<sub>ℓ2</sub> and OAM<sub>ℓ1</sub> due to reflection off the SLM, which flips the charge sign. As a result, data exchange between the two OAM beams is implemented. For the input of two OAM beams with varied charges, reconfigurable data exchange is available by updating the phase mask loaded into the reflective-type SLM. Shown in Fig. 24 is an example of data exchange between DQPSK-carrying "twisted" OAM beams (OAM+8, OAM+6).
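The charge arithmetic above can be verified in a few lines; the function below is a hypothetical helper, not code from Refs. [51, 52].

```python
# Charge bookkeeping for the SLM-based OAM exchange described above: the mask
# charge l_R = -(l1 + l2) plus the sign flip on reflection swap the two
# beams' topological charges.
def exchanged_charges(l1, l2):
    l_mask = -(l1 + l2)                       # spiral phase mask on the SLM
    after_mask = (l1 + l_mask, l2 + l_mask)   # mask adds its charge: (-l2, -l1)
    return tuple(-l for l in after_mask)      # reflection flips the sign: (l2, l1)

print(exchanged_charges(8, 6))    # (6, 8)  -- the Fig. 24 example
print(exchanged_charges(10, 6))   # (6, 10) -- reconfigured by reloading the mask
```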

The measured interferograms (i.e., interference between OAM beams and a reference Gaussian beam), as shown in Fig. 25(a) and (b), indicate that the charges of the two OAM beams before exchange are +8 and +6. After exchange, the measured interferograms, as shown in Fig. 25(c) and (d), verify that the charges of the two OAM beams after exchange become +6 and +8 (see Ref. 52 for details). Figure 25(e)-(h) show measured interferograms of reconfigurable data exchange between another two OAM beams (OAM+10 and OAM+6) by updating the spiral phase mask loaded into the SLM.

We measure temporal waveforms and balanced eyes of the demodulated in-phase (Ch. I) and quadrature (Ch. Q) components of 100-Gbit/s RZ-DQPSK signals. As shown in Fig. 26, the observed temporal waveforms confirm the successful implementation of data exchange between two OAM beams (OAM+10 and OAM+6). Shown in Fig. 27 are measured BER curves for 100-Gbit/s RZ-DQPSK data exchange between OAM+10 and OAM+6 beams, with a power penalty of <1.9 dB at a BER of 10<sup>−9</sup>.



**Figure 25.** Measured interferograms. (a) OAM+8 and (b) OAM+6 beams become (c) OAM+6 and (d) OAM+8 ones after exchange. (e) OAM+10 and (f) OAM+6 beams become (g) OAM+6 and (h) OAM+10 ones after exchange.

**Figure 26.** Measured waveforms and balanced eyes of demodulated in-phase (Ch. I) and quadrature (Ch. Q) compo‐ nents of 100-Gbit/s RZ-DQPSK signals for data exchange between OAM+10 and OAM+6 beams. Bef. Ex., before ex‐ change; Aft. Ex., after exchange.

**Figure 27.** Measured BER curves for 100-Gbit/s RZ-DQPSK data exchange between OAM+10 and OAM+6 beams.

#### **4. Discussions**

**(e) (f)** 

**(d)** 

**(c) (g) (h)** 

exchange. (e) OAM+10 and (f) OAM+6 beams become (g) OAM+6 and (h) OAM+10 ones after exchange.

**Figure 25.** Measured interferograms. (a) OAM+8 and (b) OAM+6 beams become (c) OAM+6 and (d) OAM+8 ones after

**Ch. I** 

**Ch. Q** 

**Ch. I** 

**Ch. Q** 

**Figure 26.** Measured waveforms and balanced eyes of demodulated in-phase (Ch. I) and quadrature (Ch. Q) compo‐ nents of 100-Gbit/s RZ-DQPSK signals for data exchange between OAM+10 and OAM+6 beams. Bef. Ex., before ex‐

**Figure 27.** Measured BER curves for 100-Gbit/s RZ-DQPSK data exchange between OAM+10 and OAM+6 beams.

**(a) (b)** 

72 Design and Architectures for Digital Signal Processing

**Ch. I** 

**Ch. Q** 

**Ch. I** 

**Ch. Q** 

change; Aft. Ex., after exchange.

The demonstrated miscellaneous data exchange functionalities provide great potential for facilitating flexible networks. With future improvement, several aspects could be considered as follows.

**1.** In practical applications, some supplementary functionalities might be required to construct complete and independent data exchange modules. Taking tributary channel exchange as an example (Figs. 6-8), subrate clock pumps should be synchronized with high-speed OTDM signals. Note that the incoming signals and locally generated pumps are usually independent of each other. Hence, a supplementary functionality of clock recovery is required in real situations to obtain synchronized subrate clock pumps from incoming signals. Fortunately, various optical clock recovery methods have been developed [64]. In particular, recent promising demonstrations have shown successful synchronization and sub-clock recovery for ultra-high-speed OTDM signals up to 640 Gbit/s [65, 66]. As a consequence, it is possible to develop a complete and independent data exchange module by incorporating a synchronization and clock recovery unit.

**2.** Beyond the reported functionalities, data exchange can be further extended in terms of degrees of freedom, modulation formats, and granularities. For example, some additional degrees of freedom have recently attracted increasing interest in high-speed optical fiber communications to break the "capacity crunch", such as space [67, 68] and mode [69, 70]. A valuable goal would be to achieve data exchange in these degrees of freedom. Also, some high-level modulation formats have been used in fiber transmission, such as 16-ary quadrature amplitude modulation (16-QAM), 32-QAM, etc. [67, 68, 71]. According to the demonstrated characteristic of modulation-format transparency, most of the presented data exchange schemes should be, in principle, available for these advanced modulation formats. However, high-level modulation formats tolerate less performance degradation, and therefore accurate manipulation of amplitude and phase would be expected. Additionally, data exchange with fine granularity in the time domain requires accurate control of time delay, which could be achievable with the fine tuning of conversion-dispersion-based optical delays [7].

**3.** In addition to PPLN and HNLFs, there would be some other alternative candidates applicable for data exchange, including the use of third-order nonlinearities in semiconductor optical amplifiers (SOAs) [3], As<sub>2</sub>S<sub>3</sub> waveguides [19], and silicon waveguides [18].


#### **5. Conclusion**

In this chapter, we have reviewed recent research efforts towards robust data exchange. Various kinds of optical nonlinearities, i.e., cSFG/DFG in a PPLN waveguide, non-degenerate FWM in an HNLF, bidirectional degenerate FWM in an HNLF, Kerr-induced nonlinear polarization rotation in an HNLF, and conversion-dispersion-based tunable delays, together with simple linear optics, are exploited to enable robust data exchange in different degrees of freedom (wavelength, time, polarization, phase front), for different modulation formats (OOK, DPSK, DQPSK, pol-muxed), and at different granularities (entire data, groups of bits, tributary channels).


First, analytical solutions to the single-PPLN-based data exchange are derived, showing the exchange condition. 40-Gbit/s time- (groups of bits) and channel-selective data exchange between four WDM channels is implemented. 10-Gbit/s tributary channel exchange between two WDM 160-Gbit/s OTDM signals is demonstrated. Second, analytical solutions to the non-degenerate FWM-based data exchange are derived, indicating the exchange condition and implying the characteristic of modulation-format transparency. Phase-transparent data exchange (entire data) of 100-Gbit/s RZ-DQPSK signals is demonstrated. Third, a simple approach is proposed to perform simultaneous multi-channel data exchange using bidirectional degenerate FWM in an HNLF. A reconfigurable Tbit/s network switching element is constructed using double-pass LCoS technology, together with bidirectional degenerate FWM in a single HNLF. An LCoS+HNLF-based 2.3-Tbit/s (23×100-Gbit/s RZ-DQPSK) multi-functional grooming switch (e.g., simultaneous add/drop, six-channel data exchange, and power equalization) is implemented. Fourth, 10-Gbit/s tributary channel exchange between two orthogonal polarizations of a 160-Gbit/s pol-muxed DPSK OTDM signal is demonstrated based on the Kerr-induced nonlinear polarization rotation. Fifth, time-slot exchange of 40-Gbit/s optical data packets is demonstrated using conversion-dispersion-based tunable delays. Finally, reconfigurable 100-Gbit/s RZ-DQPSK data exchange between "twisted" OAM beams is demonstrated using simple linear optics.

The obtained theoretical and experimental results of data exchange in the wavelength, time, polarization and phase front domains show that robust data exchange for different modulation formats and at different granularities could potentially enhance the efficiency and flexibility of optical networks.

#### **Acknowledgements**

We acknowledge Jeng-yuan Yang, Xiaoxia Wu, Scott R. Nuccio, Omer F. Yilmaz, Zahra Bakhtiari, Hao Huang, Xue Wang, Nisar Ahmed, Irfan Fazal, Yan Yan, Yang Yue, Lin Zhang, Yinying Xiao-Li, Bishara Shamee, Yongxiong Ren, Amanda Bozovich, Robert W. Hellwarth, Moshe Tur, Kevin Birnbaum, John Choi, Baris Erkmen and Samuel Dolinar for the helpful discussions, and the generous support of the National Natural Science Foundation of China (NSFC) under grants 61077051, 11274131 and 61222502, the Program for New Century Excellent Talents in University (NCET-11-0182), the Defense Advanced Research Projects Agency (DARPA) under contract FA8650-08-1-7820, and DARPA under the InPho (Information in a Photon) program.

#### **Author details**


Jian Wang<sup>1</sup>\* and Alan E. Willner<sup>2</sup>

\*Address all correspondence to: jwang@hust.edu.cn

1 Wuhan National Laboratory for Optoelectronics, College of Optoelectronic Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, China

2 Department of Electrical Engineering, University of Southern California, Los Angeles, California, USA

#### **References**


[1] Saruwatari M. All-optical signal processing for terabit/second optical transmission. IEEE J. Sel. Top. Quantum Electron. 2000; 6(6): 1363-1374.

[2] Yoo SJB. Wavelength conversion technologies for WDM network applications. J. Lightwave Technol. 1996; 14(6): 955-966.

[3] Chan K, Chan CK, Chen LK, Tong F. Demonstration of 20-Gb/s all-optical XOR gate by four-wave mixing in semiconductor optical amplifier with RZ-DPSK modulated inputs. IEEE Photon. Technol. Lett. 2004; 16(3): 897-899.

[4] Wang J, Sun JQ, Zhang XL, Huang DX, Fejer MM. All-optical format conversions using periodically poled lithium niobate waveguides. IEEE J. Quantum Electron. 2009; 45(2): 195-205.

[5] Okawachi Y, Sharping JE, Xu C, Gaeta AL. Large tunable optical delays via self-phase modulation and dispersion. Opt. Express 2006; 14(25): 12022-12027.

[6] Wang Y, Yu CY, Yan LS, Willner AE, Roussev R, Langrock C, Fejer MM, Sharping JE, Gaeta AL. 44-ns continuously tunable dispersionless optical delay element using a PPLN waveguide with two-pump configuration, DCF, and a dispersion compensator. IEEE Photon. Technol. Lett. 2007; 19(11): 861-863.

[7] Nuccio SR, Yilmaz OF, Wu X, Willner AE. Fine tuning of conversion/dispersion based optical delays with a 1 pm tunable laser using cascaded acousto-optic mixing. Opt. Lett. 2010; 35(4): 523-525.

[8] Dai Y, Okawachi Y, Turner-Foster AC, Lipson M, Gaeta AL, Xu C. Ultralong continuously tunable parametric delays via a cascading discrete stage. Opt. Express 2010; 18(1): 333-339.

[9] Salem R, Foster MA, Turner AC, Geraghty DF, Lipson M, Gaeta AL. Signal regeneration using low-power four-wave mixing on silicon chip. Nature Photonics 2008; 2(1): 35-38.

[10] Kataoka N, Sone K, Wada N, Aoki Y, Kinoshita S, Miyata H, Miyazaki T, Onaka H, Kitayama K. Field trial of 640-Gbit/s-throughput, granularity-flexible optical network using packet-selective ROADM prototype. J. Lightwave Technol. 2009; 27(7): 825-832.

[11] Wang J, Fu HY, Geng DY, Willner AE. All-optical wavelength-/time-selective switching/dropping/swapping for 100-GHz-spaced WDM signals using a periodically poled lithium niobate waveguide. ECOC2012, paper Th.1.A.5, 2012.

[12] Wu XX, Bogoni A, Yilmaz OF, Nuccio SR, Wang J, Willner AE. Eightfold 40-320 Gbit/s phase-coherent multiplexing and 320-40 Gbit/s demultiplexing using highly nonlinear fibers. Opt. Lett. 2010; 35(11): 1896-1898.

[13] Brès CS, Boggio JMC, Alic N, Radic S. 1-to-40 10-Gb/s channel multicasting and amplification in wideband parametric amplifier. IEEE Photon. Technol. Lett. 2008; 20(16): 1417-1419.

[14] Biberman A, Lee BG, Turner-Foster AC, Foster MA, Lipson M, Gaeta AL, Bergman K. Wavelength multicasting in silicon photonic nanowires. Opt. Express 2010; 18(17): 18047-18055.

[15] Hamza HS, Deogun JS. Wavelength-exchanging cross connects (WEX)—a new class of photonic cross-connect architectures. J. Lightwave Technol. 2006; 24(3): 1101-1111.

[16] Winzer PJ, Essiambre RJ. Advanced optical modulation formats. Proc. IEEE 2006; 94(5): 952-985.

[17] Winzer PJ, Essiambre RJ. Advanced modulation formats for high-capacity optical transport networks. J. Lightwave Technol. 2006; 24(23): 4711-4728.

[18] Oxenløwe LK, Ji H, Galili M, Pu MH, Hu H, Mulvad HCH, Yvind K, Hvam JM, Clausen AT, Jeppesen P. Silicon photonics for signal processing of Tbit/s serial data signals. IEEE J. Sel. Top. Quantum Electron. 2012; 18(2): 996-1005.

[19] Pelusi MD, Ta'eed VG, Fu LB, Mägi E, Lamont MRE, Madden S, Choi DY, Bulla DAP, Luther-Davies B, Eggleton BJ. Applications of highly-nonlinear chalcogenide glass devices tailored for high-speed all-optical signal processing. IEEE J. Sel. Top. Quantum Electron. 2008; 14(3): 529-539.

[20] Chowdhury A, Hagness SC, McCaughan L. Simultaneous optical wavelength interchange with a two-dimensional second-order nonlinear photonic crystal. Opt. Lett. 2000; 25(11): 832-834.

[21] Chowdhury A, Staus C, Boland BF, Kuech TF, McCaughan L. Experimental demonstration of 1535–1555 nm simultaneous optical wavelength interchange with a nonlinear photonic crystal. Opt. Lett. 2001; 26(17): 1353-1355.

[22] Wang J, Sun QZ. Theoretical analysis of power swapping in quadratic nonlinear medium. Appl. Phys. Lett. 2010; 96(8): 081108.

[23] Wang J, Nuccio SR, Wu XX, Yilmaz OF, Zhang L, Fazal I, Yang JY, Yue Y, Willner AE. 40-Gbit/s optical data exchange between WDM channels using second-order nonlinearities in PPLN waveguides. NLO 2009, paper PDPA1, 2009.

[24] Wang J, Nuccio SR, Wu XX, Yilmaz OF, Zhang L, Fazal I, Yang JY, Yue Y, Willner AE. 40 Gbit/s optical data exchange between wavelength-division-multiplexed channels using a periodically poled lithium niobate waveguide. Opt. Lett. 2010; 35(7): 1067-1069.

[25] Wang J, Bakhtiari Z, Xiao-Li Y, Yilmaz OF, Nuccio SR, Wu XX, Huang H, Yang JY, Yue Y, Fazal I, Hellwarth R, Willner AE. Experimental demonstration of data traffic grooming of a single 10-Gbit/s TDM tributary channel between two 160-Gbit/s WDM channels. OFC 2010, paper OWF1, 2010.

[26] Wang J, Bakhtiari Z, Yilmaz OF, Nuccio SR, Wu XX, Willner AE. 10 Gbit/s tributary channel exchange of 160 Gbit/s signals using periodically poled lithium niobate. Opt. Lett. 2011; 36(5): 630-632.

[27] Mori K, Takara H, Saruwatari M. Wavelength interchange with an optical parametric loop mirror. Electron. Lett. 1997; 33(6): 520-522.

[28] Gao Y, Dai YH, Shu C, He SL. Wavelength interchange of phase-shift-keying signal. IEEE Photon. Technol. Lett. 2010; 22(11): 838-840.

[29] Wong KKY, Marhic ME, Uesaka K, Kazovsky LG. Demonstration of wavelength exchange in a highly nonlinear fiber. ECOC 2001, pp. 272-273, 2001.

[30] Uesaka K, Wong KKY, Marhic ME, Kazovsky LG. Polarization-insensitive wavelength exchange in highly-nonlinear dispersion-shifted fiber. OFC2002, paper ThY3, 2002.

[31] Uesaka K, Wong KKY, Marhic ME, Kazovsky LG. Wavelength exchange in a highly nonlinear dispersion-shifted fiber: theory and experiments. IEEE J. Sel. Topics Quantum Electron. 2002; 8(3): 560-568.

[32] Fung RWL, Cheung HKY, Wong KKY. Widely tunable wavelength exchange in anomalous-dispersion regime. IEEE Photon. Technol. Lett. 2007; 19(22): 1846-1848.

[33] Cheung HKY, Fung RWL, Kwok CH, Wong KKY. All-optical packet switching by pulsed-pump wavelength exchange in a highly nonlinear dispersion-shifted fiber. OFC2007, paper OTuB4, 2007.

[34] Shen M, Xu X, Yuk TI, Wong KKY. Byte-level parametric wavelength exchange for narrow pulsewidth return-to-zero signals. IEEE Photon. Technol. Lett. 2009; 21(21): 1591-1593.

[35] Kwok CH, Kuo BPP, Wong KKY. Pulsed pump wavelength exchange for high speed signal de-multiplexing. Opt. Express 2008; 16(15): 10894-10899.

[36] Shen M, Xu X, Yuk TI, Wong KKY. A 160-Gb/s OTDM demultiplexer based on parametric wavelength exchange. IEEE J. Quantum Electron. 2009; 45(11): 1309-1316.

[37] Shen M, Cheung HKY, Fung RWL, Wong KKY. A comprehensive study on the dynamic range of wavelength exchange and its impact on exchanged signal performance. J. Lightwave Technol. 2009; 27(14): 2707-2716.

[38] Wang J, Bakhtiari Z, Xiao-Li Y, Nuccio SR, Yilmaz OF, Wu XX, Yang JY, Yue Y, Fazal I, Hellwarth R, Willner AE. Phase-transparent optical data exchange of 40-Gbit/s DPSK signals using four-wave-mixing in a highly nonlinear fiber. OFC 2010, paper OMT6, 2010.

[39] Wang J, Bakhtiari Z, Nuccio SR, Yilmaz OF, Wu X, Willner AE. Phase-transparent optical data exchange of 40 Gbit/s differential phase-shift keying signals. Opt. Lett. 2010; 35(17): 2979-2981.

[40] Wang J, Nuccio SR, Huang H, Wang X, Yilmaz OF, Wu XX, Yang JY, Yue Y, Willner AE. Demonstration of 100-Gbit/s DQPSK data exchange between two different wavelength channels using parametric depletion in a highly nonlinear fiber. ECOC 2010, paper Mo.1.A.4, 2010.

[41] Wang J, Nuccio SR, Huang H, Wang X, Yang JY, Willner AE. Optical data exchange of 100-Gbit/s DQPSK signals. Opt. Express 2010; 18(23): 23740-23745.

[42] Wang J, Huang H, Wang X, Yang JY, Willner AE. Optical phase-transparent data grooming exchange of multi-channel 100-Gbit/s RZ-DQPSK signals. IEEE 23rd Photonics Society Annual Meeting 2010, paper WN2, 2010.

[43] Wang J, Huang H, Wang X, Yang JY, Willner AE. Multi-channel 100-Gbit/s DQPSK data exchange using bidirectional degenerate four-wave mixing. Opt. Express 2011; 19(4): 3332-3338.

[44] Wang J, Huang H, Wang X, Yang JY, Yilmaz OF, Wu XX, Nuccio SR, Willner AE. 2.3-Tbit/s (23×100-Gbit/s) RZ-DQPSK grooming switch (simultaneous add/drop, data exchange and equalization) using double-pass LCoS and bidirectional HNLF. OFC 2011, paper OTuE2, 2011.

[45] Wang J, Huang H, Wang X, Yang JY, Willner AE. Reconfigurable 2.3-Tbit/s DQPSK simultaneous add/drop, data exchange and equalization using double-pass LCoS and bidirectional HNLF. Opt. Express 2011; 19(19): 18246-18252.

[46] Christen L, Yilmaz OF, Nuccio SR, Wu XX, Fazal I, Willner AE. Tunable time-slot interchange of 40-Gb/s optical packets using conversion/dispersion-based tunable 100-ns delays. OFC2008, paper OThA4, 2008.

[47] Yilmaz OF, Christen L, Wu XX, Nuccio SR, Fazal I, Willner AE. Time-slot interchange of 40 Gbit/s variable length optical packets using conversion-dispersion-based tunable delays. Opt. Lett. 2008; 33(17): 1954-1956.

[48] Wang J, Yilmaz OF, Nuccio SR, Wu XX, Bakhtiari Z, Xiao-Li Y, Yang JY, Huang H, Yue Y, Fazal I, Hellwarth R, Willner AE. Data traffic grooming/exchange of a single 10-Gbit/s TDM tributary channel between two pol-muxed 80-Gbit/s DPSK channels. CLEO 2010, paper CFJ5, 2010.

[49] Wang J, Yilmaz OF, Nuccio SR, Wu XX, Willner AE. Orthogonal tributary channel exchange of 160-Gbit/s pol-muxed DPSK signal. Opt. Express 2010; 18(16): 16995-17008.

[50] Suzuki J, Taira K, Fukuchi Y, Ozeki Y, Tanemura T, Kikuchi K. All-optical time-division add-drop multiplexer using optical fibre Kerr shutter. Electron. Lett. 2004; 40(7): 445-446.

[51] Wang J, Willner AE. Review of robust data exchange using optical nonlinearities. International Journal of Optics 2012; 2012: Article ID 575429. doi: 10.1155/2012/575429.

[52] Wang J, Yang JY, Fazal IM, Ahmed N, Yan Y, Willner AE, Dolinar S, Tur M. Experimental demonstration of 100-Gbit/s DQPSK data exchange between orbital-angular-momentum modes. OFC2012, paper OW1I.5, 2012.

[53] Wang J, Yang JY, Fazal IM, Ahmed N, Yan Y, Huang H, Ren YX, Yue Y, Dolinar S, Tur M, Willner AE. Terabit free-space data transmission employing orbital angular momentum multiplexing. Nature Photonics 2012; 6(7): 488-496.

[54] Willner AE, Yilmaz OF, Wang J, Wu XX, Bogoni A, Zhang L, Nuccio SR. Optically efficient nonlinear signal processing. IEEE J. Sel. Topics Quantum Electron. 2011; 17(2): 320-332.

[55] Tian Y, Xiao XS, Gao SM, Yang CX. All-optical switch based on two-pump four-wave mixing in fibers without a frequency shift. Appl. Opt. 2007; 46(23): 5588-5592.

[56] Parameswaran KR, Fujimura M, Chou MH, Fejer MM. Low-power all-optical gate based on sum frequency mixing in APE waveguides in PPLN. IEEE Photon. Technol. Lett. 2000; 12(6): 654-656.

[57] Wang J, Sun JQ, Sun QZ. Experimental observation of a 1.5 μm band wavelength conversion and logic NOT gate at 40 Gbit/s based on sum-frequency generation. Opt. Lett. 2006; 31(11): 1711-1713.

[58] Wang J, Sun JQ, Sun QZ. Single-PPLN-based simultaneous half-adder, half-subtracter, and OR logic gate: proposal and simulation. Opt. Express 2007; 15(4): 1690-1699.

[59] Allen L, Beijersbergen MW, Spreeuw RJC, Woerdman JP. Orbital angular momentum of light and the transformation of Laguerre–Gaussian laser modes. Phys. Rev. A 1992; 45(11): 8185-8189.

[60] Franke-Arnold S, Allen L, Padgett M. Advances in optical angular momentum. Laser Photon. Rev. 2008; 2(4): 299-313.

[61] Yao AM, Padgett MJ. Orbital angular momentum: origins, behavior and applications. Adv. Opt. Photon. 2011; 3(2): 161-204.

[62] Gibson G, Courtial J, Padgett M, Vasnetsov M, Pas'ko V, Barnett S, Franke-Arnold S. Free-space information transfer using light beams carrying orbital angular momentum. Opt. Express 2004; 12(22): 5448-5456.

[63] Djordjevic IB, Arabaci M, Xu L, Wang T. Spatial-domain-based multidimensional modulation for multi-Tb/s serial optical transmission. Opt. Express 2011; 19(7): 6845-6857.

[64] Lerber TV, Honkanen S, Tervonen A, Ludvigsen H, Küppers F. Optical clock recovery methods: Review (Invited). Opt. Fiber Technol. 2009; 15(4): 363-372.

[65] Mulvad HCH, Tangdiongga E, Waardt H, Dorren HJS. 40 GHz clock recovery from 640 Gbit/s OTDM signal using SOA based phase comparator. Electron. Lett. 2008; 44(2): 146-147.

[66] Oxenløwe LK, Gómez-Agis F, Ware C, Kurimura S, Mulvad HCH, Galili M, Nakajima H, Ichikawa J, Erasme D, Clausen AT, Jeppesen P. 640-Gbit/s data transmission and clock recovery using an ultrafast periodically poled lithium niobate device. J. Lightwave Technol. 2009; 27(3): 205-213.

[67] Takara H, Ono H, Abe Y, Masuda H, Takenaga K, Matsuo S, Kubota H, Shibahara K, Kobayashi T, Miyamoto Y. 1000-km 7-core fiber transmission of 10×96-Gb/s PDM-16QAM using Raman amplification with 6.5 W per fiber. Opt. Express 2012; 20(9): 10100-10105.

[68] Liu X, Chandrasekhar S, Chen X, Winzer PJ, Pan Y, Taunay TF, Zhu B, Fishteyn M, Yan MF, Fini JM, Monberg EM, Dimarcello FV. 1.12-Tb/s 32-QAM-OFDM superchannel with 8.6-b/s/Hz intrachannel spectral efficiency and space-division multiplexed transmission with 60-b/s/Hz aggregate spectral efficiency. Opt. Express 2011; 19(26): B958-B964.

[69] Al Amin A, Li A, Chen S, Gao G, Shieh W. Dual-LP11 mode 4×4 MIMO-OFDM transmission over a two-mode fiber. Opt. Express 2011; 19(17): 16672-16679.

[70] Ryf R, Randel S, Gnauck AH, Bolle C, Sierra A, Mumtaz S, Esmaeelpour M, Burrows EC, Essiambre RJ, Winzer PJ, Peckham DW, McCurdy AH, Lingle R. Mode-division multiplexing over 96 km of few-mode fiber using coherent 6 × 6 MIMO processing. J. Lightwave Technol. 2012; 30(4): 521-531.

[71] Koizumi Y, Toyoda K, Yoshida M, Nakazawa M. 1024 QAM (60 Gbit/s) single-carrier coherent optical transmission over 150 km. Opt. Express 2012; 20(11): 12508-12514.


> © 2013 Roy and Chattopadhyay; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.




### **All-Optical Quaternary Logic Based Information Processing: Challenges and Opportunities**

Jitendra Nath Roy and Tanay Chattopadhyay

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51559

#### **1. Introduction**


Science and technology are providing people all over the world with much better ways of communicating than ever before, and the winds of change have whipped up the desire to exchange more of everything, from messages to movies. The field of computation and signal processing is growing day by day [1-7]. In the last three to four decades, advances in philosophy, science and engineering have greatly enriched the scientific community. Massive parallelism, speed of operation and increased spatial density attract scientists, researchers and technologists in many ways. Very Large Scale Integration (VLSI) technology has revolutionized the electronics industry and established the 20th century as the computer age. But VLSI technology is approaching its fundamental limits in the sub-micron miniaturization process. Therefore an alternative technological solution to the problem of high-speed information processing is needed; unless we gear our thoughts toward a totally different pathway, we will not be able to further improve our information processing performance in the future. Conservative and reversible logic gates are widely known to be compatible with revolutionary computing paradigms. At the same time, multi-valued logic (MVL) is positioned as a coming-generation technology that can execute arithmetic functions faster and with less interconnect than binary logic [8-48].

In order to overcome the electronic bottlenecks and fully exploit the advantages of optics, it is necessary to move towards networks where the transmitted data remain exclusively in the all-optical domain, without optical-electrical-optical (OEO) conversions. Ultra-high-speed optical networks are developing rapidly as capacity demand in telecommunication systems grows. In these networks, it is desirable to carry out switching, routing and processing in the optical domain to avoid the bottlenecks of optoelectronic conversions. The dream of photonics is to have a completely all-optical technology. All-optical switching is an essential technology for transparent fiber-optic networks and for all forms of optical signal processing, as optical interconnections and optical integrated circuits are immune to electromagnetic interference and free from electrical short circuits. In a pursuit to probe into cutting-edge research areas, the development of different ultra-fast all-optical switches has received considerable interest all over the world in recent years for future optical information processing [49-59]. As the photon is the ultimate unit of information, with unmatched speed and a data package of zero mass, the techniques of computing with light may provide a way out of the limitations of computational speed and complexity inherent in electronic computing.

The fundamentals of digital signal processing are straightforward. To send something as simple as a phone message or as complicated as a picture, we digitize it by breaking it up into a series of binary bits, transmit the bits, and decode them at the other end to re-create the message or picture. The ones or zeroes in the bits are encoded by turning some signal on or off. In the past, the signal has been electrical, but increasingly it is composed of light pulses. We use a laser to produce the light, then add information to it with a modulator, transmit it through optical fibers, amplify it if needed, receive it with a photodetector and re-create the message with a demodulator. An optical signal is better than an electrical one, with less attenuation, faster switching, and more signals traveling together. Every day we have to handle enormous and ever-increasing amounts of information. Binary numbers (0 and 1) are insufficient with respect to the demands of the coming generation. The application of multi-valued (non-binary) signals can provide considerable relief in the transmission, storage and processing of large amounts of information in digital signal processing. Quaternary logic (4-valued) is one type of MVL [60-82].

In this chapter, an all-optical scheme for designing some polarization encoded quaternary logic gates (quaternary MIN and quaternary delta literal) with the help of nonlinear material based interferometric switches is discussed. The design of all-optical quaternary multi-valued multiplexer and demultiplexer circuits built from these basic gates is also described. For quaternary data processing in optics, the quaternary numbers (0, 1, 2, 3) are represented by four discrete polarized states of light. In optical implementation we can consider the set of quaternary logic states {0, 1, 2, 3} as: 0 = no light, 1 = vertically polarized light (↕), 2 = horizontally polarized light (•), 3 = partially polarized light (↔). This chapter is organized as follows. Sections 1.1 to 1.3 give a brief overview of multi-valued logic (MVL): What is MVL? Why do we need it? How can it be implemented and where can it be applied? Section 2.1 describes the basic principle of all-optical interferometric switches, which are the cornerstone of all the logic based signal processing presented here. Sections 2.2 and 2.3 describe the design and operational principle of some basic all-optical quaternary logic circuits (QMIN, Delta Literal). All-optical circuits for the quaternary multiplexer and demultiplexer are described in sections 2.4 and 2.5. The quaternary T-gate is discussed in section 3. Design challenges that must be considered to achieve experimental results from the proposed scheme are mentioned in section 4. The chapter ends with Conclusions and Future Scope in section 5.


**Figure 1.** Different fields in multi-valued logic.

#### **1.1. What is Quaternary Logic?**

switching is an essential technology for transparent fiber optic networks and for all forms of optical signal processing as the optical interconnections and optical integrated circuits is immune to electromagnetic interference, and free from electrical short circuits. In a pur‐ suit to probe into cutting-edge research areas, the development of different ultra-fast alloptical switches has received considerable interest in recent years all over the world for future optical information processing [49-59]. As photon is the ultimate unit of informa‐ tion with unmatched speed and with data package in a signal of zero mass, the techni‐ ques of computing with light may provide a way out of the limitations of computational

The fundamentals of digital signal processing are straightforward. To send something as simple as a phone message or as complicated as a picture, we digitize it by breaking it up into a series of binary bits, transmit the bits, and decode them at the other end to re-cre‐ ate the message or picture. The ones or zeroes in the bits are encoded by turning some signal on or off. In the past, the signal has been electrical, but increasingly it is composed of light pulses. We use a laser to produce the light, and then add information to it with a modulator, transmit it through optical fibers, amplify it if needed, receive it with a photo detector and re-create the message with a demodulator. An optical signal is better than an electrical one, with less attenuation, faster switching, and more signals traveling together. In everyday we have to handle enormous and ever increasing, amounts of information. Binary number (0 and 1) is insufficient in respect to the demand of the coming genera‐ tion. The application of multi-valued (non-binary) signals can provide a considerable re‐ lief in transmission, storage and processing of large amount of information in digital

signal processing. Quaternary logic (4-valued) is one type of MVL [60-82].

clusions and Future Scopes given in section-5.

In this chapter, all-optical scheme for designing some polarization encoded quaternary logic gates (quaternary min and quaternary delta literal) with the help of nonlinear material based interferometric switches have been discussed. Design of all-optical quaternary multi‐ valued multiplexer and demultiplexer circuits have also been described with the help of these basic gates. For the quaternary data processing in optics, the quaternary number (0, 1, 2, 3) have been represented by four discrete polarized state of light. In optical implementa‐ tion we can consider the set of Quaternary logic states {0, 1, 2, 3} as : 0= No light, 1 = vertical‐ ly polarized light (↕), 2 = horizontally polarized light (•), 3 = partially polarized light (↔).This chapter is organized as follows. Section-1.1 to 1.3 gives a brief overview of multivalued logic (MVL) i.e. What is MVL? Why do we need it? How it can be implemented and where MVL can be applied? Section-2.1 describes the basic principle of all-optical interfero‐ metric switches which is the cornerstone of all logic based signal processing. Section-2.2 and section 2.3 describes the design and operational principle of some basic all-optical quaterna‐ ry logic circuits (QMIN, Delta Literal). All-optical circuit for quaternary multiplexer and de‐ multiplexer are described in section-2.4 and section 2.5. Also quaternary T-gate is discussed in section 3. Challenges in design issues that is to be considered for experimentally achieve result from the proposed scheme is mentioned in section section-4. Chapter ends with Con‐

speed and complexity inherent in electronics computing.

82 Design and Architectures for Digital Signal Processing

Multi-valued logic (MVL) is a non-binary logic with radix > 2. Binary logic is limited to only two states, 'True' (1) and 'False' (0); MVL replaces these with a finite or infinite number of values. An MVL system is defined as a system operating on a radix higher than two. In the base-*R* number system, the numerical value of the *N*-bit data $(a_{N-1} a_{N-2} \cdots a_2 a_1 a_0)_R$, where $0 \le a_i \le (R-1)$, can be written as [56]:

$$\left( a_{N-1} R^{N-1} + a_{N-2} R^{N-2} + \dots + a_1 R^1 + a_0 R^0 \right) = \sum_{i=0}^{N-1} a_i R^i \tag{1}$$

For example, ternary logic (*R*=3) has three logical states, {0, 1, 2} or $\{\bar{1}, 0, 1\}$ [18], known as ordinary ternary and symmetric ternary logic respectively. Quaternary logic (*R*=4) has four logical states {0, 1, 2, 3}. As in the binary world, there are also a number of basic gates in the multi-valued logic world. Depending on the radix and the number of variables used, different logic functions can be generated. The number of possible functions is [37]:

$$f(R, n) = R^{R^n} \tag{2}$$


where *R* = radix and *n* = number of variables. In ternary logic of two variables (*R* = 3, *n* = 2) there are $f(3, 2) = 3^{3^2} = 19683$ possible functions. For quaternary logic (*R* = 4, *n* = 2) there are $f(4, 2) = 4^{4^2} = 4294967296$ logical operations. Hence, a huge number of logical operations become possible at higher radix (Fig. 1). Among them, some basic gates are MAX, MIN, Complement, Cycle or successor, Literals, etc. [6, 7, 38-41], as indicated in Fig. 2.
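As a quick numerical check of equations (1) and (2), the short Python sketch below (the function names are ours, chosen for illustration) evaluates a base-*R* digit string and counts the possible *n*-variable functions:

```python
# Minimal sketch of equations (1) and (2): the value of a base-R digit
# string, and the number of n-variable functions over radix R.

def base_r_value(digits, R):
    """Equation (1): value of (a_{N-1} ... a_1 a_0)_R as sum of a_i * R**i."""
    assert all(0 <= a <= R - 1 for a in digits)
    return sum(a * R**i for i, a in enumerate(reversed(digits)))

def num_functions(R, n):
    """Equation (2): number of n-variable R-valued functions, R**(R**n)."""
    return R**(R**n)

print(base_r_value([1, 2, 3], 4))   # (123)_4 = 1*16 + 2*4 + 3 = 27
print(num_functions(3, 2))          # 19683 two-variable ternary functions
print(num_functions(4, 2))          # 4294967296 two-variable quaternary functions
```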

#### **1.2. Why do we need All-optical Quaternary Logic based signal processing?**

The most pressing problems in present-day binary systems are interconnection problems, both on-chip and between chips. On chip, the difficulties of placement and routing of the digital logic elements which make up the complete chip are escalating with the increase in capability per chip, and the silicon area used for interconnections may be greater than that used for the active logic elements. Similarly, the difficulty of bringing an increasing number of connections off-chip is prompting a reconsideration of packaging concepts in an attempt to overcome problems which are becoming mechanically, thermally and electrically extreme. All these factors point to the attraction of raising the information content per interconnection above the present lowest-possible (binary) level. Multiple-valued logic, in which the number of discrete logic levels is not confined to two, has been the subject of much research over many years. The practical objective of this work is to increase the information content of the digital signals in a system beyond that provided by binary operation; for increasing the transmission capacity of future communication systems, the present binary approach is becoming a critical bottleneck. A more formal view is an *n*-valued logic which has *n* different states, each state having a unique identifier. Multi-valued logic (MVL) is defined as a non-binary logic and involves switching between more than two states. It can be viewed as an alternative approach to solving many problems in the transmission, storage and processing of large amounts of information in digital signal processing [22]. The main advantages of multi-valued logic systems and circuits are greater speed of arithmetic operations, greater density of memorized information, better usage of transmission paths, decreased interconnection complexity and interconnection area, a decreased pin count for integrated circuits and printed boards, and possibilities for easier testing.

In the field of data communication, quaternary codes are preferred because four-valued (i.e. quaternary) logic signals easily interface with the binary world: they may be decoded directly into their two-binary-digit equivalents. The quaternary logic world can easily be interfaced with binary logic in the all-optical domain with the help of our suggested DEC and ENC schemes [57, 59]. The block diagram of this interfacing circuit is shown in Fig. 3. Here the inputs and outputs are 4-valued and the internal circuitry is binary (radix = 2). The decoder circuit converts the quaternary input into its binary equivalent; after the logical operation is performed in the binary system, the result is encoded back to its quaternary equivalent by the encoder circuit. Hence, it can be said that this scheme requires no major modifications of the existing transmitter, receiver, or transmission link. Quaternary digits are of two major types: ordinary quaternary digits (OQD) and quaternary signed digits (QSD) [75-76]. QSD is useful for carry-free arithmetic operations [19, 31-35]. Fig. 4 indicates how quaternary and binary are interfaced.

**Figure 2.** Different fields in quaternary logic.


#### **1.3. How can All-optical Quaternary Logic be implemented?**

Consideration of the different logical states is a challenge. It can be done in different ways, as given in Fig. 5. In electronics, efforts have already been made to design four-valued logic [37-48] with charge-coupled devices (CCD). In I²L circuits, 0 mA, 10 mA, 20 mA and 30 mA form four different logical states; *ν*MOS (neuron-MOS) circuits have the logic levels 0.0 V, 1.1 V, 2.2 V and 3.3 V [39]; and CMOS MVL circuits have the logic levels 0 V, 1 V, 2 V and 3 V respectively [37]. Quantum computation and information is the study of the information processing tasks that can be accomplished using quantum mechanical systems. Just as classical computation is built upon bits, quantum computation has an analogous concept called qubits, and analogous to classical computation, the operations on qubits are carried out using quantum logic gates. Of late, renewed interest in optical computing has been witnessed due to the emergence of novel photonic structures, including nano-photonics, silicon photonics, bio-photonics and plasmonics. Optical quaternary logical operation is an interesting and challenging field of research for future optical signal processing where we can expect much innovation [58-82]. The polarization properties of light can play a significant role here.

**Figure 3.** Binary-to-quaternary Encoder and Quaternary-to-binary decoder.

**Figure 4.** Interfacing the Binary and Quaternary worlds with the help of ENC and DEC.

**Figure 5.** Quaternary (4-valued) logic implementation.

Polarization may be a good choice for representing the different logical states in all-optical quaternary (4-valued) logic operations because [7, 15]:

**•** No photon energy is wasted.

**•** The nature of polarization does not change due to the absorption of light, unlike intensity. Therefore the strength or weakness of the beam plays no role in the operation of the devices.

**•** The state of polarization can be changed easily by a polarization converter.



For quaternary data processing in optics, the quaternary logic states {0, 1, 2, 3} can be represented by four discrete polarized states of light, as mentioned below:

0 = No light.

1 = vertically polarized light (↕)

2 = horizontally polarized light (•)

3 = mixed polarized light or un-polarized light (↔).
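Purely to illustrate this encoding (the Jones-vector values below are our own assumption; the chapter fixes the four states only qualitatively), a small Python sketch shows how a polarizing beam splitter would separate the vertical and horizontal components that the circuits in section 2 operate on:

```python
# Illustrative encoding of the four quaternary states as polarization:
# states 1 and 2 as orthogonal Jones vectors, state 3 as their mixture.

import numpy as np

QUATERNARY_STATES = {
    0: None,                                  # no light
    1: np.array([0.0, 1.0]),                  # vertically polarized (Ey only)
    2: np.array([1.0, 0.0]),                  # horizontally polarized (Ex only)
    3: np.array([1.0, 1.0]) / np.sqrt(2.0),   # mixed: both components present
}

def decompose(state):
    """Split a quaternary state into (vertical, horizontal) intensity parts,
    mirroring what a polarizing beam splitter (PBS) does in the circuits."""
    jones = QUATERNARY_STATES[state]
    if jones is None:
        return 0.0, 0.0
    ex, ey = jones[0], jones[1]
    return ey**2, ex**2    # intensities in the two PBS output arms

for s in range(4):
    print(s, decompose(s))
```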

#### **2. Design of Polarization encoded all-optical Quaternary Multiplexer / De-multiplexer**


Multiplexing and de-multiplexing are two essential features in almost all signal communication systems, where a lot of information is handled without mutual disturbance. The principles and possibilities of designing all-optical quaternary multi-valued multiplexer and de-multiplexer circuits are described with the help of quaternary MIN and quaternary delta literal gates (Fig. 6). Nonlinear material based interferometric switches play an important role here. The working principle of the Terahertz Optical Asymmetric Demultiplexer (TOAD) based all-optical switch is discussed in section 2.1. Sections 2.2 and 2.3 describe the design and operational principle of some basic all-optical quaternary logic circuits (QMIN, Delta Literal). All-optical circuits for the quaternary multiplexer and demultiplexer are proposed and described in sections 2.4 and 2.5 respectively.

**Figure 6.** Overview of Quaternary Mux/Demux.

#### **2.1. Interferometer based optical switch:**

Interferometric devices for optical processing have been of great interest in recent years [50-55]. An optical switch using a nonlinear interferometer makes it possible for one optical signal to control and switch another optical signal through the nonlinear interaction in a material. The incoming signal to be switched is split between the arms of the interferometer. The interferometer is balanced so that, in the absence of a control signal, the incoming signal emerges from one output port. The presence of a strong control pulse changes the refractive index of the medium as given by


$$\Delta n = n_2 I \tag{3}$$

where Δn is the change in the refractive index of the medium, n₂ is the nonlinear refractive coefficient and I is the intensity of the light incident on the medium. A change in the index adds a phase shift between the two arms of the interferometer, so that the incoming signal is switched over to the other output port. This method of switching is based on cross-phase modulation (XPM). In recent years, the semiconductor optical amplifier (SOA) has revolutionized the design of high-speed (>100 Gb/s) interferometric switches in all-optical information processing systems. SOA technology has been evolving rapidly and has become mature enough that it is now a key factor in the implementation of modern optical communication networks. SOAs are commercially available devices with several important properties, such as fast and strong nonlinearities, short latency, thermal stability, low power consumption, large dynamic range, short response time, broadband and versatile operation, and the capability of large-scale integration with chip-level design.

**Figure 7.** A TOAD based optical switch, where SOA: Semiconductor optical amplifier, CW: Clockwise pulse, CCW: Counter clockwise pulse, CO: coupler, F: Filter which blocks control pulse.

Fig. 7 shows a Sagnac interferometer that uses an SOA offset from the midpoint of the loop; it is known as a terahertz optical asymmetric demultiplexer (TOAD) and can operate at frequencies in the terahertz range [50-51]. There are two couplers: 1) the control coupler provides an input path for the control pulses to enter the fiber loop in order to saturate the SOA, and 2) the input coupler (50:50), where the incoming pulse signal train entering the loop splits equally into clockwise (CW) and counterclockwise (CCW) pulses. The CW and CCW pulses arrive at the SOA at slightly different times, as determined by the offset *Δx* of the SOA from the midpoint of the loop. Another strong light pulse, called the control signal (CS), is also injected into the loop. When *CS*=*ON*, the SOA changes its index of refraction. As a result, the two counter-propagating data signals experience differential gain saturation profiles, and cross-phase modulation (XPM) takes place when they recombine at the input coupler. The relative phase difference between the CW and CCW pulses becomes ∼*π*, and the data exit from the transmitted port / T-port (output-1 according to Fig. 7). In the absence of a control signal (*CS=OFF*), the incoming signals enter the fiber loop, pass through the SOA at different times as they counter-propagate around the loop, experience nearly the same unsaturated amplifier gain of the SOA, and recombine at the input coupler. The relative phase difference between CW and CCW is then zero, no data is found at the T-port, and the data is reflected back toward the source and isolated by the optical circulator (CR). The port through which it emerges is called the reflected port / R-port (output-2 according to Fig. 7). A filter (F) may be used at the output to reject the control pulse and pass the incoming pulse; 'F' can be a polarization filter or a band-pass filter.

The output powers of the transmitted port (T-port) and the reflected port (R-port) of a TOAD based switch can be expressed as [52-53]:

$$P_T(t) = \frac{P_{in}(t)}{4} \cdot \left( G_{cw}(t) + G_{ccw}(t) - 2\sqrt{G_{cw}(t) \cdot G_{ccw}(t)} \cdot \cos(\Delta\phi) \right) \tag{4}$$


$$P_R(t) = \frac{P_{in}(t)}{4} \cdot \left( G_{cw}(t) + G_{ccw}(t) + 2\sqrt{G_{cw}(t) \cdot G_{ccw}(t)} \cdot \cos(\Delta\phi) \right) \tag{5}$$

where $G_{cw}(t)$ and $G_{ccw}(t)$ are the power gains of the CW and CCW pulses, and $\Delta\phi = (\phi_{cw} - \phi_{ccw})$ is the phase difference between the CW and CCW pulses, which can be expressed as $\Delta\phi = -\alpha/2 \cdot \ln(G_{cw}/G_{ccw})$. The temporal duration of the switching window ($\tau_{win}$) depends on the offset position of the SOA in the loop ($\Delta x$) and is given by $T_{off} = 2\Delta x / c_{fiber}$, where $c_{fiber}$ is the velocity of light inside the optical fiber. More specifically, the eccentricity of the loop must be less than half the bit period; otherwise the two counter-propagating halves of the incoming signal (IS) being processed will not experience the gain dynamics caused by their synchronized control pulses but instead those caused by others, resulting in incomplete switching. The $T_{FWHM}$ of the control pulse must be as short as possible, and ideally less than the switching window, so that when the CCW pulse enters the SOA the CW pulse has already passed through and the SOA gain has started to recover after saturation by the control pulse; otherwise the two components of the IS will overlap inside the SOA, perceiving its nonlinear properties as only partially altered and thus obstructing the creation of the required differential phase shift [52-53].

$$\sigma < T < 0.5\xi < \tau_e < 1.5\xi \tag{6}$$

where *ξ* is the bit period. For a narrow switching window, the eccentricity of the loop (*T*) should be small. When one data bit is transmitted through the switching window, the next data bit cannot pass until gain recovery of the SOA takes place.
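To make equations (4) and (5) concrete, the following Python sketch evaluates the T-port and R-port powers for the two control cases; the gain and α values are arbitrary illustrative numbers, not measured SOA data:

```python
import numpy as np

# Sketch of equations (4)-(5): TOAD output powers from the CW/CCW gains.
# alpha is the linewidth enhancement factor; values here are illustrative.

def toad_ports(P_in, G_cw, G_ccw, alpha=6.0):
    dphi = -alpha / 2.0 * np.log(G_cw / G_ccw)          # differential phase
    cross = 2.0 * np.sqrt(G_cw * G_ccw) * np.cos(dphi)
    P_T = P_in / 4.0 * (G_cw + G_ccw - cross)           # transmitted port, eq. (4)
    P_R = P_in / 4.0 * (G_cw + G_ccw + cross)           # reflected port, eq. (5)
    return P_T, P_R

# CS = OFF: both halves see the same unsaturated gain, so dphi = 0 and
# the signal exits entirely from the reflected port.
print(toad_ports(P_in=1.0, G_cw=10.0, G_ccw=10.0))
# CS = ON: differential gain saturation gives dphi ~ pi, switching the
# signal to the transmitted port.
print(toad_ports(P_in=1.0, G_cw=10.0, G_ccw=10.0 * np.exp(-2 * np.pi / 6.0)))
```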

In summary, in the absence of the control signal, the incoming signal exits through the input port of the TOAD and reaches output port-2, as shown in Fig. 8(a); in this case no light is present at output port-1. In the presence of the control signal, the incoming signal exits through the output port of the TOAD and reaches output port-1; in this case no light is present at output port-2. In the absence of the incoming signal, port-1 and port-2 receive no light, as the filter blocks the control signal and only the incoming signal is passed through the filter. The truth table is given in Fig. 8(b). The above switching principle is used to design the basic quaternary logic circuits.

**Figure 8.** (a) Schematic block diagram of the TOAD based switch and (b) the corresponding truth table.

#### **2.2. All-optical two input Quaternary MIN Gate**


The quaternary MIN (QMIN) gate is the equivalent of the AND gate in the binary world [6, 39] and is an important multi-valued logic function. The QMIN operation is shown in equation (7), where the operator ∧ denotes QMIN. The truth table is shown in Table 1.

$$x_1 \wedge x_2 \wedge \cdots \wedge x_n = QMIN(x_1, x_2, \cdots, x_n) \tag{7}$$


| X \ Y | **0** | **1** | **2** | **3** |
|---|---|---|---|---|
| **0** | 0 | 0 | 0 | 0 |
| **1** | 0 | 1 | 1 | 1 |
| **2** | 0 | 1 | 2 | 2 |
| **3** | 0 | 1 | 2 | 3 |

**Table 1.** Truth Table of Quaternary MIN(x,y).
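Functionally, QMIN is simply the minimum of its operands, so Table 1 can be reproduced in a couple of lines of Python:

```python
# Equation (7) as code: QMIN of any number of quaternary digits.
def qmin(*xs):
    return min(xs)

# Reproduce Table 1: rows indexed by X, columns by Y.
for x in range(4):
    print([qmin(x, y) for y in range(4)])
```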

The polarization encoded all-optical QMIN gate is shown in Fig. 9. Here the light from inputs X and Y falls on two PBS (PBS1 & PBS2), where each beam is split into two polarized beams, one vertically polarized (↕) and the other horizontally polarized (•). Hence X1 & Y1 are vertically polarized (↕) and X2 & Y2 are horizontally polarized (•). Light from X2 and Y2 is fed to two interferometric switches (here TOADs), S1 and S2, as the incoming signals, with their control signals taken from Y2 and X2 respectively. The lower outputs of S1 and S2 are passed through a PC (polarization converter, preferably a half-wave plate, which converts vertically polarized light to horizontal and vice versa); these are indicated as S1L and S2L respectively in Fig. 9. X1 and S1L are combined by BC-1, and the combined ray (C1) is connected to another switch, S3, as the incoming signal. Y1 and S2L are combined by BC-2, and the combined ray (C2) is connected to S3 as the control signal. The upper output channel of S3 (S3U) is fed to BC-3. Again, X2 and Y2 are fed to another switch, S4, as the incoming and control signals respectively. All the control signals are amplified by EDFA (Erbium Doped Fiber Amplifier). When an incoming light signal is incident on the wavelength converter (WC), the WC converts the wavelength of the incoming signal to the wavelength of the control signal. The upper output channel of switch S4 (S4U) is connected to BC-3, and the combined ray is the final output. Let us describe the operational principles in detail [66].

**1.** When X=0, X1=X2=0. X2, which acts as the incoming signal of S1 and S4, is zero, so S1L and S4U receive no light. BC-1 therefore receives no light, its output is zero, and S3U receives no light. Hence the final output after BC-3 is 0. This result cannot be changed by any value of Y.

**2.** Similarly, when Y=0, Y1=Y2=0, so the incoming signal of S2 and the control signals of S4 and S3 are zero; hence S4U=S2L=C2=S3U=0, i.e. they receive no light. So the final output after BC-3 is 0. This result cannot be changed by any value of X.

**3.** When X=1 (↕), X1=1 and X2=0. So S1L & S4U receive no light (as the incoming signals of S1 and S4 are absent), and C1=1 (↕), i.e. vertically polarized light.

- **•** Now if Y=1 (↕), then Y1=1 and Y2=0. So S2L=0 (as the incoming signal of S2 is absent) and C2=1. So S3U=1 (as both the incoming signal C1 and the control signal C2 of S3 are present). So the final output after BC-3 is 1 (↕).
- **•** When Y=2 (•), i.e. horizontally polarized light, Y1=0 (no light) and Y2=2. So S2L, and hence C2, receives vertically polarized light (1, i.e. ↕). Hence S3U=1, and the final output after BC-3 is 1 (↕).
- **•** When Y=3 (↔), Y1=1 and Y2=2. So S2L=1 (as the incoming signal is present but the control signal is absent at S2) and C2=1. Hence S3U=1, and the final output after BC-3 is 1 (↕).

**4.** When X is 2 (•) and Y is 1 (↕), X1 & Y2 receive no light; that is, X1=0, Y2=0, Y1=1 and X2=2. Hence S4U=S2L=0 (as the control signal of S4 and the incoming signal of S2 are absent) and S1L=C1=C2=1, i.e. vertically polarized light. So S3U=1, as both the incoming and control signals of S3 are present, and the final output after BC-3 is 1 (↕).

**5.** When X=Y=2 (•), i.e. both are horizontally polarized, X1 & Y1 receive no light (0) and X2 & Y2 receive horizontally polarized light (2). Hence S4U=2, and S2L=0 (as both the incoming and control signals of S4 and S2 are present). As Y1=S2L=0, S3U=0, so the final output after BC-3 is 2 (•).

**6.** When X takes horizontally polarized light, i.e. 2 (•), and Y is partially polarized light, i.e. 3 (↔), then X1 receives no light (0) and Y1=1, X2=Y2=2. Hence S4U=2 and S2L=0, C2=1. Again, as X2=Y2=2, S1L=C1=0 (as both the incoming and control signals are present at S1). As C1 is the incoming signal of S3, S3U=0, and the final output is 2 (•).

**7.** When X=3 (↔), X1=1 and X2=2. When Y=1, Y1=1 and Y2=0. So S4U=S2L=0 (as the control signal is absent at S4 and the incoming signal is absent at S2) and S1L=1 (as the incoming signal is present but the control signal is absent at S1). So C1=C2=1, hence S3U=1, and the final output is 1 (↕).

**8.** When X=3 (↔), X1=1 and X2=2. When Y=2 (•), Y1=0 and Y2=2. So S4U=2 and S2L=S1L=0. So C1=1 and C2=0, hence S3U=0, and the final output is 2 (•).

**9.** When X & Y are both partially polarized light, i.e. 3 (↔), X1=Y1=1 and X2=Y2=2. So S4U receives horizontally polarized light (2, i.e. •) and S2L=S1L=0. Hence C1=C2=1 and S3U=1, so the final output is 3 (↔).

**Figure 9.** All-optical Quaternary QMIN(X,Y) Circuit. S: Switch, PBS: Polarizing Beam Splitter, BC: Beam Combiner, PC: Polarization Converter, ► EDFA: Erbium Doped Fiber Amplifier, ∎ WC: Wavelength Converter.
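To check that the nine cases above really compute the minimum, the sketch below abstracts each TOAD to its ideal routing behaviour and each beam to the presence of its two polarization components; this is a deliberately simplified logical model of Fig. 9 (our own abstraction), not a physical simulation:

```python
# Boolean-level sketch of the Fig. 9 QMIN circuit: each beam is modelled
# only by the presence of its vertical/horizontal component, and a TOAD
# is modelled by its ideal routing behaviour.

def toad(incoming, control):
    """Ideal TOAD: incoming exits the upper port when the control pulse is
    present, otherwise the lower port. Returns (upper, lower)."""
    return (incoming, False) if control else (False, incoming)

def qmin_circuit(x, y):
    # PBS decomposition: vertical component (states 1, 3), horizontal (2, 3).
    x1, x2 = x in (1, 3), x in (2, 3)
    y1, y2 = y in (1, 3), y in (2, 3)
    _, s1l = toad(x2, y2)        # S1: incoming X2, control Y2
    _, s2l = toad(y2, x2)        # S2: incoming Y2, control X2
    c1 = x1 or s1l               # BC-1 (S1L passed through the PC -> vertical)
    c2 = y1 or s2l               # BC-2
    s3u, _ = toad(c1, c2)        # S3: incoming C1, control C2
    s4u, _ = toad(x2, y2)        # S4: incoming X2, control Y2
    return (1 if s3u else 0) + (2 if s4u else 0)   # BC-3 recombination

# The circuit reproduces Table 1: output == min(x, y) for all 16 input pairs.
assert all(qmin_circuit(x, y) == min(x, y) for x in range(4) for y in range(4))
```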

#### **2.3. All-optical Quaternary delta LITERALS**

Literals are very important functions in multi-valued logic based information processing [67]. The truth table of the delta literal circuit [66] is given in Table 2 and the circuit diagram is shown in Fig. 10. Here X is the quaternary input, which can take any one of the four quaternary logic states, and the outputs are $x^0$, $x^1$, $x^2$ and $x^3$ respectively.

**Figure 10.** All-optical Quaternary Delta Literal Circuit.

| Input X | $x^0$ | $x^1$ | $x^2$ | $x^3$ |
|---|---|---|---|---|
| 0 | 3 | 0 | 0 | 0 |
| 1 | 0 | 3 | 0 | 0 |
| 2 | 0 | 0 | 3 | 0 |
| 3 | 0 | 0 | 0 | 3 |

**Table 2.** Truth table of quaternary delta Literals.

**1.** When X=0 (absence of light), X1 & X2 receive no light and the other outputs of switches S1 and S2 are 0, as they receive no light. Only the vertically polarized light (↕) that comes from LS through PBS1 falls on S3, acting as its incoming signal. As the control signal is absent (because C=0), the light comes out through the lower channel of S3, i.e. through S3L. A part of this directly enters the beam combiner (BC) and another part is passed through the PC, which converts vertically polarized (↕) to horizontally polarized light (•). Hence the output of BC-1, i.e. $x^0$, receives partially polarized light, i.e. 3 (↔). The final outputs are therefore $x^0$ = 3 (↔), i.e. partially polarized light, while the others receive no light, i.e. $x^1$ = 0, $x^2$ = 0 and $x^3$ = 0.

**2.** When X=1 (↕), then X1=1, X2=0 and C=1. So S3L=0, as the control signal of S3 is present, and S1L=1 (↕), as the control signal of S1 is absent. The other outputs (S2U & S2L) of switch S2 are absent, as the incoming signal of S2 is absent. So the final outputs are $x^0$ = 0, $x^1$ = 3 (↔), $x^2$ = 0 and $x^3$ = 0.

**3.** When X=2 (•), then X1=0, X2=1 and C=1. So S3L=0, and S1L=0 as the incoming signal of S1 is absent. S2U=0 and S2L=1 (↕), because the control signal of S2 is absent. Hence the final outputs are $x^0$ = 0, $x^1$ = 0, $x^2$ = 3 (↔) and $x^3$ = 0.

**4.** When X is partially polarized light, i.e. 3 (↔), then X1, X2 & C all receive vertically polarized light, i.e. 1. So S3L = S1L = S2L = 0 and S2U = 1 (↕), as both the incoming and control signals of S3, S1 and S2 are present. So the final outputs are $x^0$ = 0, $x^1$ = 0, $x^2$ = 0 and $x^3$ = 3 (↔).

#### **2.4. Design of All-optical Quaternary Multiplexer (4:1)**

From the truth table of the QMIN gate (Table 1) we can say that

$$\left. \begin{array}{l} 3 \wedge A = A \\ 0 \wedge A = 0 \end{array} \right\} \quad \text{where } A \in \{0, 1, 2, 3\} \tag{8}$$

Now we design the quaternary multiplexer (QMUX) and demultiplexer (QDEMUX) using the basic gates QMIN and Delta Literal, built from the switching character of the nonlinear material based switch [66].

| Control input signal (X) | QMUX Output (Y) | QDEMUX Outputs (Y0 Y1 Y2 Y3) |
|---|---|---|
| 0 | A | A 0 0 0 |
| 1 | B | 0 B 0 0 |
| 2 | C | 0 0 C 0 |
| 3 | D | 0 0 0 D |

**Table 3.** Truth Table of Quaternary Multiplexer (QMUX) and Demultiplexer (QDEMUX).

Multiplexing means many into one: a multiplexer is a system dealing with many inputs and only a single output. A quaternary multiplexer with *n* control inputs can be used to route one of 4*ⁿ* data inputs (each may be any one of the four logical states) to the output. Fig. 11 shows the design of the 4:1 all-optical quaternary multiplexer. Here four inputs A, B, C and D [each of which can be any one of the 4 logical states, i.e. 0 (no light), 1 (↕), 2 (•), 3 (↔)] are connected to four 2-input QMIN gates (QMIN0, QMIN1, QMIN2 and QMIN3). The other input of each QMIN gate is fed from one of the delta literal outputs (i.e. $x^0$, $x^1$, $x^2$ and $x^3$) respectively, as shown in Fig. 11. These inputs of the QMIN gates act as the select line.

**Figure 11.** All optical Quaternary 4:1 Multiplexer (QMUX).

**1.** When X (the delta literal input) is zero, i.e. with no signal (0), then $x^0$ receives the logical state 3 (↔), i.e. partially polarized light, and the other outputs of the delta literal ($x^1$, $x^2$ and $x^3$) receive no light (0), as discussed in the earlier section. As $x^0$ is connected with QMIN0, then according to equation (8) only QMIN0 is active and the other QMIN gates (QMIN1, QMIN2 and QMIN3) are inactive. Hence the corresponding input A appears at the output, i.e. Y0 = A, and Y1 = Y2 = Y3 = 0. After combining in the BC we receive Y = A at the output.

**2.** When X is vertically polarized light, i.e. 1 (↕), then only $x^1$ receives the logical state 3 (↔), and $x^0 = x^2 = x^3 = 0$ (no light). As $x^1$ is connected with QMIN1, according to equation (8) only QMIN1 is active and QMIN0, QMIN2 and QMIN3 are inactive. Hence Y1 = B & Y0 = Y2 = Y3 = 0, and at the final output we receive Y = B.

**3.** When X is horizontally polarized, i.e. 2 (•), then only $x^2$ receives the logical state 3 (↔), and $x^0 = x^1 = x^3 = 0$ (no light). Hence only QMIN2 is active and QMIN0, QMIN1 and QMIN3 are inactive. Hence Y2 = C & Y0 = Y1 = Y3 = 0, and at the final output we receive Y = C.

**4.** When X is partially polarized light, i.e. 3 (↔), then only $x^3$ receives the logical state 3 (↔), and $x^0 = x^1 = x^2 = 0$ (no light). Hence only QMIN3 is active and QMIN0, QMIN1 and QMIN2 are inactive. Hence Y3 = D & Y0 = Y1 = Y2 = 0, and after combining, at the final output we receive Y = D. The truth table of this circuit is shown in Table 3 (second column).
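Abstractly, the multiplexer is the composition of the delta literal and QMIN stages followed by beam combination; the short Python sketch below (our own functional model, with the final combination taken as a MAX, since at most one branch is non-zero by equation (8)) verifies the second column of Table 3:

```python
# Functional sketch of the Fig. 11 QMUX: delta literals gate the four
# QMIN branches, and the beam combiner collects the single active branch.

def delta_literal(x, a, radix=4):
    """x^a = (radix - 1) if x == a, else 0 (Table 2 behaviour)."""
    return radix - 1 if x == a else 0

def qmin(*xs):
    return min(xs)   # equation (7)

def qmux(inputs, x):
    """4:1 quaternary multiplexer: route inputs[x] to the output."""
    branches = [qmin(a, delta_literal(x, i)) for i, a in enumerate(inputs)]
    return max(branches)   # only one branch is non-zero, per equation (8)

data = [2, 0, 3, 1]   # arbitrary illustrative values for A, B, C, D
assert all(qmux(data, x) == data[x] for x in range(4))
```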

#### **2.5. Design of All-optical Quaternary Demultiplexer (1:4):**


A quaternary demultiplexer has the opposite function of the QMUX: one input data is passed to one of the outputs according to the selection of the control. Fig. 12 shows the design of the 1:4 all-optical quaternary demultiplexer. Here one input A is fed to each of the four 2-input QMIN gates as one input light (the light from A is split by three beam splitters (BS) and fed to the four QMIN gates), and the other input of each QMIN gate is fed from one of the delta literal outputs (i.e. $x^0$, $x^1$, $x^2$ and $x^3$) respectively. These inputs of the QMIN gates act as the select line. This circuit acts in the same way as the multiplexer circuit, i.e. one QMIN gate is active (depending on the selection of the control line) and the other QMIN gates are inactive. The active QMIN gate passes the input data from A, which may be any one of the four logical states. The final outputs are taken from the four QMIN output lines (Y0, Y1, Y2 and Y3), as shown in Fig. 12. The truth table is shown in Table 3 (third column).


**Figure 12.** All optical Quaternary 1:4 Demultiplexer (QDEMUX).
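In the same functional model (again our own abstraction, not the optical implementation), each demultiplexer output line is just the QMIN of the data input with one delta literal:

```python
# Functional sketch of the Fig. 12 QDEMUX (same abstraction as the QMUX
# sketch above: delta literals select which QMIN branch passes the data).

def delta_literal(x, a, radix=4):
    return radix - 1 if x == a else 0

def qdemux(a, x):
    """1:4 quaternary demultiplexer: input A appears only on line Yx."""
    return [min(a, delta_literal(x, i)) for i in range(4)]

print(qdemux(a=2, x=1))   # [0, 2, 0, 0] -- A routed to Y1, per Table 3
```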


#### **3. Quaternary T-gate:**

In section 2.4 we reported the all-optical 4:1 quaternary multiplexer. It is also known as a 'T-gate' [82]. The schematic diagram of the quaternary T-gate is shown in Fig. 13, and some logic operations are given in Table 4.


The mathematical expression for the all-optical quaternary T-gate using MIN and delta literals can be written as:

$$O = \left( A \wedge x^{3000} \right) + \left( B \wedge x^{0300} \right) + \left( C \wedge x^{0030} \right) + \left( D \wedge x^{0003} \right) \tag{9}$$

where '∧' is the MIN operator (*x* ∧ *y* = minimum of (*x*, *y*)) and the *δ*-literal function is *x*<sup>*a*</sup> = (*R* − 1) if *x* = *a*, else 0.

**Figure 13.** All optical Quaternary T-gate.

| **Name of the function** | **Symbol & mathematical expression** | **Inputs ABCD (logical states)** | **Outputs** |
|---|---|---|---|
| Compliment / Inverter | *x̄* = (*R* − 1) − *x* | 3210 | *x̄* = 3210 |
| Successor | *Suc*(*x*) = (*x* + 1) mod 4 | 1230 | *Suc*(*x*) = 1230 |
| Clockwise Cycle | *x*⌢*b* = (*x* + *b*) mod 4 | 1230; 2301; 3012 | *x*⌢1 = 1230; *x*⌢2 = 2301; *x*⌢3 = 3012 |
| Counter Clockwise Cycle | *x*⌢*b*<sup>c</sup> = (*x* − *b*) mod 4 | 3012; 2301; 1230 | *x*⌢1<sup>c</sup> = 3012; *x*⌢2<sup>c</sup> = 2301; *x*⌢3<sup>c</sup> = 1230 |
| Literal | <sup>*a*</sup>*x*<sup>*b*</sup> = (*R* − 1) if *a* ≤ *x* ≤ *b*, 0 otherwise | 0330; 0333; 0033 | <sup>1</sup>*x*<sup>2</sup> = 0330; <sup>1</sup>*x*<sup>3</sup> = 0333; <sup>2</sup>*x*<sup>3</sup> = 0033 |
| Truncated Sum | *X* ⊞ *a* = (*X* + *a*) if *X* < (*R* − 1), (*R* − 1) otherwise | 1233 | *X* ⊞ 1 = 1233 |
| Truncated Difference | *X* ⊟ *a* = (*X* − *a*) if *X* ≥ *a*, 0 otherwise | 0012 | *X* ⊟ 1 = 0012 |
| Threshold / Step literal (up) | <sup>*a*</sup>*U*(*x*) = 1 if *x* ≥ *a*, 0 else | 0111; 0011 | <sup>1</sup>*U*(*x*) = 0111; <sup>2</sup>*U*(*x*) = 0011 |
| Threshold / Step literal (down) | <sup>*a*</sup>*D*(*x*) = 1 if *x* ≤ *a*, 0 else | 1100; 1110 | <sup>1</sup>*D*(*x*) = 1100; <sup>2</sup>*D*(*x*) = 1110 |

**Table 4.** Some well-known one input quaternary logical functions (radix *R* = 4) and design process with quaternary T-gate.
The four incoming data transmission lines are 'A', 'B', 'C' and 'D' [each of which can be any one of the 4 logical states, i.e. 0 (no light), 1 (↕), 2 (•), 3 (↔)] and 'X' is the selection input. By proper selection of X we can get any of the data inputs (A, B, C or D) at the output: if X = 0 the output is A, when X = 1 the output is B, for X = 2 the output is C, and when X = 3 the output is D, i.e. [82]:

$$T\left( A, B, C, D; x \right) = \begin{cases} A & \text{if } x = 0 \\ B & \text{if } x = 1 \\ C & \text{if } x = 2 \\ D & \text{if } x = 3 \end{cases} \tag{10}$$

This T-gate can successfully be used for designing any quaternary circuit, so it is called the 'universal' element of quaternary logic. Some quaternary logical operations with the T-gate are shown in Table 4; the inputs A, B, C and D of the T-gate are given in column 3 of that table, and X is the select input (X = 0123). Here the quaternary multiplexer or T-gate is all-optical in nature; hence all the quaternary circuits built from it are all-optical.
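As a complement to eqs. (9)-(10) and Table 4, the sketch below shows the T-gate acting as a universal element: any one-input quaternary function *f* is obtained by wiring A..D to *f*(0)..*f*(3). The same modelling assumptions as in the earlier sketch apply; in particular, the '+' of eq. (9) is again taken as a maximum because only one term is nonzero.

```python
R = 4

def delta(x, a):
    """Delta literal x^a: (R-1) if x == a, else 0."""
    return R - 1 if x == a else 0

def T(A, B, C, D, x):
    """Eq. (9): O = (A ∧ x^3000) + (B ∧ x^0300) + (C ∧ x^0030) + (D ∧ x^0003)."""
    return max(min(v, delta(x, i)) for i, v in enumerate((A, B, C, D)))

# Two rows of Table 4, realized by feeding the function's truth vector to A..D:
inverter  = lambda x: T(3, 2, 1, 0, x)   # inverter: (R-1) - x,  vector 3210
successor = lambda x: T(1, 2, 3, 0, x)   # Suc(x) = (x+1) mod 4, vector 1230

assert [inverter(x)  for x in range(4)] == [3, 2, 1, 0]
assert [successor(x) for x in range(4)] == [1, 2, 3, 0]
```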

#### **4. Challenges in designing the polarization encoded all-optical system:**

Here, in this proposed scheme, we have proposed and described all-optical circuits for designing the quaternary (four-valued) multiplexer and demultiplexer with the help of some polarization encoded basic quaternary logic gates (quaternary MIN and quaternary delta literal). It is important to note that the above discussions are based on a simple model. In order to experimentally achieve results from the proposed scheme, some design issues have to be considered: for example, the polarization properties of the fiber, predetermined values of the intensities, the wavelength of the laser light for the control and incoming signals, the introduction of filters, intensity losses due to beam splitters/fiber couplers, etc. The output logical states of every quaternary circuit can be determined by Stokes vector [S] measurement. The Stokes vector can be calculated from the measurement of six intensities (I<sub>i,j</sub>) in the photo detector (PD) by use of a linear analyzer (LA) followed by a quarter wave plate (*λ*/4 plate), as shown in Fig. 14. The formula for calculating the Stokes vector is [83]:

$$\begin{bmatrix} S \end{bmatrix} = \begin{bmatrix} S_0 \\ S_1 \\ S_2 \\ S_3 \end{bmatrix} = \sqrt{\frac{\mu_0}{\varepsilon_0}} \begin{bmatrix} I_{(0,\,0^{\circ})} + I_{(0,\,90^{\circ})} \\ I_{(0,\,0^{\circ})} - I_{(0,\,90^{\circ})} \\ I_{(0,\,45^{\circ})} - I_{(0,\,135^{\circ})} \\ I_{\left(\frac{\lambda}{4},\,45^{\circ}\right)} - I_{\left(\frac{\lambda}{4},\,135^{\circ}\right)} \end{bmatrix} \tag{11}$$


Where the first subscript ('*i*') indexes the absence or presence of the *λ*/4 plate and the second ('*j*') gives the azimuth of the analyzer; *μ*<sub>0</sub> and *ε*<sub>0</sub> are the free-space permeability and permittivity. The degree of polarization (DOP) is also calculated by the equation:

$$\text{DOP} = \frac{\sqrt{S\_1^2 + S\_2^2 + S\_3^2}}{S\_0} \tag{12}$$

The value of the DOP can be plotted on the Poincaré sphere at a point 'P'; for vertically (↕) and horizontally (•) polarized light we found OP = DOP = 1, and the point lies on the equator of the Poincaré sphere (at points *y* and *x* respectively).
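A small numeric example of eqs. (11)-(12) may help. In the sketch below the argument names are ours; the common factor √(μ₀/ε₀) scales every S<sub>i</sub> equally and cancels in the DOP, so it is set to 1. The six intensities of an ideal linearly polarized beam yield S = (1, 1, 0, 0) and DOP = 1, a point on the equator of the Poincaré sphere.

```python
import math

def stokes(I00, I090, I045, I0135, Iq45, Iq135):
    """Eq. (11) with sqrt(mu0/eps0) normalized to 1; the 'q' arguments are the
    two intensities measured through the lambda/4 plate."""
    S0 = I00 + I090
    S1 = I00 - I090
    S2 = I045 - I0135
    S3 = Iq45 - Iq135
    return S0, S1, S2, S3

def dop(S0, S1, S2, S3):
    """Eq. (12): degree of polarization."""
    return math.sqrt(S1**2 + S2**2 + S3**2) / S0

S = stokes(1.0, 0.0, 0.5, 0.5, 0.5, 0.5)   # ideal linear polarization along 0 degrees
print(S, dop(*S))                           # (1.0, 1.0, 0.0, 0.0) 1.0 -> fully polarized
```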

**Figure 14.** Measurement technique of output logical states.

In high speed data communication (50 Gb/s or Tb/s) a random change of polarization in a short time can produce power fluctuation at the output, so polarization dependent loss

(PDL) degrades the optical signal to noise ratio (OSNR) and also degrades the extinction ratio; a PDL of 3 dB could cause a 1 dB power penalty [84]. Optical depolarizers can be used to reduce the polarization-induced noise in optical sensing and measurement systems [85]. Again, random birefringence in optical fibers induces an unpredictable rotation of the state of polarization (SOP); this can be adjusted by using a polarization controller and PM fiber. Intrinsic cross talk between the two polarization states, imperfect polarization tracking after the transmission link, etc. may cause polarization mode dispersion (PMD), which introduces delay among the different states of polarization. The effects of PMD are expected to be similar to those of other approaches that have been studied in the literature [86]. Optical amplifiers degrade the signal-to-noise ratio (SNR) of the amplified signal because of the amplified spontaneous emission (ASE) added to the signal during amplification; OSNR errors arise in this process. For a polarized signal, polarization hole burning (PHB) will cause ASE polarization orthogonal to the signal polarization. Bruyere et al. [87] have shown that the DOP of ASE could exceed 70% in transoceanic links with low PMD. Of course, most of the ASE light does not traverse the entire light path, and then the OSNR error becomes less than 0.6 dB. The polarization-related problems discussed above would occur inside the considered circuit; they will not occur in the optical communication system once the signal comes out of the output. The state of polarization may be changed if it is passed through birefringent crystals or optically active substances. The significant advantage of this proposed scheme is that it is all-optical in nature and can easily and successfully be extended to higher order multiplexers and demultiplexers. As an example, for a 16:1 multiplexer, 16 select lines can be constructed from the outputs of two Delta Literal blocks and 16 QMIN gates. These select lines are then fed to another 16 QMIN gates as the first input, with the second input taken from the 16 data signals; by selecting the proper select line we can transfer any one of the 16 input signals to the output (a sketch of this construction is given below). This scheme is easily practicable as the components of our design are technically highly developed and widely used in optical communication. The proposed scheme will also work with other 2x2 interferometric switches (like the Mach-Zehnder interferometer).
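The 16:1 extension described above can be checked in the same behavioural model as before (names and the max-based combiner are again our assumptions). The two quaternary digits (x1, x0) of the select input each drive a delta-literal block; the pairwise QMIN of their outputs gives the 16 one-hot select lines that gate the 16 data inputs.

```python
import itertools

R = 4

def delta(x, a):
    """Delta literal x^a: (R-1) if x == a, else 0."""
    return R - 1 if x == a else 0

def qmux16(data, x1, x0):
    """16:1 QMUX: select line (i, j) is QMIN(x1^i, x0^j); each select line then
    gates one data input through a second QMIN stage."""
    assert len(data) == 16
    sel = [min(delta(x1, i), delta(x0, j)) for i in range(4) for j in range(4)]
    return max(min(s, d) for s, d in zip(sel, data))

data = [(7 * i) % 4 for i in range(16)]      # arbitrary quaternary test pattern
for x1, x0 in itertools.product(range(4), repeat=2):
    assert qmux16(data, x1, x0) == data[4 * x1 + x0]
```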

#### **5. Conclusions and Future Scopes:**


Present-day digital signal processing is based on the conventional binary number system (radix = 2), with two logical states, 'LOW' and 'HIGH'. Binary logic (or any logic, for that matter) is NOT a law of nature; the reason why binary logic seems more natural is that we have been more exposed to it. Multiple-valued logic, in this perspective, is more an extension of binary logic, quite conventional though broader in possibilities. In a wide sense, a signal may be anything that can be observed to have states that change in time and space; in a narrow sense, a signal is a physical quantity that can be measured, usually by an electronic device. Signals, as conveyers of information about the state of a system, should be processed to extract and understand the encoded information content. Nowadays digital systems, and sometimes mixed-signal systems, are prevalent in information transmission, storage, and processing. However, the enormous, and ever increasing, amounts of information that must be handled, even in everyday life, focus attention on multiple-valued (MV) logic, which permits more compact encoding of information within the same number of digits. Although it has certain considerable demerits, multiple-valued logic is viewed as a promising alternative in many practical solutions. Many contemporary logic design technologies are oriented towards supporting efficient implementation of various signal processing algorithms, and in order to entirely exploit all the available resources, sophisticated methods are required. Humans count by tens and machines count by twos; this sums up the way we do arithmetic today. However, there are countless other ways to count. Challenges and opportunities are wide.

#### **Author details**

Jitendra Nath Roy1\* and Tanay Chattopadhyay2

\*Address all correspondence to: jnroys@yahoo.co.in

1 Department of Physics, National Institute of Technology, Agartala, Jirania, Tripura, India

2 Mechanical operation (stage-II), Kolaghat Thermal Power station, WBPDCL, West Bengal, India

#### **References**

[1] Woods, D., & Naughton, T. J. (2012). Photonic neural networks. *Nature Physics*, 8, 257-259, doi: 10.1038/nphys2283.

[2] Caulfield, H. J., & Dolev, S. (2010). Why future supercomputing requires optics. *Nature Photonics*, 4, 262-263, doi: 10.1038/nphoton.2010.94.

[3] Caulfield, H. J., Dolev, S., & Green, W. M. J. (2009). *Appl. Opt. A*, 48 (Optical High-Performance Computing feature issue), http://dx.doi.org/10.1364/AO.48.0OHPC1.

[4] Roy, S., Prasad, M., Topolancik, J., & Vollmer, F. (2010). All optical switching with bacteriorhodopsin protein coated microcavities and its application to low power computing circuits. *Journal of Applied Physics*, 107, 053115, http://dx.doi.org/10.1063/1.3310385.

[5] Reed, T., Mashanovich, G., Gardes, F. Y., & Thomson, D. J. (2010). Silicon optical modulators. *Nature Photonics*, 4, 518-526, doi: 10.1038/nphoton.2010.179.

[6] Hurst, S. L. (1984). Multiple-Valued Logic-Its Status and its Future. *IEEE Transactions on Computers*, C-33(12), 1160-1179, doi: 10.1109/TC.1984.1676392.

[7] Lohmann, A. W. (1988). Polarization and optical logic. *Applied Optics*, 25, 1594-1597.

[8] Fagotto, E. A. M., & Abbade, M. L. F. (2009). All-optical demultiplexing of 4-ASK optical signals with four-wave mixing optical gates. *Optics Communications*, 283, 1102-1109, doi: 10.1016/j.optcom.2009.10.094.

[9] Khan, M. H. A., & Perkowski, M. A. (2007). GF(4) based synthesis of quaternary reversible/quantum logic gate. *Proc. of the 37th International Symposium on Multiple-Valued Logic, ISMVL'07, IEEE*, doi: 10.1109/ISMVL.2007.26.

[10] Obiniyi, A. A., Absalom, E. E., & Adako, K. (2011). Arithmetic Logic Design with Color-Coded Ternary for Ternary Computing. *International Journal of Computer Applications*, 26(11), 31-37, 0975-8887, doi: 10.5120/3162-2929.

[11] Garai, S. K., Pal, A., & Mukhopadhyay, S. (2010). All-optical frequency-encoded inversion operation with tristate logic using reflecting semiconductor optical amplifiers. *Optik*, 121, 1462-1465, doi: 10.1016/j.ijleo.2009.02.011.

[12] Stakhov, A. (2002). Brousentsov's ternary principle, Bergman's number system and ternary mirror-symmetrical arithmetic. *The Computer Journal*, 45(2), 221-236, doi: comjnl/45.2.221.

[13] Walklin, S., & Conradi, J. (1997). A 10 Gb/s 4-ary ASK lightwave system. *ECOC 97, IEE Conference Publication*, 448(448), 255-258, doi: cp:19971538.

[14] Liu, S., Li, C., Wu, J., & Lin, Y. (1989). Optoelectronic multiple-valued logic implementation. *Optics Letters*, 14(14), 713-715, http://dx.doi.org/10.1364/OL.14.000713.

[15] Awwal, A. A. S., Karim, M. A., & Cherri, A. K. (1987). Polarization-encoded optical shadow-casting scheme: design of multi-output trinary combinational logic units. *Applied Optics*, 26(22), 4814-4818, http://dx.doi.org/10.1364/AO.26.004814.

[16] Imai, Y., & Ohtsuka, Y. (1987). Optical multiple-output and multiple-valued logic operation based on fringe shifting techniques using a spatial light modulator. *Applied Optics*, 26(2), 274-277, doi: AO.26.000274.

[17] Li, C., & Yan, J. (2011). Design Method and Implementation of Ternary Logic Optical Calculator. *Advances in Information and Communication Technology*, 347, 147-166, doi: 10.1007/978-3-642-18369-0\_17.

[18] Chattopadhyay, T. (2010). All-optical symmetric ternary logic gate. *Optics and Laser Technology*, 42, 1014-1021, doi: 10.1016/j.optlastec.2010.01.023.

[19] Song, K., & Yan, L. (2012). Design and implementation of the one step MSD adder of optical computer. *Applied Optics*, 51(7), 917, doi: AO.51.000917.

[20] Chattopadhyay, T., Maity, G. K., & Roy, J. N. (2008). Designing of all-optical tri-state logic system with the help of optical nonlinear material. *Journal of Nonlinear Optical Physics & Materials*, 17(3), 315-328, doi: S0218863508004159.

[21] Chattopadhyay, T., & Roy, J. N. (2011). Semiconductor optical amplifier (SOA)-assisted Sagnac switch for designing of all-optical tri-state logic gates. *Optik International Journal for Light and Electron Optics*, 122(12), 1073-1078, doi: 10.1016/j.ijleo.2010.06.045.

[22] Chattopadhyay, T., & Roy, J. N. (2011, March 26-27). All-optical multi-valued computing: the future challenges and opportunities. Kolkata. *International Conference on Convergence of Optics and Electronics (COE 11)*, 94-101, 978-8-19064-011-4.

[23] Vasundara Patel, K. S., & Gurumurthy, K. S. (2010). Quaternary Sequential Circuits. *IJCSNS International Journal of Computer Science and Network Security*, 10(7), 110, doi: 10.1109/ICCIT.2009.5407139.

[24] Yi, J., Huacan, H., & Yangtian, L. (2005). Ternary computer architecture. *Physica Scripta*, T118(98), 101, doi: 10.1238/Physica.Topical.118a00098.

[25] Roy, J. N. (2009). Mach-Zehnder interferometer-based tree architecture for all-optical logic and arithmetic operations. *Optik-International Journal for Light and Electron Optics*, 120(7), 218-324, http://dx.doi.org/10.1016/j.ijleo.2007.09.004.

[26] Khan, M. H. A., Siddika, N. K., & Perkowski, M. A. (2008). Minimization of quaternary Galois field sum of products expression for multi-output quaternary logic function using quaternary Galois field decision diagram. *Proceedings of the 38th International Symposium on Multiple-Valued Logic, IEEE*, 125-130, doi: 10.1109/ISMVL.2008.31.

[27] Li, G., Qian, F., Ruan, H., & Liu, L. (1999). Compact Parallel Optical Modified-Signed-Digit Arithmetic-Logic Array Processor with Electron-Trapping Device. *Applied Optics*, 38(23), 5039-5045, http://dx.doi.org/10.1364/AO.38.005039.

[28] Chattopadhyay, T., & Sarkar, T. (2012). All-optical switching by Kerr nonlinear prism and its application to binary-to-gray-to-binary code conversion. *Optics & Laser Technology*, 44(6), 1722-1728, http://dx.doi.org/10.1016/j.optlastec.2012.02.007.

[29] Jung, Y. J., Lee, S., & Park, N. (2008). All-optical 4-bit gray code to binary coded decimal converter. *Proc. of SPIE*, 6890, 68900S, doi: 10.1117/12.762429.

[30] Nakamura, T., Kani, J. I., Teshima, M., & Iwatsuki, K. (2004). A quaternary amplitude shift keying modulator for suppressing initial amplitude distortion. *IEEE Journal of Lightwave Technology*, 22(3), 733-738, doi: 10.1109/JLT.2004.824465.

[31] Awwal, A. A. S., & Ahmed, J. U. (1993). Fast Carry Free Adder Design Using QSD Number System. *IEEE*, 1085-1088, CH3306-8/93/0000-1085, doi: 10.1109/NAECON.1993.290791.

[32] Awwal, A. A. S., & Ahmed, J. U. (1994). Two-bit restricted signed-digit quaternary full adder. *Proc. of IEEE*, 1119-1125, doi: 10.1109/NAECON.1994.332917.

[33] Cherri, A. K., & Al-Zayed, A. S. (2009). Circuit designs of ultra-fast all-optical modified signed-digit adders using semiconductor optical amplifier and Mach-Zehnder interferometer. *Optik*, in press, doi: 10.1016/j.ijleo.2009.02.029.

[34] Ghosh, A. K., Bhattacharya, A., Raul, M., & Basuray, A. (2012). Trinary arithmetic and logic unit (TALU) using savart plate and spatial light modulator (SLM) suitable for optical computation in multivalued logic. *Optics & Laser Technology*, 44(5), 1583-1592, doi: 10.1016/j.optlastec.2011.11.044.

[35] Jahangir, I., Hasan, D. M. N., & Reza, M. S. (2009, 23-26 Jan.). Design of Some Quaternary Combinational Logic Blocks Using a New Logic System. Singapore. *TENCON '09, IEEE Region 10 Conference*, 1-6, doi: 10.1109/TENCON.2009.5396095.

[36] Smith, K. C. (1988). Multiple-valued logic: a tutorial and appreciation. *IEEE Computer*, 21(4), 1160-1179, doi: 10.1109/2.48.

[37] Cunha, R., Boudinov, H., & Carro, L. (2007). Quaternary look-up tables using voltage mode CMOS logic design. *Proceedings of the 37th International Symposium on Multiple-Valued Logic (ISMVL'07), IEEE*, 0-7695-2831-7, doi: 10.1109/ISMVL.2007.47.

[38] Shanbhag, N. R., Nagchoudhuri, D., Ferd, R. E. S., & Visweswaran, G. S. (1990). Quaternary logic circuits with 2-μm CMOS technology. *IEEE Journal of Solid-State Circuits*, 25(3), 790-798, doi: 10.1109/4.102677.

[39] Park, S. J., Yoon, B. H., Sub Yoon, K., & Kim, H. S. (2004). Design of Quaternary Logic Gate Using Double Pass-transistor Logic with neuron MOS Down Literal Circuit. *Proceedings of the 34th International Symposium on Multiple-Valued Logic (ISMVL2004)*, 198-203, doi: 10.1109/ISMVL.2004.1319941.

[40] Kerkhoff, H. G., & Tervoert, M. L. (1981). Multiple-valued logic charge-coupled devices. *IEEE Transactions on Computers*, C-30(9), 644-652, http://doi.ieeecomputersociety.org/10.1109/TC.1981.1675862.

[41] Yasuda, Y., Tokuda, Y., Zaima, S., Pak, K., Nakamura, T., & Yoshida, A. (1986). Realization of Quaternary Logic Circuits by n-Channel MOS Devices. *IEEE Journal of Solid-State Circuits*, SC-21(1), 162-168, doi: 10.1109/JSSC.1986.1052493.

[42] Kerkhoff, H. G., & Tervoert, M. L. (1981). Multiple-Valued Logic Charge-Coupled Devices. *IEEE Transactions on Computers*, C-30(9), 644-652, http://doi.ieeecomputersociety.org/10.1109/TC.1981.1675862.

[43] Brilman, M., Etiemble, D., Oursel, J. L., & Tatareau, P. (1982). A 4-valued ECL encoder and decoder circuit. *IEEE Journal of Solid-State Circuits*, SC-17(3), 547-552, doi: 10.1109/TC.1986.1676733.

[44] Mangin, J. L., & Current, K. W. (1986). Characteristics of prototype CMOS quaternary logic Encoder-Decoder circuits. *IEEE Transactions on Computers*, C-35(2), 157-161, doi: 10.1109/TC.1986.1676733.

[45] Etiemble, D., & Israël, M. (2002). A current-mode folding/interpolating CMOS analog to quaternary encoding block. *Proc. of 32nd IEEE International Symposium on Multiple-Valued Logic (ISMVL2002)*, 0-7695-2831-7, http://doi.ieeecomputersociety.org/10.1109/ISMVL.2002.1011099.

[46] Chan, H. L., Mohan, S., & Haddad, G. I. (1996). Compact Multiple-Valued Multiplexers Using Negative Differential Resistance Devices. *IEEE Journal of Solid-State Circuits*, 31(8), 1151-1155, doi=10.1.1.136.8146.

[47] Chattopadhyay, T., Bhowmik, P., & Roy, J. N. (2012). Polarization encoded optical N-valued inverter. *JOSA B*, accepted.

[48] Keshavarzian, P., & Mirzaee, M. M. (2012). A novel, efficient CNTFET Galois design as a basic ternary-valued logic field. *Nanotechnology, Science and Applications*, 5, 1-11, http://dx.doi.org/10.2147/NSA.S27550.

[49] Khorasaninejad, M., & Saini, S. S. (2011). All-optical logic gate in silicon nanowire optical waveguides. *IET Circuits, Devices and Systems*, 5(2), 115-122, doi: 10.1049/iet-cds.2010.0142.

[50] Glesk, I., Runser, R. J., & Prucnal, P. R. (2001). New generation of devices for all-optical communication. *Acta Physica Slovaca*, 51(2), 151-162, http://dx.doi.org/10.1117/12.498224.

[51] Wang, B. C., Baby, V., Tong, W., Xu, L., Friedman, M., Glesk, I., Runser, R. J., & Prucnal, P. R. (2002). A novel fast optical switch based on two cascaded terahertz optical asymmetric demultiplexers (TOAD). *Optics Express*, 10(1), 15-23, http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-10-1-15.

[52] Houbavlis, T., & Zoiros, K. E. (2004). Numerical simulation of semiconductor optical amplifier assisted Sagnac gate and investigation of its switching characteristics. *Optical Engineering*, 43(7), 1622-1627, http://dx.doi.org/10.1117/1.1751132.

[53] Zoiros, K. E., Chasioti, R., Koukourlis, C. S., & Houbavlis, T. (2007). On the output characteristics of a semiconductor optical amplifier driven by an ultrafast optical time division multiplexing pulse train. *Optik*, 118(3), 134-146, doi: 10.1016/j.ijleo.2006.01.012.

[54] Shen, Z. Y., & Wu, L. L. (2008). Reconfigurable optical logic unit with a terahertz optical asymmetric demultiplexer and electro-optic switches. *Applied Optics*, 47(21), 3737-3742, doi: 10.1364/AO.47.003737.

[55] Roy, J. N., Maity, A. K., & Mukhopadhyay, S. (2006). Designing of an all-optical time division multiplexing scheme with the help of non linear material based tree-net architecture. *Chinese Optics Letters*, 4(8), 483-486.

[56] Karim, M. A., & Awwal, A. A. S. (2003). *Optical Computing: An Introduction*, Wiley, New York, Chap. 6.

[57] Chattopadhyay, T., & Roy, J. N. (2009). An all-optical technique for a binary-to-quaternary encoder and a quaternary-to-binary decoder. *J. Opt. A: Pure Appl. Opt.*, 11, doi: 10.1088/1464-4258/11/7/075501.

[58] Shen, Z. Y., Wu, L., & Yan, J. (2012). The reconfigurable module of ternary optical computer. *Optik International Journal for Light and Electron Optics*, http://dx.doi.org/10.1016/j.ijleo.2012.03.081.

[59] Chattopadhyay, T., & Roy, J. N. (2009). All-optical conversion scheme: binary to quaternary and quaternary to binary number. *Optics & Laser Technology*, 41(3), 289-294, http://dx.doi.org/10.1016/j.optlastec.2008.06.003.

[60] Taraphdar, C., Chattopadhyay, T., & Roy, J. N. (2009). Polarization Encoded All-optical Ternary Max Gate. Kolkata. *International Conference on Computers and Devices for Communication CODEC-09*, Paper ID OLT-3512, 978-1-42445-073-2, INSPEC Accession Number 11136770.

[61] Taraphdar, C., Chattopadhyay, T., & Roy, J. N. (2011). All-optical integrated ternary MIN and MAX gate. Kolkata, India. *Proceedings of International Conference on Trends in Optics and Photonics (IconTOP 2011)*, 476-481, 978-81-908188-1-0.

[62] Bhowmik, P., Roy, J. N., & Chattopadhyay, T. (2011). Designing of all-optical two input ternary MAX logical operation. *National Conference of Photonics and Nano Sciences, Dept. of Physics, Garhbeta College*, 51-58.

[63] Roy, J. N., Chattopadhyay, T., Manna, S., & Maity, G. K. (2008). Polarization encoded all-optical quaternary MAX gate. IIT Delhi, India. *PHOTONICS '08, International Conference on Fiber Optics and Photonics*, 1-4.

[64] Chattopadhyay, T., & Roy, J. N. *Quaternary MAX gate and its applications in all-optical domain*, unpublished.

[65] Bhowmik, P., Chattopadhyay, T., Taraphdar, C., & Roy, J. N. (2011, March 26-27). Designing of All Optical Circuit for Two Input Ternary MIN Logical Operation. Kolkata. *International Conference on Convergence of Optics and Electronics (COE 11)*, 94-101, 978-8-19064-011-4.

[66] Chattopadhyay, T., & Roy, J. N. (2009). Polarization encoded all-optical quaternary multiplexer and demultiplexer - a proposal. *Optik International Journal for Light and Electron Optics*, 120, 941-946, http://dx.doi.org/10.1016/j.ijleo.2008.03.030.

[67] Chattopadhyay, T., & Roy, J. N. (2010). Polarization encoded TOAD based all-optical quaternary literals. *Optik International Journal for Light and Electron Optics*, 121, 617-622, http://dx.doi.org/10.1016/j.ijleo.2008.09.014.

[68] Taraphdar, C., Chattopadhyay, T., & Roy, J. N. (2011). Designing of an all-optical scheme for single input ternary logical operations. *Optik International Journal for Light and Electron Optics*, 122(1), 33-36, http://dx.doi.org/10.1016/j.ijleo.2009.09.016.

[69] Chattopadhyay, T., & Roy, J. N. Polarization encoded four valued ordinary inverter. *XXXVI OSI Symposium on Frontiers in Optics and Photonics (FOP 11), IIT*, P 99, 978-81-309-1964-5.

[70] Chattopadhyay, T., & Roy, J. N. (2012). All-optical ordinary quaternary inverter (QNOT) using binary NOT gate. *Optik International Journal for Light and Electron Optics*, in press, doi: 10.1016/j.ijleo.2012.01.035.

[71] Chattopadhyay, T., & Roy, J. N. (2010). Polarization Encoded All-optical Quaternary Universal Inverter and Designing of Multi-valued Flip-flop. *Optical Engineering*, 49(3), 035201, doi: 10.1117/1.3362897.

[72] Chattopadhyay, T., & Roy, J. N. (2011). Polarization encoded all-optical quaternary successor with the help of SOA assisted Sagnac switch. *Optics Communications*, 284(12), 2755-2762, doi: 10.1016/j.optcom.2011.02.005.

[73] Chattopadhyay, T., & Roy, J. N. (2011). All-optical quaternary Galois field sum of product (GFSOP) circuits. *Optik International Journal for Light and Electron Optics*, 122(9), 758-763, http://dx.doi.org/10.1016/j.ijleo.2010.06.002.

[74] Chattopadhyay, T., & Roy, J. N. (2010, 27-28 March). All-optical quaternary half-adder circuit with the help of terahertz optical asymmetric demultiplexer (TOAD). Burdwan. *National Conference on Materials, Devices and Circuits in Communication Tech. (MDCCT'2010)*, TS. 4.12, 50.

[75] Chattopadhyay, T., Das, M. K., Roy, J. N., Chakraborty, A. K., & Gayen, D. K. (2011). Interferometric switch based all optical scheme for conversion of binary number to its quaternary signed digit form. *IET Circuits, Devices and Systems (special issue on 'Optical Computing Circuits, Devices and Systems')*, 5(2), 132-142, doi: 10.1049/iet-cds.2010.0056.

[76] Chattopadhyay, T., & Roy, J. N. (2011). Easy conversion technique of binary to quaternary signed digit and vice versa. *Physics Express*, 1(3), 165-174.

[77] Chattopadhyay, T., & Roy, J. N. (2009, March 1-4). All-optical conversion of binary number to quaternary signed digit (QSD) number. Kolkata, India. *Proceedings of International Conference on Trends in Optics and Photonics (IconTOP 2009)*, 130-137.

[78] Chattopadhyay, T., & Roy, J. N. (2011, 28th February - 1st March). All-optical carry free addition using quaternary signed digit (QSD). *18th West Bengal State Science & Technology Congress*, 1(2), 3-4.

[79] Chattopadhyay, T., Taraphdar, C., & Roy, J. N. (2009). Quaternary Galois field adder based all-optical multivalued logic circuits. *Applied Optics (feature issue on 'optical high-performance computing')*, 48(22), E35-E44, http://dx.doi.org/10.1364/AO.48.000E35.

[80] Chattopadhyay, T., Roy, J. N., & Chakraborty, A. K. (2009). Polarization encoded all-optical quaternary R-S flip-flop using binary latch. *Optics Communications*, 282, 1287-1293, doi: 10.1016/j.optcom.2008.12.022.

[81] Taraphdar, C., Chattopadhyay, T., & Roy, J. N. (2011). Designing of polarization encoded all-optical ternary multiplexer and demultiplexer. *Recent Patents on Signal Processing*, 1(2), 143-155, doi: 10.2174/1877612411101020143.

[82] Chattopadhyay, T. (2010). All-optical quaternary circuits using quaternary T-gate. *Optik International Journal for Light and Electron Optics*, 121, 1784-1788, doi: 10.1016/j.ijleo.2009.04.014.

[83] Domanski, A. W. (2005). Polarization degree fading during propagation of partially coherent light through retarders. *Opto-Electronics Review, 7th International Workshop on Nonlinear Optics Applications*, 13(2), 171-176.

[84] Mecozzi, A., & Shtaif, M. (2002). The statistics of polarization dependent loss in optical communication systems. *IEEE Photonics Technology Letters*, 14(3), 313-315, doi: 10.1109/68.986797.

[85] Nelson, L. E., Nielson, T. N., & Kogelnik, H. (2001). Observation of PMD-induced coherent crosstalk in polarization-multiplexed transmission. *IEEE Photonics Technology Letters*, 13(7), 738-740, doi: 10.1109/68.930432.

[86] Tang, J. M., & Shore, K. A. (1998). Strong picosecond optical pulse propagating in semiconductor optical amplifiers at transparency. *IEEE Journal of Quantum Electronics*, 34(7), 1263-1269, doi: 10.1109/3.687871.

[87] Bruyere, F., & Andouin, O. (1994). Penalties in long-haul optical amplifiers systems due to polarization dependent loss and gain. *IEEE Photonics Technology Letters*, 6(5), 654-656, doi: 10.1109/68.285570.


**Section 3**

**Image and Video Processing**


### **Video Encoder Implementation on Tilera's TILEPro64™ Multicore Processor**

José Parera-Bermúdez, Javier Casajús-Quirós and Igor Arambasic

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/53429

> © 2013 Parera-Bermúdez et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **1. Introduction**

Moore's law states that the number of transistors on integrated circuits approximately doubles every two years. This trend has held since its description in 1965. But this exponential growth in transistor count does not always translate into similar growth of CPU performance; issues such as power density, total power and intra-chip distances are preventing clock speeds above 4.5 GHz. During the past decades advances in semiconductor technology and architecture have overcome the obstacles, but at present there is no alternative technology and all the possibilities of micro-parallelism seem to have been explored. Another major issue is that the speed of dynamic memory has not grown with the same strength as the CPU's speed, while static memory is prohibitively expensive for widespread use.

The solution being put into practice is the use of the so called multicore CPU, i.e. the integra‐ tion of multiple cores on a single chip. Today nearly all computers, including desktops and laptops, are equipped with CPUs with at least 2 cores and it is not uncommon to see servers with 8 or 16 cores.

The evolution and the steady decrease in the price of technology have enabled the digital video to be a media component included on any device from small pocket players to profes‐ sional projection equipment on movie theaters. Today the *de facto* standard for video coding is ITU-T/ISO H.264 /MPEG-4 Part 10 or AVC (Advanced Video Coding) [1]. Since its first publication, back in 2003, it has become one of the most commonly used formats due to its flexibility to be applied to a wide variety of applications on a wide variety of networks and systems, including low and high bit rates, low and high resolution video, broadcast, DVD

storage, RTP/IP packet networks, multimedia telephony systems, etc. In 2004 the standard was extended to enable higher quality video coding by adding several new features (increased sample bit depth precision, higher-resolution color information, adaptive switching between 4x4 and 8x8 integer transforms...) required by professional applications.


H.264 performs significantly better than any prior standard under a wide variety of circum‐ stances in a wide variety of application environments, and outperforms MPEG-2 video, the DVD standard for movies, typically obtaining the same quality at half the bit rate or less, especially on high bit rate and high resolution situations.

Like other ITU-T standards, H.264 only specifies the syntax of the bitstream and the decoding procedures for reconstructing the video images; the encoding process is not specified at all, allowing the use of different approaches, algorithms and optimizations as long as the bitstream is syntactically correct. Unlike previous standards, it was designed with its implementation in mind, avoiding complex calculations and favoring the use of just adders and shifters; nevertheless, encoding is far more involved than decoding.
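The point about adders and shifters can be made concrete with the well-known 4x4 forward core transform of H.264: its matrix contains only ±1 and ±2, so a fixed-point or hardware implementation needs nothing beyond additions, subtractions and single-bit shifts. The sketch below is an illustration of that property only, not part of the encoder described later; in a real codec the scaling that completes the transform is folded into quantization.

```python
# H.264 4x4 forward core transform matrix: entries are only +/-1 and +/-2,
# so every multiplication is at most one left shift.
C = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def transpose(P):
    return [list(row) for row in zip(*P)]

def forward_core_transform(X):
    # W = C * X * C^T
    return matmul(matmul(C, X), transpose(C))

block = [[58, 64, 51, 58],
         [52, 64, 56, 66],
         [62, 63, 61, 64],
         [59, 51, 63, 69]]
print(forward_core_transform(block)[0][0])   # DC coefficient = sum of all 16 samples
```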

It is easy to find lots of papers and books dealing with almost every aspect of H.264; there are also countless proprietary and open source software libraries and custom hardware implementations, particularly for the consumer market. Therefore, what is special about the implementation described in the following paragraphs? In brief, the remarkable aspects are:

**•** It is targeted to very high quality with very low latency,

**•** It is a software-only solution, and

**•** The hardware is based on a commercial off-the-shelf multicore processor: the TILE*Pro*64™ from Tilera Corporation.


The performance achieved allows encoding 4K (*DLP Cinema Technology*, 4096x1716 pixels, 24 frames per second) video, the current standard for digital cinema, in real time with just one frame latency.

The study has been undertaken as part of a project that develops optimized hardware for those applications in which real time analysis-synthesis of high definition image streams is needed. The project was led by Datatech (www.datatech-sda.com) and it focused on a particular case of search & track applications for the aerospace segment, namely automatic refueling of flying military aircraft.

#### **2. System architecture**

Figure 1 shows the hardware building blocks of the system. As can be seen, the TILE processor is the very heart of the system and only some adapting logic is required to deal with the camera output; the I/O capabilities of the processor, including the Gigabit Ethernet interface, do the rest.

From a logical point of view, the system behaves as a standalone RTSP (*Real-Time Streaming Protocol*) server [2] that packetizes the video encoded data for RTP delivering [3] over an IP network.

**Figure 1.** Hardware System Architecture


#### **2.1. The TILE***Pro***64™ processor**

The TILE*Pro*64™ [4], the second generation of Tilera's processors, is a fully programmable 64-core processor organized as a two-dimensional array (8x8) of processing elements (each referred to as a tile), connected through the iMesh™, a bunch of two-dimensional mesh net‐ works. The processor also integrates external memory and I/O interfaces connected to the tiles via the iMesh™ interconnect fabric.

Each tile contains a Processor Engine, a Cache Engine, and a Switch Engine, which combine to make a powerful, full-featured compute engine.

**•** The Processor Engine is a conventional 32 bit VLIW (Very Long Instruction Word) processor with three instructions per bundle and full memory management, protection, and OS support, configuring a powerful, full-featured computing system that can independently run a Linux operating system. The Tile Processor includes special instructions to support commonly-used embedded operations in DSP, video and network packet processing, including: hashing and checksums, instructions to accelerate encryption, SIMD (Single Instruction Multiple Data) instructions for sub-word parallelism, saturating arithmetic, multiply-accumulate (MAC) instructions, sum of absolute differences (SAD), and unaligned access acceleration (a small reference sketch of SAD follows this list). All arithmetic instructions are of integer type because there is no floating-point unit.

**•** The Cache Engine contains the tile's Translation Lookaside Buffers (TLBs), caches, and cache sequencers. Each tile has 16KB L1 instruction cache; 8KB L1 data cache, and a 64KB unified L2 cache. This delivers a total of 5.5 MB of on chip memory. The cache can be con‐ figured as coherent or incoherent; in the first case, the hardware automatically maintains the consistency of data between processors, converting all on chip memory in a sort of L3 unified cache. Each tile also contains a DMA engine that works together with the cache engine for orchestrating memory data streaming between tiles and external memory, and among the tiles.

Figure 2 shows a block diagram of the processor:

**Figure 2.** TILE*Pro*64™ Block Diagram


**•** The Switch Engine implements six independent networks. The Switch Engine switches scalar data between tiles through the Static Network (STN) with very low latency. Five dynamic networks (UDN, TDN, MDN, CDN and IDN) aid the Switch Engine by routing packet-based data among tiles, tile caches, external memory, and I/O controllers. Of the five dynamic networks, only the User Dynamic Network (UDN) is user-visible; the others are used to satisfy cache misses from external memory and other tiles, and for various system-related functions. The Static Network in addition to the five Dynamic Networks comprise the interconnect fabric of the Tilera iMesh™. The user does not explicitly need to manage these networks; rather they are used by the system software to efficiently im‐ plement the application-level API abstractions, such as user-generated inter-process sock‐ et-like streams.
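As an aside on the SAD instruction mentioned in the Processor Engine description: the sum of absolute differences is the inner-loop metric of H.264 motion estimation, comparing a current block against a candidate reference block, which is why hardware support for it matters here. A minimal reference version, in plain Python and illustrative only:

```python
# Sum of absolute differences between two equally sized 2D blocks.
def sad(block_a, block_b):
    return sum(abs(a - b)
               for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))

cur = [[16, 18], [20, 22]]
ref = [[15, 19], [21, 20]]
print(sad(cur, ref))   # |16-15| + |18-19| + |20-21| + |22-20| = 5
```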

It is noteworthy that all the cores are identical forming a homogeneous architecture that con‐ trasts with other notable multicore processors such as the Cell Broadband Engine (from So‐ ny, Toshiba and IBM) or the DaVinci from Texas Instruments. As a result, programming is easier, more portable and more easily scalable. Furthermore, the combination of cores and interconnecting network enable different kinds of parallelism: fine or coarse grain, sharedmemory multithreading, message passing multitasking, etc. making the architecture suita‐ ble for a broad range of parallel problems.

The TILE*Pro*64™ supports the following primary external interfaces:


**•** Memory: four memory interface channels, each supporting 64-bit DDR2 DRAM up to 800 Mbps, for a peak total bandwidth of 25.6 GB/s. The memory controllers are on-chip.

**•** PCIe: two 4-lane PCI Express ports configurable as 4-lane, 2-lane or 1-lane (4x, 2x, 1x), with integrated MACs, supporting both root complex and endpoint modes.

**•** 10Gb Ethernet: two full-duplex XAUI-based 10Gb ports with integrated MACs.

**•** 10/100/1000 Ethernet: two on-board RGMII 10/100/1000 Ethernet MACs.

**•** Flexible I/O: 64 bits of dedicated Flexible I/O for programmable I/O and interrupt support, with frequency up to 150 MHz and streaming capability.

**•** HPI: 16-bit host port interface.

**•** UART, I2C and SPI ROM.

#### **2.2. The multicore development environment**

The Tilera MDE [5] provides a complete software environment, including the system software stack, a variety of helpful software libraries, and standard Linux command-line utilities. The execution environment includes three layers: the hypervisor, the client operating system (Linux), and user space:

**•** The hypervisor is the lowest layer of the software stack. It abstracts hardware details of the processor, manages communication between tiles and from tiles to I/O controllers, and provides low-level virtual-memory support. This layer also provides I/O drivers that run on dedicated tiles and, therefore, do not run Linux or user-space applications. The running drivers, the tiles on which they run and their parameters can be configured at boot time.

**•** The supervisor layer, composed of SMP Linux, provides system calls and I/O devices for user-space applications and libraries. This layer enables multi-process applications and multi-threaded processes to exploit multiple tiles for increased performance. The OS software manages hardware resources and provides higher-level services, such as processes and virtual memory allocation.

**•** The application layer runs user-space programs that can invoke Linux system calls and link against standard libraries just as on any other Linux platform. Tilera provides the standard C/C++ run-time and other processor-specific libraries.

The MDE also provides a complete suite of tools for all phases of program development, starting with authoring or porting an application, through debugging, and into performance evaluation. These tools include:

**•** A C/C++ compiler, assembler and linker, and other standard Unix tools. This tool chain is compatible with that of GNU; specifically, the compiler supports the ANSI C99 standard as well as GNU extensions. The tools enable the use of portable source code, easy to program and with the same concurrency support as that available for Intel x86 processors in a Linux box.

**•** A standard, open-source gdb debugger with support for the Tile Processor architecture.

**•** A custom version of the open-source Eclipse IDE providing a GUI interface for all stages of program development: authoring, building, running, debugging and profiling.

**•** A software simulator that provides cycle-accurate profiling and tracing.


#### **3. Encoder parallelization**

Programming a parallel application is not an easy job. The design space is enormous: different kinds of parallelism, data granularity, tools… The algorithm being implemented, the performance objectives and the computing platform impose some constraints but do not determine the design choices. Fortunately, Tilera's platform supports a broad range of possibilities. See, for example, [6], which explores several alternatives for the H.264 encoder.

The design of the application has completed all four typical stages established by best practices: task decomposition, assignment, orchestration and distribution [7]. The following paragraphs detail the course of action.

#### **3.1. Task decomposition**


#### *3.1.1. H.264 encoder procedures*

An H.264 encoder [8][9][10] consists of a few basic procedures; besides these, the standard defines a wide range of ancillary techniques, many of them optional, designed to provide enough flexibility to be applied to multiple scenarios. The encoder structure (see figure 3) does not differ substantially from that of other encoders, but its many details and subtleties allow for much more efficient compression.

**Figure 3.** Block Diagram of Basic Encoder Procedures in H.264

The upper part of the figure (yellow) shows the encoding process. The frame being encoded, in YCbCr color space format, is divided into macroblocks, i.e. chunks of 16x16 luminance (Y) pixels and their corresponding chrominance (Cb, Cr) pixels, whose size varies according to the subsampling method (8x8 if 4:2:0, 8x16 if 4:2:2 or 16x16 if 4:4:4). The luminance and chrominance channels are processed separately using the same techniques.


In the prediction phase the encoder builds a macroblock using previously encoded data, either from the current frame (spatial or intra prediction) or from other frames (temporal or inter prediction). Intra prediction can be carried out for the whole macroblock in 4 modes, or dividing it into 8x8 or 4x4 blocks in 9 modes. Each mode, related to a spatial direction, is just an extrapolation computed as averages of the neighboring pixels. Inter prediction is more involved, since it tries to find a description of the macroblock by estimating its motion with respect to similar regions of previous frames. The search is performed on a variable number of reference frames, in several rectangular section sizes and with increased pixel accuracy to allow for sub-pixel motion. Finally, the predicted macroblock can be expressed in terms of motion vectors from regions of the reference frames.

Once a macroblock is predicted, the encoder subtracts the predicted pixel values from the input macroblock to form a residual. The prediction methods supported by H.264 make it possible to predict the input macroblock accurately, which results in outstanding video compression because, being a differential encoder, the residual values are small and very often null. Unfortunately, the computational complexity is very high, which poses a severe problem for a real-time implementation.

The residual data is transformed using an approximate 4x4 or 8x8 2D DCT (Discrete Cosine Transform). This transform has the particular feature of compacting the energy of the input into the low-frequency coefficients, so the transformed residual usually contains a few non-zero values close to the matrix's upper left corner. H.264 does not use standard DCTs but modified versions whose kernels consist solely of integer numbers, eliminating the need for floating-point calculations.
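
As a concrete illustration, the sketch below applies H.264's well-known 4x4 forward integer transform (core matrix rows {1,1,1,1}, {2,1,-1,-2}, {1,-1,-1,1}, {1,-2,2,-1}) to a block of residuals. It is a minimal reference version written for clarity, not the optimized routine used in the encoder:

```c
#include <stdint.h>

/* Minimal reference implementation of the H.264 4x4 forward integer
 * transform Y = Cf * X * Cf^T, computed here with plain matrix products.
 * Production encoders use the equivalent butterfly decomposition. */
static const int Cf[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 },
};

void forward_transform_4x4(const int16_t x[4][4], int32_t y[4][4])
{
    int32_t t[4][4];

    /* t = Cf * x */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            t[i][j] = 0;
            for (int k = 0; k < 4; k++)
                t[i][j] += Cf[i][k] * x[k][j];
        }

    /* y = t * Cf^T */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            y[i][j] = 0;
            for (int k = 0; k < 4; k++)
                y[i][j] += t[i][k] * Cf[j][k];
        }
}
```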

The modified DCT output should be quantized by dividing the coefficients by an integer, but the derivation of the transform left a scaling factor pending. Both numbers are combined to avoid division and, incidentally, to accommodate the quality parameter QP. The overall computation reduces the precision of the coefficients according to the desired quality governed by the value of QP: the larger the value, the poorer the quality but the higher the compression, and vice versa. The rate-distortion control procedure can update the value of QP for each frame or macroblock in order to balance the opposing goals of high quality and low bit rate. Usually the rate-distortion control procedure aims at a maximum or constant bit rate, but it is also possible to encode aiming at constant quality, in which case this procedure does nothing.

The quantized DCT coefficients are scanned in zigzag to sort them according to increasing spatial frequency; then they are converted into binary codes along with other signaling information (macroblock partitioning, prediction modes, motion vectors…) and inserted into the output bitstream. The binary codes are computed by the entropy coder procedure using variable-length or arithmetic coding, both adapted to the context to further reduce the number of bits. Hence their names: context-adaptive variable length coding (CAVLC) and context-adaptive binary arithmetic coding (CABAC).
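
A minimal sketch of the 4x4 zigzag reordering; the scan-order table is the standard H.264 one, expressed as row-major indices into the block:

```c
#include <stdint.h>

/* Standard H.264 zigzag scan order for a 4x4 block (frame coding),
 * expressed as row-major indices into the block. */
static const int zigzag4x4[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Reorder the 16 quantized coefficients of a block for entropy coding. */
void zigzag_scan_4x4(const int16_t block[16], int16_t scanned[16])
{
    for (int i = 0; i < 16; i++)
        scanned[i] = block[zigzag4x4[i]];
}
```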

The quantization output also feeds the decoding process (the purple box in figure 3). This process is needed at the encoder because it reconstructs the macroblocks and frames to be used for prediction. The first procedure is the inverse quantization; actually it is a rescaling, as quantization cannot be inverted. Afterwards the coefficients are inverse transformed to form a residual, which is added to the prediction to get the reconstructed macroblock. If using inter prediction, the result can optionally be filtered to reduce blocking distortion by smoothing the macroblock edges. Note that the reconstructed macroblock will typically differ from the original due to the loss of precision caused by quantization. Ultimately, H.264 is a lossy compressor.

#### *3.1.2. Implementation tradeoffs*


In order to fulfill the requirements of real time, low latency and high quality, the encoder implements only a subset of the features and techniques available in the standard. This selection does not prevent the encoder from complying with the standard, because H.264 allows a high degree of flexibility in the techniques used. Specifically, the implemented encoder has undergone the two following main tradeoffs:

**•** Intra-only prediction, i.e. all predicted pixels are computed using only the current frame; otherwise, the very low latency requirement couldn't be achieved.

**•** CAVLC entropy coding; the alternative method, CABAC, is much more efficient from the standpoint of the bit rate, but it cannot be parallelized due to its recursive nature.

These tradeoffs, and some others discussed below, adversely affect the compression ratio of the encoder, resulting in an increased bit rate. Fortunately, neither the best compression ratio nor a constant bit rate are requirements of the implementation. These goals are distinct from those of consumer applications, for which there are plenty of solutions, but they are essential in many professional applications: remote monitoring, remote assistance, content generation, broadcast, video surveillance. The H.264 standard dedicates some specific profiles to this kind of applications (High, High10 and High10 Intra), which in some cases the industry has adopted, such as Panasonic's AVC-Intra.

Undoubtedly, temporal prediction improves the bit rate, but not for free: in order to obtain high quality a large number of reference frames is needed; scene changes can have devastating effects, especially if a constant bit rate is desired; and latency, measured end to end, i.e. camera to monitor, increases linearly with the number of reference frames and can reach 1 second for decoding alone.

In our implementation, the encoding of a frame begins as soon as the first 16 lines of pixels are available, and once a row is encoded it is sent, so that decoding can start even before the whole frame is encoded. Such an extremely low latency is only possible using spatial prediction. Additional advantages of intra-only prediction are: 1) ease of frame-by-frame video editing, 2) resilience against transmission errors, since an error affects only one frame, and 3) a significant saving of memory, which is especially important for very large resolutions.

#### *3.1.3. Amdahl's law*

The very first step in parallelizing an application is to determine whether it is worth it. Regardless of cost, running platform, software architecture and any other constraint, a parallel application can run only a limited amount faster than its sequential version. Amdahl's law states that if *P* is the fraction of code that can be made parallel and *S* = (1 - *P*) is the fraction that is not parallelizable, then the maximum speedup that can be achieved by using *N* processors is

$$\mathit{speedup}(N) = \frac{1}{S + \frac{P}{N}} \tag{1}$$



The results of applying Amdahl's law to a given problem are just a rough approximation to reality, but they serve to estimate the maximum parallel performance and to focus attention on potential bottlenecks and hot spots in the algorithm under development. The essential starting point in parallelization is therefore an optimized sequential code, from which the value of *S* can be determined.

The available literature dealing with H.264 encoding focuses on algorithmic description or performance improvement but usually fails to emphasize the inherently non-parallelizable part: the composition of the bitstream. Once the input raw video is encoded, the resulting data must be packaged into NAL (Network Abstraction Layer) units, which are byte-aligned structures with header and trailing data. The data, known as RBSP (Raw Byte Sequence Payload), is written into the NAL units using a strict syntax in which the macroblock raster order must be preserved. The number of bits generated depends on the image, making it impossible to compose the NAL units without complying with that order. Furthermore, Annex B of the standard states that RBSP data must be checked against byte patterns that can confuse the framing alignment while decoding. Those patterns must be disambiguated by byte-stuffing the RBSP, i.e. inserting a fixed 0x03 byte each time they occur. Again, this procedure is neither predictable nor parallelizable.
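
The byte-stuffing rule itself is simple to state in code. The sketch below (function and variable names are ours) inserts the 0x03 emulation-prevention byte whenever two zero bytes would otherwise be followed by a byte of value 0x03 or less:

```c
#include <stddef.h>
#include <stdint.h>

/* Copy RBSP data into a NAL unit payload, inserting the Annex B
 * emulation-prevention byte 0x03 whenever the pattern
 * 0x00 0x00 0x0[0-3] would otherwise appear. Returns the number of
 * bytes written to 'out' (the caller must size 'out' generously). */
size_t rbsp_to_nal(const uint8_t *rbsp, size_t len, uint8_t *out)
{
    size_t n = 0, zeros = 0;

    for (size_t i = 0; i < len; i++) {
        if (zeros == 2 && rbsp[i] <= 0x03) {
            out[n++] = 0x03;   /* emulation-prevention byte */
            zeros = 0;
        }
        out[n++] = rbsp[i];
        zeros = (rbsp[i] == 0x00) ? zeros + 1 : 0;
    }
    return n;
}
```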

The execution of the optimized sequential code on a Linux box equipped with an Intel Core2 Duo (T7700) CPU @ 2.40 GHz reveals that the fraction of time spent handling the composition of NAL units is 0.45%, yielding *S* = 0.0045. The encoded frames per second (fps) for a 4096x1716 video are 0.75. These figures have been obtained without taking the input or output into account, in order to measure accurately the time spent in the algorithm. Solving equation (1) for *N* = 60 processors results in a speedup of 47.4x which, applied to the throughput, gives 35.56 fps, enough for the digital cinema format (24 fps). A similar run on the Tilera platform @ 866 MHz yields *S* = 0.015 (1.5%) and a throughput of 0.67 fps. Solving again for *N* = 60 processors results in a speedup of 31.8x and a final throughput of 21.33 fps, less than the requirement for the above-mentioned format. The different values of *S* are mainly due to the unequal facilities fitted in the CPUs for handling bytes and bit fields. This analysis indicates that the NAL unit management is clearly a hot spot in the code that could ruin the overall performance. Obviously, an optimization is needed on the Tilera side to fulfill the goal.
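
The figures above can be reproduced by evaluating equation (1) directly; a short self-contained check:

```c
#include <stdio.h>

/* Evaluate Amdahl's law, equation (1): speedup(N) = 1 / (S + P/N). */
static double speedup(double S, int N)
{
    return 1.0 / (S + (1.0 - S) / N);
}

int main(void)
{
    /* Intel Core2 Duo run: S = 0.0045, 0.75 fps sequential. */
    printf("x86:    %.1fx -> %.2f fps\n",
           speedup(0.0045, 60), 0.75 * speedup(0.0045, 60));
    /* TILEPro64 run: S = 0.015, 0.67 fps sequential. */
    printf("Tilera: %.1fx -> %.2f fps\n",
           speedup(0.015, 60), 0.67 * speedup(0.015, 60));
    return 0;
}
```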

#### *3.1.4. Data dependencies*


No parallel program can be built without knowing the data dependencies that the algorithm imposes. As previously stated, the basic procedures of the encoder are pixel prediction and entropy encoding of the residuals; in our case, intra prediction and CAVLC. It is clear that the second procedure must follow the first one, since it is not feasible to encode any data without having calculated it. Aside from this obvious fact, an analysis of the H.264 encoder algorithm from the data flow standpoint shows:

**•** The input image is partitioned for processing into so-called macroblocks, square chunks of 16x16 pixels.

**•** The macroblocks are processed in raster scan order, i.e. from left to right and from top to bottom.

**•** Each macroblock is predicted using some data from previously encoded macroblocks, specifically the boundary pixels of the upper, upper-right and left macroblocks. The only exception to this rule is when the neighbouring macroblocks are not available; for example, the first macroblock of an image does not use any additional information because its neighbours do not exist.

**•** In order to compute the entropy encoding, each macroblock needs a quantization parameter, QP. The standard makes no provision on how a macroblock selects this parameter, but usually this job is entrusted to the block labelled "Rate Distortion Control" in figure 3, because it affects the number of bits generated in the entropy encoder and ultimately the bit rate of the whole encoder. The quantization parameter can be seen as a quality parameter: the lower the value, the better the quality but also the higher the bit rate. Our implementation allows selecting between constant quality (fixed QP) and constant bit rate (adaptive QP), but for ease of parallelizing the latter option is applied on a frame-by-frame basis.


In summary, the data dependencies at this algorithm level are the boundary pixels from neighboring macroblocks and a frame-constant quantization parameter.
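
These dependencies translate into a simple availability test. The sketch below (types and names are ours, for illustration) tells which neighbors a macroblock at row r, column c of an R x C macroblock grid may read:

```c
#include <stdbool.h>

/* Which previously encoded neighbors a macroblock at (r, c) can use,
 * in a frame of R x C macroblocks processed in raster scan order. */
typedef struct {
    bool left;        /* (r, c-1)   */
    bool up;          /* (r-1, c)   */
    bool up_right;    /* (r-1, c+1) */
} mb_neighbors_t;

mb_neighbors_t mb_neighbors(int r, int c, int R, int C)
{
    (void)R; /* the row count does not limit upward neighbors */
    mb_neighbors_t n;
    n.left     = (c > 0);
    n.up       = (r > 0);
    n.up_right = (r > 0) && (c + 1 < C);
    return n;
}
```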

#### *3.1.5. Tasks*

After analyzing the extent of parallelization and the data dependencies, it is time to analyze the tasks that make up the algorithm. Here we mean by task not the usual computing term but any procedure of the algorithm that could be accomplished in parallel.

The core encoding algorithm, assuming the above-mentioned tradeoffs, can be described as a kind of streaming with frames as elements and macroblocks as the units of computation. At the system level there must be a single task that implements the RTSP service, waiting for a client connection and then delivering the RTP packets. At the frame level the following tasks can be identified:

**•** Read the raw input pixels of the frame.

**•** Compute the Rate-Distortion procedure (usually known as RDO, with the O meaning Optimization). The result is the quantization parameter to be applied to the frame.

**•** Open and initialize a Network Abstraction Layer (NAL) unit.

**•** Encode the frame in macroblock chunks in raster scan order.

**•** Write the encoded data to the NAL unit.

**•** Close the NAL unit.

**•** Deliver the NAL unit.

**•** Update RDO with the frame information.

At the macroblock level, the task list is as follows:

**•** Get the boundary pixels from neighbouring macroblocks.

**•** Read the macroblock raw input pixels.

**•** Select the best prediction.

**•** Compute the residual error.

**•** Transform and quantize the residual error.

**•** Inverse transform and quantize the transformed data.

**•** Encode the residual transformed data using CAVLC.


Note that some tasks at macroblock level can be interspersed with those at frame level; e.g. once a macroblock is CAVLC-encoded, the resulting data can be written to the NAL unit, so there is no need to collect the data from all macroblocks and write it afterwards. Rearranging the task order and intermixing at the described or even finer levels broadens the parallelization options, as long as the data dependencies are met.

Two potential hot spots can be found at the input and output of data. A digital cinema camera with 4:2:0 chroma sampling produces 10,543,104 bytes of raw video data per frame, totaling more than 240 Mbytes/s. If we assume a compression ratio of 10, the total output bit rate will exceed 24 Mbytes/s. These figures are not unmanageable, but they indicate that the input and output procedures should be treated with special care and, as far as possible, run overlapped with the rest of the tasks in the algorithm.

Another hot spot concerns prediction. The luminance part of each macroblock can be predicted in three pixel sizes: 16x16 (the full macroblock), four 8x8 blocks or sixteen 4x4 blocks; the chrominance is always predicted in full-size blocks (8x8 if using the 4:2:0 chroma format) for each component. Each block is explored in several modes related to different spatial directions: four modes for 16x16 luminance and chrominance, and nine modes for the rest. Iterating all the luminance prediction modes for each possible chrominance prediction mode yields a total search space of 736 combinations, each with its associated metric. The standard says nothing about how to compute these metrics and, therefore, how to select the best prediction mode for each block. There are two main approaches to assess this measure: 1) in the spatial domain, calculating the cumulative sum of absolute differences between actual and predicted pixels (SAD); and 2) calculating the same sum but using the data in the DCT-transformed domain (SATD). The latter provides, in general, better results, but the quality or bit rate difference is not significant when the video resolution is high. By means of a test suite we have determined that using SAD instead of SATD on high-definition (HD) and above formats increases the bit rate by only 1% whereas the computational load is 30% lower. Needless to say, the approach chosen is SAD. Luckily, it also allows taking advantage of some of the more specific and powerful instructions of the TILE*Pro*64™ processor: the "sum of absolute differences" SIMD group.

In any case, these computations are very time-consuming; note that the prediction of 4x4 and 8x8 blocks requires the reconstructed neighboring blocks, since this is the information that will be available at the decoder. Therefore, once a mode is selected as best for a given block, the block must be reconstructed emulating the decoder procedure so that its neighbors can use its boundaries. This circumstance has promoted a lot of research over the last years [11] aimed at shrinking the search space, providing fast methods to "predict" the best predictor from a substantially reduced set of modes without compromising the bit rate too much. We have chosen for our implementation a simple yet effective fast mode decision algorithm called "Selective Intra Prediction" [12]. The key idea of this algorithm stems from the fact that the dominating direction of a bigger block is similar to that of a smaller block, and therefore it is feasible to avoid the computation of the unlikely modes after the determination of the best 16x16 mode. The algorithm has been combined with the usual early-termination technique, but in spite of this the fraction of time dedicated to the selection of the predictor exceeds 60% after manually optimizing the code.
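
For reference, the SAD metric itself is straightforward; the sketch below is a plain C version for one 16x16 luminance block. On the TILE*Pro*64™ the inner loop would instead map onto the SAD SIMD instruction group mentioned above:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 source block and its
 * prediction; 'stride' is the distance in bytes between pixel rows.
 * Plain C reference: the TILEPro64 SAD SIMD instructions process
 * several pixels per operation. */
unsigned sad_16x16(const uint8_t *src, const uint8_t *pred, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += (unsigned)abs(src[x] - pred[x]);
        src  += stride;
        pred += stride;
    }
    return sad;
}
```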

#### **3.2. Task assignment**


#### *3.2.1. Parallel pattern*

Previous sections have explored the opportunities for parallelism, highlighting the hot spots of the encoder. Now it is time to choose the most appropriate type of parallelism and to organize the tasks logically.

The best parallelization pattern for achieving high throughput is the pipeline; if, in addition, its number of stages is not large, latency can be low enough. However, a video encoder is not a good candidate for pipelining because, among other considerations, the computational burden of the tasks is very dissimilar, the flow of control is not regular as it depends on the data, data must be shared or copied and, specifically for the TILE processor, the number of stages would have to be no less than 60.

If we reject the pipeline approach, the remaining choices to consider are multiprocessing, multithreading or a mix of both. The main difference concerns the virtual memory space: a process has its own non-shared virtual space, while a thread shares it with all other threads. Multithreading demands more elaborate synchronization among threads but facilitates inter-thread communication, because it is accomplished simply by sharing data in memory. Furthermore, the TILE processor implements inter- and intra-core cache coherence techniques that relieve the user of worrying about correctness of data. Based on these considerations, the multithreading approach was chosen for the encoder.


#### *3.2.2. Data decomposition*

Another issue has to do with the decomposition of data, i.e. how to partition and distribute the data space among the cores. The encoding problem does not have a recursive nature, so a geometric decomposition, ideally suited to the data dependencies of the algorithm, is clearly preferable. Macroblock decomposition of the data is therefore the right choice. Additionally, this decomposition enables the use of a single-program multiple-data (SPMD) model that eases programming by parametrizing the input to the code.

#### *3.2.3. Core processing threads*

So far we have decided to use threads to process macroblocks; the question now is how to organize those threads. A digital cinema video frame is composed of 27456 macroblocks; if a single thread were responsible for a single macroblock, we would need the same number of threads. Even spread over 60 cores, that number far exceeds the Linux threading facilities. A better choice is to partition the data not into macroblocks but into rows of macroblocks; this yields just 108 threads, a much more manageable figure that does not compromise the SPMD model.

Such a thread assignment can still be improved. Programs typically let threads die once they have finished their work, but thread creation and termination have some overhead that can be avoided by recycling them. This is a simple technique, often seen in digital signal processing, in which a thread created at startup runs forever until explicitly killed. For this technique to be useful it requires two synchronization points: the first to ensure that data is available, the second to signal that the work is finished. The exchange of thread management overhead for synchronization is worthwhile.

A further improvement arises if we avoid thread scheduling and time sharing on any single core, as this eliminates the operating system kernel overhead devoted to task switching. In such a scenario each thread dynamically selects the next row to be processed as soon as it has finished with the current one. This technique, known as thread pooling, is especially well suited to the TILE architecture, since the pool can be spread among the cores, each one running a single thread. The MDE has a provision for exploiting this setting: the so-called dataplanes, in which the standard Linux kernel is substituted by a zero-overhead kernel. A nice result of using a thread pool is fair load balancing, which is not always easy to get.
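
A minimal sketch of such a pool with POSIX threads and semaphores (encode_row and the worker count are placeholders; the real implementation additionally runs on dataplane tiles and staggers row starts as described below). Each worker is created once at startup and then recycled through the two synchronization points:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

#define NUM_WORKERS 60          /* illustrative: one thread per encoding core */

static sem_t rows_ready;        /* sync point 1: a row is ready to be encoded */
static sem_t rows_done;         /* sync point 2: a row has been encoded       */
static atomic_int next_row;     /* next pending row index                     */

/* Placeholder for the real row-encoding work. */
static void encode_row(int row) { (void)row; }

/* Worker created once at startup and recycled forever. */
static void *row_worker(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&rows_ready);                    /* wait for available data */
        int row = atomic_fetch_add(&next_row, 1); /* claim the next row      */
        encode_row(row);
        sem_post(&rows_done);                     /* signal completion       */
    }
    return NULL;
}

/* Create the pool once. */
void pool_start(void)
{
    sem_init(&rows_ready, 0, 0);
    sem_init(&rows_done, 0, 0);
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_t t;
        pthread_create(&t, NULL, row_worker, NULL);
    }
}

/* Dispatch the R rows of one frame and wait for all of them. */
void pool_encode_frame(int R)
{
    atomic_store(&next_row, 0);
    for (int i = 0; i < R; i++) sem_post(&rows_ready);
    for (int i = 0; i < R; i++) sem_wait(&rows_done);
}
```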

It remains to determine the scope of the row processing, i.e. which algorithm tasks the row threads perform. Referring to the above list of macroblock-level tasks, it is worthwhile for the row threads to be in charge of reading all the row input pixels, selecting the predictor, and so on, including the entropy encoding; this last task would be impossible if CABAC were used instead of CAVLC because CABAC, being recursive, needs data from the last macroblock of the previous row. In that case the row threads could not proceed in parallel or, if they did, they would have to store all the information of the macroblocks and apply the entropy encoding afterwards, which would represent a severe waste of memory and an unmanageable hot spot. However, the CAVLC encoding so partitioned also has a drawback: since the bit alignment of the row data in the NAL unit is unknown, the whole unit must be realigned. Even so, performing the entropy encoding in parallel is advantageous.
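
The realignment amounts to re-appending each row's bit buffer at an arbitrary bit offset. A minimal bit-by-bit sketch, assuming MSB-first bit order (the real framer, described below, does this word-wise in hand-optimized assembler):

```c
#include <stddef.h>
#include <stdint.h>

/* Append 'nbits' bits from 'src' (MSB-first, starting at bit 0) to a
 * destination bitstream that currently holds 'dst_bits' bits. This is
 * the kind of realignment needed when independently CAVLC-encoded rows,
 * whose lengths are not byte multiples, are concatenated into one NAL
 * unit. Bit-by-bit for clarity; the real framer works word-wise. */
void append_bits(uint8_t *dst, size_t dst_bits,
                 const uint8_t *src, size_t nbits)
{
    for (size_t i = 0; i < nbits; i++) {
        size_t d = dst_bits + i;
        int bit = (src[i / 8] >> (7 - (i % 8))) & 1;
        if (bit)
            dst[d / 8] |= (uint8_t)(0x80 >> (d % 8));
        else
            dst[d / 8] &= (uint8_t)~(0x80 >> (d % 8));
    }
}
```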


The preceding paragraphs have focused on analyzing the processing of macroblock rows, but we have said nothing about the issue described in Section 3.1.3 concerning Amdahl's law: the realignment and byte stuffing of CAVLC-encoded data are limiting factors of parallel performance that may saturate the speedup. The implementation dedicates a core, known as the framer, running a single thread of manually optimized assembler code to address this problem.

Figure 4 shows a simplified time line of 12 row threads working in parallel. It can be seen that the whole process resembles a macroblock pipeline, although technically speaking it is not.

**Figure 4.** Time Line of Row Processing

Some details are worth describing:


**•** All time intervals are sketched alike; actually, times depend on the input data.

**•** Each row, except the first, must start with a delay of at least twice the handling time of a single macroblock, to ensure that the boundary pixels are available.

**•** The total time spent on any frame is much greater than the time required for packetizing its encoded data into a NAL unit, as expected for a pipelined structure.

**•** The video input is not sorted in a natural way; e.g. row 0 of frame N (the green one) needs data before row 11 of frame N–1 (the blue one). The input procedure must take this fact into account and allow for non-ordered access to the data.

**•** At any moment in time multiple frames may be being processed simultaneously; the worst case arises at the frame boundaries (T0 in the figure). It is not difficult to prove that, if we assume a constant (say, mean) macroblock processing time, the number of simultaneous frames can be as great as:

$$n = \left\lceil \frac{2(C - 1) + R}{R} \right\rceil \tag{2}$$

where *C* is the number of macroblocks in a row and *R* the number of rows; e.g. for the 4096x1716 digital cinema format (*C* = 256, *R* = 108), equation (2) gives *n* = ⌈618/108⌉ = 6 simultaneous frames. Furthermore, the row threads could start working on frame N+1 before the framing thread has had the opportunity to evict the row data of frame N. A simple *n*-buffer strategy at the row thread output is enough to solve this trouble.

**•** The framing proceeds in bursts at the beginning of a frame, because encoded data is available, but afterwards it must wait for the rows to terminate. This drawback puts the framer thread even more under pressure, as there may be time intervals during which it cannot perform any work.

On the other hand, the input driver is programmed as a server that handles the necessary buffering for extracting and reordering the camera data, and delivers it at the pace enforced by the row threads' requests.


#### *3.2.4. Auxiliary threads*

There are two tasks that still remain to be assigned to threads: the RTSP service and the RDO. The RTSP service can advantageously be implemented using two threads: the first devoted to the service itself, the second in charge of the subsidiary real-time control protocol (RTCP). Neither of these threads requires a great amount of CPU resources, but the logical division facilitates software coding.

With regard to RDO, it could be as complex as desired in order to obtain accurate estimates of the bit rate and so select the best quantization parameter QP. But we have said repeatedly that optimizing the bit rate is not a priority of our implementation, and a simple and fast PID (proportional-integral-derivative) controller algorithm is enough for our purposes. The only remarkable point is that the adjustment of the algorithm parameters must take into account that the input data is delayed due to the pipelined behavior of the processing. This RDO computation could be performed by the framer thread, but we preferred to do it in a separate thread so that the system is more flexible in case of need.
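
A minimal sketch of such a PID controller driving QP (structure, names and gains are illustrative; as noted above, the gains must be tuned with the pipeline delay in mind):

```c
/* Illustrative PID controller driving QP towards a target bit rate.
 * The error is the deviation of the measured frame size from the
 * target; a positive error (frame too big) raises QP, lowering the
 * bit rate at the cost of quality. Gains kp, ki, kd are tuning values. */
typedef struct {
    double kp, ki, kd;     /* controller gains (to be tuned)      */
    double integral;       /* accumulated error                   */
    double prev_error;     /* previous error, for the derivative  */
} qp_pid_t;

int qp_update(qp_pid_t *pid, double target_bits, double actual_bits,
              int qp_min, int qp_max, int qp)
{
    double error = actual_bits - target_bits;

    pid->integral += error;
    double derivative = error - pid->prev_error;
    pid->prev_error = error;

    double delta = pid->kp * error
                 + pid->ki * pid->integral
                 + pid->kd * derivative;

    qp += (int)delta;
    if (qp < qp_min) qp = qp_min;      /* clamp to the valid QP range */
    if (qp > qp_max) qp = qp_max;
    return qp;
}
```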

#### **3.3. Orchestration**

The aim of the orchestration phase is to design the mechanisms that ensure proper synchronization among threads, i.e. that all control and data dependencies are met. Section 3.1.4 dealt with the data dependencies; the control dependencies arise from the assignment of tasks to threads, specifically from exploiting the always-alive thread pattern.

The synchronization primitives used are those provided by the POSIX 1b and 1c extensions, available in any Linux box and in Tilera's software stack. Using these primitives is easier than programming custom ones; their performance is not always the best, but having selected a coarse data grain for the implementation, their impact is very limited.

In essence, we use semaphores for synchronizing threads and read-write locks to protect the macroblock boundary data. The use of the latter instead of the usual mutexes allows a high‐ er degree of parallelism as it does not blocks any reading thread if the writer has not ac‐ quired the lock. Bearing in mind that we have designed the assignment of tasks to threads so that there is only one thread on each core, the read-write locks can have spin flavor to

So far we have always used 60 as the number of cores dedicated to encoding but the Tilera processor has 64; let see why. On the one hand, the framer thread, being the major potential bottleneck of the system, claims a core for itself; on the other hand, for the input and output hypervisor drivers to do their work overlapped with the algorithmic computation they each need a dedicated core. The remaining core hosts all the other auxiliary threads: RTSP server,

The last issue to be addressed is how to distribute the threads, i.e. to determine in which physical cores they will run. The best way to make the distribution is keeping as far as possi‐ ble the data locality since the latency for accessing an adjacent core's cache memory is much cheaper than accessing any other core's cache. This arrangement is easily attained for the row processing cores; unfortunately, there is no way for the framer core to be adjacent to all row processing cores. The selected distribution is shown in figure 5, in which row cores are shown in blue with the closed adjacency path in light blue. The framer core is shown in dark blue, input and output cores in orange and the auxiliary core in green. The position of input and output deserves a comment; they are physically very close to their corresponding hard‐

This distribution scales almost linearly for any video resolution with at least 60 rows of mac‐ roblocks, i.e. 960 pixels high, including high-definition (1080 pixels, 68 rows) and above. With lower resolutions the row cores will not all be active so there will be a degradation of

to threads, specifically to take advantage of the always live thread pattern.

avoid putting the threads to sleep while waiting for the lock.

RTCP, RDO and the C main thread that just waits for the program to exit.

by the row threads requests.

**3.3. Orchestration**

**3.4. Distribution**

ware as Tilera recommends.

performance.

**Figure 4.** Time Line of Row Processing

#### *3.2.4. Auxiliary threads*

There are two tasks that still remain to be assigned to threads: the RTSP service and the RDO. The RTSP service can advantageously be implemented using two threads: the first devoted to the service itself and the second in charge of the subsidiary Real-Time Control Protocol (RTCP). Neither of these threads requires a great amount of CPU resources, but the logical division facilitates software coding.

With regard to RDO, it could be as complex as desired in order to obtain accurate estimates of the bit rate and so select the best quantization parameter QP. But, as we have said repeatedly, optimizing the bit rate is not a priority of our implementation, and a simple and fast PID (proportional-integral-derivative) controller algorithm is enough for our purposes. The only remarkable point is that the algorithm parameters should be adjusted taking into account that the input data are delayed due to the pipelined behavior of the processing. This RDO computation could be performed by the framer thread, but we preferred to run it in a separate thread so that the system is more flexible in case of need. A sketch of such a controller follows.
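The chapter does not list its controller code; the following is a minimal sketch of this kind of QP adaptation, with hypothetical gains and a hypothetical bits-per-frame setpoint:

```c
/* Hypothetical PID rate controller mapping the bit-budget error to a QP step. */
typedef struct {
    double kp, ki, kd;    /* PID gains, tuned empirically                      */
    double integral;      /* accumulated error                                 */
    double prev_error;    /* error of the previous (delayed) measurement       */
} rdo_pid;

/* target and actual are bits per frame. The measurement arrives several
 * frames late because of the pipelined row processing, so the gains must be
 * tuned with that delay in mind. Returns the QP to use for the next frame. */
static int rdo_update(rdo_pid *c, double target, double actual, int qp)
{
    double error = actual - target;      /* positive: spending too many bits  */
    c->integral += error;
    double derivative = error - c->prev_error;
    c->prev_error = error;

    double step = c->kp * error + c->ki * c->integral + c->kd * derivative;
    qp += (int)step;                     /* a larger QP produces fewer bits   */
    if (qp < 0)  qp = 0;                 /* clamp to the H.264 QP range       */
    if (qp > 51) qp = 51;
    return qp;
}
```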

#### *3.2.5. Input and output*

The best solution for input and output is to run their functionality in two separate hypervisor drivers. Doing so, all I/O data can flow over the I/O Dynamic Network (IDN) that connects all tiles with the on-chip devices, alleviating the burden of memory sharing at the user level. In addition, this scheme obviates the intermediate level of buffering needed between the program and the Linux kernel drivers.

The output driver is just a packet based service tailored to the handling of the RTP payload over IP. A notable optimization feature is that the driver uses only fixed size buffers that fit into the Ethernet jumbo packets with two objectives: to reduce overhead and to avoid IP fragmentation.

On the other hand, the input driver is programmed as a server that handles the necessary buffering for extracting and reordering the camera data and delivers it at the pace enforced by the row threads requests.

#### **3.3. Orchestration**

The aim of the orchestration phase is to design the mechanisms that ensure proper synchronization among threads, i.e. that all control and data dependencies are met. Section 3.1.3 dealt with data dependencies; the control dependencies arise from the assignment of tasks to threads, specifically from taking advantage of the always-alive thread pattern.

The synchronization primitives used are those provided by the POSIX 1b and 1c extensions, available in any Linux box and in Tilera's software stack. Using these primitives is easier than programming custom ones, although their performance is not always the best; but having selected a coarse data grain for the implementation, their impact is very limited.

In essence, we use semaphores for synchronizing threads and read-write locks to protect the macroblock boundary data. Using the latter instead of the usual mutexes allows a higher degree of parallelism, as a reading thread is not blocked unless the writer has acquired the lock. Bearing in mind that we have designed the assignment of tasks to threads so that there is only one thread on each core, the read-write locks can have a spin flavor to avoid putting the threads to sleep while waiting for the lock. The sketch below illustrates the pattern.
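A minimal sketch of this arrangement with the POSIX primitives named above; the boundary structure and sizes are illustrative, not the encoder's actual layout:

```c
#include <pthread.h>
#include <string.h>

/* Boundary data shared between adjacent row threads. Initialize the lock
 * with pthread_rwlock_init() before use. */
typedef struct {
    pthread_rwlock_t lock;
    unsigned char edge_pixels[1920];   /* bottom edge of a macroblock row    */
} row_boundary;

/* Writer: a row thread publishing its bottom edge for the row below. */
static void publish_boundary(row_boundary *b, const unsigned char *src, size_t n)
{
    pthread_rwlock_wrlock(&b->lock);
    memcpy(b->edge_pixels, src, n);
    pthread_rwlock_unlock(&b->lock);
}

/* Readers: neighbouring rows never block each other, only a writer that
 * already holds the lock -- the reason rwlocks were preferred to mutexes. */
static void read_boundary(row_boundary *b, unsigned char *dst, size_t n)
{
    pthread_rwlock_rdlock(&b->lock);
    memcpy(dst, b->edge_pixels, n);
    pthread_rwlock_unlock(&b->lock);
}
```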

#### **3.4. Distribution**



So far we have always used 60 as the number of cores dedicated to encoding, but the Tilera processor has 64; let us see why. On the one hand, the framer thread, being the major potential bottleneck of the system, claims a core for itself; on the other hand, for the input and output hypervisor drivers to do their work overlapped with the algorithmic computation, they each need a dedicated core. The remaining core hosts all the other auxiliary threads: the RTSP server, RTCP, RDO and the C main thread, which just waits for the program to exit.

The last issue to be addressed is how to distribute the threads, i.e. to determine on which physical cores they will run. The best way to make the distribution is to preserve data locality as far as possible, since the latency for accessing an adjacent core's cache memory is much lower than for accessing any other core's cache. This arrangement is easily attained for the row processing cores; unfortunately, there is no way for the framer core to be adjacent to all row processing cores. The selected distribution is shown in Figure 5, in which row cores are shown in blue with the closed adjacency path in light blue. The framer core is shown in dark blue, input and output cores in orange and the auxiliary core in green. The position of input and output deserves a comment: they are physically very close to their corresponding hardware, as Tilera recommends. A sketch of how a thread can be pinned to a chosen core follows.
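On Linux the pinning itself can be expressed with the CPU-affinity API, as in the following sketch (Tilera's MDE also provides its own placement facilities, which we do not reproduce here):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one physical core. Each row thread would call
 * this at startup with its assigned core from the Figure 5 layout. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```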

This distribution scales almost linearly for any video resolution with at least 60 rows of macroblocks, i.e. 960 pixels high, including high definition (1080 pixels, 68 rows) and above. With lower resolutions not all the row cores will be active, so there will be a degradation of performance.


**Figure 5.** Distribution of Threads into Processor Cores

#### **4. Results**

In order to evaluate the results, the freely distributable test video sequence Park-Joy has been encoded at different sizes using a constant quantization parameter QP = 18. This value allows an encoding without noticeable visual degradation. Park-Joy contains small figures of running people; sometimes large objects - unfocused trees near the camera - move to the left as a result of a strictly horizontal camera movement, overlapping the entire scene. At the end of the sequence the camera movement slows down.

In sequential runs the mean time spent encoding a macroblock is 40.27 μs on Linux and 79.54 μs on the TILE processor at 866 MHz. The differences are due to the operating clock and the architectural dissimilarities. It is easy to see that the optimizations undertaken on the Tilera side have been successful, since the clock speed is lower by a factor of 2.77 while the time ratio is only 1.98. Note that the Linux code has been optimized only at the C level, and thus does not use the SIMD instructions provided by the MMX or SSE instruction extensions.

The same run on Tilera's simulator in functional mode, in which the cache hazards are not fully considered, yields 57.07 μs. It is apparent that the TILE core cache memory is not large enough to hold all code and data, and thus incurs a high rate of capacity misses.


In parallel runs the cache problem becomes more evident, as the following table shows:

| Image Size | Simulator | Hardware |
|---|---|---|
| 1280x720 | 239.82 fps | 181.65 fps |
| 1920x1080 | 155.98 fps | 99.44 fps |
| 3840x2160 | 37.37 fps | 25.66 fps |

**Table 2.** Encoded Frames per Second

It is worth mentioning that performance improves by around 32.5% on average by avoiding the 8x8 block encoding of luma. This figure adds a little more spice to the controversy over the inclusion of this technique in the standard.

The following graph shows the throughput measured as time per macroblock (blue) and number of macroblocks per second (green) versus resolution.

**Figure 6.** Throughput


It can be seen that the throughput degrades abruptly when the number of row cores used is less than the number available; the 1280x720 resolution uses 45 cores, while the others use all 60.

The next graph shows the speed-up as a function of the number of row cores. The shape of the graph is quite linear, but the slope is less than 1, as predicted by Amdahl's Law.


**Figure 7.** Speed Up

#### **4.1. Some TILE***Pro***64™ troubles**

Despite the enormous amount of silicon and functionality provided by the TILE*Pro*64™ processor, some flaws have been detected:

**•** It is quite hard to optimize the code using intrinsics or assembler; it would be nice if the documentation [13] contained examples, tips and tricks.

**•** The processor instruction set architecture contains basic instructions for bit and byte rearrangement at the register level; these include the "byte exchange", "byte/word interleave", and "masked merge word" instructions. However, it lacks bit-field extract and insert and byte/word shuffle instructions. These capabilities are incorporated in the new generation of Tilera processors, the TILE-Gx series.

**•** Correct use of branches is difficult, even for the compiler; branch mispredictions result in pipeline hazards that increase instruction latency. Fortunately, the feedback-based optimization technique [14] alleviates this issue, but it is cumbersome when optimizing source code.

**•** Finally, the most important limitation for video encoding is the amount of cache per core; 64 Kbytes of L2 is not enough for code and data, leading to many cache-capacity misses and therefore many stalled cycles. The TILE-Gx series has 256 Kbytes of L2 cache per core, without any doubt a must for achieving better video encoding performance.
#### **5. The future of video coding**

Video coding technology will not stop at H.264. A new draft standard known as HEVC (aka H.265 and MPEG-H Part 2) is still under development. It features important improvements over H.264 centered on achieving bit rate reductions of about 50% and supports a wider range of high definition resolutions. Computational complexity consequently increases by an estimated factor of two to ten, and maybe more.

The techniques by which those enhancements are realized should be analyzed from the point of view of our implementation.

As we mentioned before, the CABAC procedure as defined in H.264 is not amenable to parallelization. In HEVC special care has been taken to reduce data dependencies in the particular version of CABAC it implements. However, data dependencies have not disappeared, and this poses severe problems – albeit less so – for implementation on parallel processors.

In order to ameliorate spatial prediction, new modes have been defined in HEVC; in particular, the number of modes is 35 versus 9 in H.264. We have already said that 60% of the computation time is taken by the analysis and selection of the optimal encoding mode. Therefore one must expect a considerable increase in computational needs due simply to the number of prediction modes that must be explored; even more so given that the image is divided not into uniform macroblocks but into coding tree blocks (or coding units, CUs) with an inner structure of variable sizes of their own (64x64, 32x32, 16x16).

#### **6. Conclusion**


The overall performance of Tilera's TILE*Pro*64™ can be said to be outstanding for video coding applications. In the particular case of low-latency H.264 encoding, the largest difficulties arise for the highest resolutions of the video stream. For these, large amounts of memory are required, exceeding what is readily available within the processor, the main limitation resulting from the relatively small cache memory. Notwithstanding this fact, we have been able to find memory management schemes and workarounds that make real-time encoding possible even at the highest resolutions (4096x1716, 24 fps) contemplated in this work.

For other video codec applications, expectations are high. Inter-frame prediction can probably be traded off against lower resolutions. The main conclusion is that the processor architecture is adequate for established coders, whose bases were laid a few years ago and which are still the subject of implementation research.

New developments such as HEVC hold enormous promise, but the difficulties surrounding real-time implementation are challenging, to say the least. It is likely that several years of research are needed to advance significantly in that direction. Of course this raises the question of whether the architecture will be up to the coding schemes under development and/or what enhancements will be necessary.

The results of this work do not stop at video coding. Applications to novel fields such as virtual advertising and augmented reality in medicine are under study for current and future projects.

#### **Acknowledgements**

The authors gratefully acknowledge the support provided by project IDI-20100823 of the Spanish Government's *Ministerio de Economía y Competitividad*, under the leadership of Datatech SDA, who also acknowledges that support. Project TEC2009-14219-C03-01 also provided support for this work.


The authors also acknowledge the continuing support and cooperation of Datatech SDA for ongoing developments of Tilera's processor capabilities: real-time video analysis, virtual advertising and augmented reality in medicine.

#### **Author details**

José Parera-Bermúdez, Javier Casajús-Quirós and Igor Arambasic

\*Address all correspondence to: jose.parera@upm.es

Department of Signals, Systems and Radiocommunications, Polytechnic University of Madrid, Spain

#### **References**



[1] ITU-T Rec. H.264 | ISO/IEC 14496-10 version 16, *Advanced video coding for generic audiovisual services*, January 2012. http://www.itu.int/rec/dologin\_pub.asp?lang=e&id=T-REC-H.264-201201-I!!PDF-E&type=items (accessed 9 September 2012)

[2] RFC2326, *Real Time Streaming Protocol (RTSP)*, IETF, 1998. http://datatracker.ietf.org/doc/rfc2326/ (accessed 9 September 2012)

[3] RFC6184, *RTP Payload Format for H.264 Video*, IETF, 2011. http://datatracker.ietf.org/doc/rfc6184/ (accessed 9 September 2012)

[4] Tilera Corporation, *TILE Processor Architecture Overview for the TILEPro Series*, 2009.

[5] Tilera Corporation, *Multicore Development Environment: Programming the TILE Processor*, 2009.

[6] Takeuchi, Y., Nakata, Y., Kawaguchi, H. & Yoshimoto, M. *Scalable parallel processing for H.264 encoding application to multi/many-core processor*. International Conference on Intelligent Control and Information Processing (ICICIP), August 13-15, 2010, Dalian, China. doi: 10.1109/ICICIP.2010.5565292

[7] Gove D. *Multicore Application Programming: for Windows, Linux and Oracle Solaris*, Addison-Wesley, 2011.

[8] Richardson I. *The H.264 Advanced Video Compression Standard, Second Edition*, John Wiley & Sons, 2010.

[9] Wiegand T. & Sullivan G.J. *Overview of the H.264/AVC Video Coding Standard*, IEEE Transactions on Circuits and Systems for Video Technology 2003;13(7):560-576. doi: 10.1109/TCSVT.2003.815165

[10] Sullivan G.J., Topiwala P. & Luthra A. *The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions*, SPIE Conference on Applications of Digital Image Processing XXVII, Special Session on Advances in the New Emerging Standard: H.264/AVC, 2004. doi: 10.1117/12.564457

[11] Milani S. *Spatial prediction in the H.264/AVC FRExt coder and its optimization*, In: Miron S. (ed.) *Signal Processing*, Rijeka: InTech; 2010. http://www.intechopen.com/books/signal-processing/spatial-prediction-in-the-h-264-avc-frext-coder-and-its-optimization (accessed 9 September 2012)

[12] Park J.S. & Song, H.J. *Selective Intra Prediction Mode Decision for H.264/AVC Encoders*, World Academy of Science, Engineering and Technology 13, 2008. http://www.waset.org/journals/waset/v13/v13-104.pdf (accessed 9 September 2012)

[13] Tilera Corporation, *User Architecture Manual*, 2010.

[14] Tilera Corporation, *Optimization Guide*, 2010.

**Chapter 6**


### **Low Complexity Interpolation Filters for Motion Estimation and Application to the H.264 Encoders**

Georgios Georgis, George Lentaris and Dionysios Reisis

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51703

#### **1. Introduction**

Techniques for image super-resolution play an important role in a plethora of applications, which include video compression and motion estimation. The detection of the fractional displacements among frames facilitates the removal of temporal redundancy and improves the video quality by 2-4 dB PSNR [12], [2]. However, the increased complexity of the Fractional Motion Estimation (FME) process adds a significant computational load to the encoder and sets constraints on real-time designs. Researchers have performed timing analysis of the motion estimation process and report that FME accounts for almost half of the entire motion estimation period, which in turn accounts for 60-90% of the total encoding time, depending on the design configuration [12].

The FME relies on an interpolation procedure to increase the resolution of any frame region by generating sub-pixels between the original pixels. In mathematics, interpolation refers to the construction of an interpolant function whose plot covers (i.e. passes through) all required points. Known points of a sample area are referred to as having integer interval or displacement, depending on whether they are time- or frequency-domain (TD or FD) samples, respectively. Similarly, unknown samples which have to be approximated through an interpolant function are said to have fractional interval or displacement, respectively. In images, the interpolation takes place in a two-dimensional frequency-domain grid, where the problem of calculating fractional displacements can be simplified by focusing on an area of four initially known pixels which reside on the corners of a unit square (Fig. 1). Hence, regardless of the interpolation factor, it is adequate to calculate pixels with arbitrary displacements in the unit square and extend the calculation to every unit square which belongs to the frame.

© 2013 Georgis et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Most of the non-adaptive techniques presented in the bibliography are based on solving piecewise polynomial functions of varying degrees in order to calculate the interpolated signal. The resulting polynomial solution leads to sets of coefficients to be applied to consecutive sample points in the grid, which most often extend beyond the unit square. Examples of the above approach are, first, Bilinear interpolation [8], with first-order polynomials and using two pixels in each dimension, and second, Bicubic interpolation [9], which is derived from third-order polynomials and uses four pixels in each dimension. On the other hand, Lanczos interpolation coefficients [10] stem from windowing a *sinc* function; therefore, the number of pixels required by the Lanczos approach depends on the choice of the order of the interpolation function. More complex techniques applied to video encoding employ edge-detection, error-function minimization, or super-resolution (SR) procedures originating from theoretical signal processing methods. Among these techniques, the most commonly used is edge-detection, which characterizes pixels or areas in an image as belonging to an edge (luminance inconsistency). Edge-detection is also utilized for preventing aliasing frequency components from being encoded and transmitted.


Modern compression standards specify the exact filter to use in the Motion Compensation module, a fact allowing the encoder and the decoder to create and use identical reference frames. In particular, H.264/AVC specifies a 6-tap filter for generating sub-pixels between the pixels of the original image, which are called half-pixels, with accuracy 1/2 [3]. Also, it defines a low-cost 2-tap interpolation filter for generating sub-pixels between half- and original pixels, which are defined as quarter-pixels, with accuracy 1/4. Even though it is a common practice among encoder designers to integrate the standard 6-tap filter also in the Estimation module (before Compensation), the fact is that the interpolation technique used for detecting the displacements (not computing their residual) is an open choice following certain performance trade-offs.

Aiming at speeding up the Estimation, a process of considerably higher computational demand than the Compensation, this chapter builds on the potential to implement a lower complexity interpolation technique instead of using the costly H.264 6-tap filter. For this purpose, we show the results of integrating in the Estimation module several distinct interpolation techniques not included in the H.264 standard. We keep the standard H.264/AVC Compensation and we measure the impact of the above techniques, first on the time required to process the up-sampling and second on the video quality achieved by the prediction engine.

Related results in the bibliography include techniques which avoid or replace the standard computations [4] [5] [13], or minimize the search area [14]. Researchers in [4] calculate the number of operations required for each pixel in cases where 8-to-2-tap filters and the Sum of Absolute Differences (SAD) metric are utilized. Then they perform statistical analysis on CIF sequences encoded using bitrates from 0.5 to 1 Mbps, to determine the recurrence of a motion vector when the aforementioned filter lengths are applied. The authors of [5] and [13] initially focus on reducing the number of taps and the multiplication operations, by proposing a filter which requires only shifts and additions. Then they propose adaptive thresholds to bypass the interpolation process based on the computed SAD value. Recent developments towards replacing H.264/AVC (High Efficiency Video Coding, H.265 or MPEG-H Part 2) [16] combine Rate-Distortion minimization and adjustments to local image characteristics [15], [17], [18], [19]. Effectively, these techniques switch between standard and directionally adaptive interpolation kernels, and they take this decision by examining each frame either on a pixel or macroblock basis.

Conventional super-resolution (SR) techniques are generally considered to be prohibitively expensive when encoding video sequences. However, in many cases the learning-based super-resolution techniques are considered to be valid [20]. Consisting of a training phase, where low- and high-resolution image patches are matched, and a synthesis phase, where low-resolution patches kept in the dictionary are used to oversample, learning-based SR provides increased PSNR whilst expanding storage and memory access requirements. Researchers and engineers have also focused on methodologies for designing the H.264 6-tap filter which are able to efficiently support its increased memory requirements [2] [6] [7]. The H.264 filter needs a considerable amount of data to be stored for its operation because its specification includes a kernel with coefficients [1, -5, 20, 20, -5, 1], which are multiplied with six consecutive pixels of the frame, either in column or row format. The resulting six products are accumulated and normalized for the generation of a single half-pixel, which is produced between the 3*rd* and the 4*th* tap. The operation described above must be repeated for producing each "horizontal" and "vertical" half-pixel by sliding the kernel on the frame, both in row and column order. Moreover, there exist as many "diagonal" half-pixels to be generated by applying the kernel on previously computed horizontal or vertical half-pixels. That is to say, depending on its position, we must process 6 or 36 frame pixels to compute a single half-pixel (see the sketch below). To avoid the cost of implementing the H.264 filter in the Estimation module, the current chapter studies a set of interpolation techniques and compares their performance. The techniques presented here are similar to the standard filter but they use fewer than 6 taps [8] [9] [10]. Moreover, a subset of these techniques features the exploitation of gradients in the image [11].
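For reference, the accumulate-and-normalize step just described reduces to a few integer operations; the sketch below applies the standard kernel with the usual rounding offset of 16 before the shift by 5 (the coefficients sum to 32):

```c
#include <stdint.h>

/* Clip to the 8-bit pixel range. */
static inline int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Half-pixel between p[2] and p[3] using the H.264 6-tap kernel
 * [1, -5, 20, 20, -5, 1]; p points to six consecutive integer pixels
 * taken in row or column order. */
static uint8_t h264_half_pixel(const uint8_t p[6])
{
    int acc = p[0] - 5 * p[1] + 20 * p[2] + 20 * p[3] - 5 * p[4] + p[5];
    return (uint8_t)clip255((acc + 16) >> 5);
}
```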

The chapter is organized as follows: Section 2 presents three commonly used interpolation techniques, proposes three novel techniques and describes the differences between the commonly used and the proposed ones. Section 3 reports the performance results achieved by the interpolation techniques and, by comparing these, shows the gains of using the proposed ones. Finally, Section 4 concludes the chapter.

#### **2. Interpolation techniques**


The current section presents six interpolation techniques. The first three are known in the literature and commonly used; the other three have been recently introduced [13] and their design targets the improvement of the interpolation process.

**2.2 Bicubic**

plications.

**2.3 Lanczos**

12 <sup>50</sup>*π*<sup>2</sup> , - <sup>12</sup>

<sup>9</sup>*π*<sup>2</sup> , <sup>6</sup> *<sup>π</sup>* <sup>2</sup> , <sup>6</sup>

zontal (HH) pixel *Y <sup>D</sup>*

in Fig. 1.

*<sup>π</sup>* <sup>2</sup> , - <sup>12</sup>

**2.4 Data-Dependent Triangulation**

<sup>9</sup>*π*<sup>2</sup> , <sup>12</sup>

tion compensation with coefficients 1, - 5,20,20, - 5,1 .

*HH* at (*i*, *j*+

*YD*

*YD*

*HH* <sup>=</sup>*Cli pdivD*

*HV* <sup>=</sup>*Cli pdivD*

*<sup>R</sup>* (*w*1*Yg*

*<sup>R</sup>* (*w*1*Yg*

1

The Bicubic technique uses as a base the solution of third order polynomials [9]. In this chapter we examine the parameterized form of the underlying equation using *a* ∈ [−1, 0] to provide sharpness variance in the interpolated image. We focus on the following val‐ ues: *a*= −1, *a*= −0.75, and *a*=−0.5. These values result in three distinct kernels, which are characterized by the convolution coefficients -1,5, 5, - 1 , -3,19,19, - 3 and -1,9, 9, - 1 , respectively. Such a quadruplet is multiplied with four (4) consecutive image pixels to gen‐ erate their intermediate half-pixel. To compute the half-diagonal pixel, the Bicubic techni‐ que requires the calculations of the corresponding four half-horizontal (a total of 16 multiplications) and then apply the coefficients on the resulting pixels to produce the tar‐ get half-diagonal. Hence, overall it uses 16 image pixels with the requirement of 20 multi‐

Low Complexity Interpolation Filters for Motion Estimation and Application to the H.264 Encoders

This technique is similar to the H.264/AVC interpolation and with a third order Lanczos equation, it uses a 6-tap FIR filter. Overall, the technique bases on the Sinc function [10]. In this chapter we examine the kernel with coefficients given by

procedure, as in the case of the H.264/AVC filter (a single half-diagonal pixel depends on 36 integer pixels). Note here that, the H.264/AVC standard defines a 6-tap filter for use in mo‐

The first of the recently introduced techniques in [13] is actually a modification of the ap‐ proach, which was presented in [11]. The authors in [11] use an edge-detection technique for determining the exact set of integer pixels, which will be given as input to the interpolation function. We study here a special case of Data-Dependent Triangulation (DDT), which ex‐ amines only 4 pixels. To describe the technique, we consider the generation of the half-hori‐

We examine the luma differences of pixels {*g*, *h*, *q*, *r*} to determine whether an edge crosses their enclosed region: if it holds that |*Yg* - *Yr*|> |*Yh* - *Yq*| , then we will detect an edge at *hq*, else we will detect an edge at *rg*. In the first case, that is there is an edge at *hq* which is denoted as *<sup>h</sup> ED <sup>q</sup>* , we assume that pixels {*g*, *h*, *q*} form a homogeneous triangular and we compute:

<sup>2</sup> ) and the half-vertical (HV) pixel *Y <sup>D</sup>*

+ *w*1*Yh* + *w*2*Yq*)

+ *w*1*Yq*

<sup>50</sup>*π*<sup>2</sup> . Lanczos half-pixels are generated by a trivial convolution

*HV* at (*i*+

<sup>+</sup> *<sup>w</sup>*2*Yh* ) (1)

1

http://dx.doi.org/10.5772/51703

141

<sup>2</sup> , *j*) as shown

**Figure 1.** Pixels on the image grid and magnification of a 1×1 area showing sub-pixel positions (right). The symbols facilitate the description of filters.

Each video frame consists of pixels, and we consider each pixel of the original image located at a distinct position (*i*, *j*) of a two-dimensional (2D) grid, with *i*, *j* ∊ N denoting the vertical and horizontal coordinates of the pixel, respectively. The sub-pixels can be generated next to any pixel (*i*, *j*) at the positions (*i*+*k*, *j*+*l*) with *k*, *l* ∊ {0, 1/4, 1/2, 3/4}.

We distinguish between quarter-pixels and half-pixels, for which *k*, *l* ∉ {1/4, 3/4}. The half-pixels are further categorized as half-horizontal, half-vertical, or half-diagonal (those located at the positions given by (*i* + 1/2, *j* + 1/2)). Fig. 1 depicts part of the original image grid and magnifies a small area, while the right-hand side magnifies an interior square region to show all sub-pixel positions (according to H.264/AVC). Moreover, Fig. 1 marks pixels and regions on the grid with designated letters, a notation to be followed for the remainder of the chapter.

A half-pixel is generated by an interpolation procedure operating on a set of neighboring, integer-position pixels located around the position of interest. We study the following interpolations:

#### **2.1 Bilinear**

This technique is actually the simplest of all the techniques presented in this chapter. In practice, it consists of a simple averaging of the two original pixels adjacent to the half-horizontal or half-vertical pixel to be generated (i.e., a 2-tap FIR filter) [8]. For the half-diagonal (HD), the technique computes the average of the four pixels {*g*, *h*, *q*, *r*} surrounding the half-diagonal position, as shown in Fig. 1. A minimal sketch follows.
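In the sketch below the rounding offsets are our choice, not stated in the chapter; the pixel names follow Fig. 1:

```c
#include <stdint.h>

/* Bilinear half-horizontal (or half-vertical): rounded 2-tap average. */
static uint8_t bilinear_hh(uint8_t g, uint8_t h)
{
    return (uint8_t)((g + h + 1) >> 1);
}

/* Bilinear half-diagonal: rounded average of the square {g, h, q, r}. */
static uint8_t bilinear_hd(uint8_t g, uint8_t h, uint8_t q, uint8_t r)
{
    return (uint8_t)((g + h + q + r + 2) >> 2);
}
```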

#### **2.2 Bicubic**

The Bicubic technique uses as its base the solution of third-order polynomials [9]. In this chapter we examine the parameterized form of the underlying equation using *a* ∈ [−1, 0] to provide sharpness variance in the interpolated image. We focus on the following values: *a* = −1, *a* = −0.75, and *a* = −0.5. These values result in three distinct kernels, characterized by the convolution coefficients [-1, 5, 5, -1], [-3, 19, 19, -3] and [-1, 9, 9, -1], respectively. Such a quadruplet is multiplied with four consecutive image pixels to generate their intermediate half-pixel. To compute the half-diagonal pixel, the Bicubic technique requires the calculation of the corresponding four half-horizontal pixels (a total of 16 multiplications) and then applies the coefficients to the resulting pixels to produce the target half-diagonal. Hence, overall it uses 16 image pixels and requires 20 multiplications.
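As a sketch, the *a* = −0.5 kernel can be applied as follows; normalizing by the coefficient sum (16 here, hence a shift by 4; the sums are 8 and 32 for the other two kernels) and the rounding offset are our reading, not spelled out in the chapter:

```c
#include <stdint.h>

static inline int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* One bicubic half-pixel from four consecutive pixels with the a = -0.5
 * kernel [-1, 9, 9, -1]. */
static uint8_t bicubic_half_pixel(const uint8_t p[4])
{
    int acc = -p[0] + 9 * p[1] + 9 * p[2] - p[3];
    return (uint8_t)clip255((acc + 8) >> 4);
}
```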

#### **2.3 Lanczos**


This technique is similar to the H.264/AVC interpolation: with a third-order Lanczos equation, it uses a 6-tap FIR filter. Overall, the technique is based on the sinc function [10]. In this chapter we examine the kernel with coefficients [12/(50π²), -12/(9π²), 6/π², 6/π², -12/(9π²), 12/(50π²)]. Lanczos half-pixels are generated by a trivial convolution procedure, as in the case of the H.264/AVC filter (a single half-diagonal pixel depends on 36 integer pixels). Note here that the H.264/AVC standard defines a 6-tap filter for use in motion compensation with coefficients [1, -5, 20, 20, -5, 1].

#### **2.4 Data-Dependent Triangulation**

The first of the recently introduced techniques in [13] is actually a modification of the approach presented in [11]. The authors in [11] use an edge-detection technique for determining the exact set of integer pixels which will be given as input to the interpolation function. We study here a special case of Data-Dependent Triangulation (DDT), which examines only 4 pixels. To describe the technique, we consider the generation of the half-horizontal (HH) pixel *Y<sub>D</sub>*<sup>HH</sup> at (*i*, *j* + 1/2) and the half-vertical (HV) pixel *Y<sub>D</sub>*<sup>HV</sup> at (*i* + 1/2, *j*), as shown in Fig. 1.

We examine the luma differences of pixels {*g*, *h*, *q*, *r*} to determine whether an edge crosses their enclosed region: if it holds that |*Y<sub>g</sub>* - *Y<sub>r</sub>*| > |*Y<sub>h</sub>* - *Y<sub>q</sub>*|, then we detect an edge at *hq*; otherwise we detect an edge at *rg*. In the first case, that is, when there is an edge at *hq*, denoted as <sup>h</sup>ED<sup>q</sup>, we assume that pixels {*g*, *h*, *q*} form a homogeneous triangle and we compute:

$$\begin{aligned} Y_D^{HH} &= \mathrm{Clip}_{div_D}^{R}\left(w_1 Y_g + w_1 Y_h + w_2 Y_q\right) \\ Y_D^{HV} &= \mathrm{Clip}_{div_D}^{R}\left(w_1 Y_g + w_1 Y_q + w_2 Y_h\right) \end{aligned} \tag{1}$$

where Clip<sup>R</sup><sub>div<sub>D</sub></sub> is a normalization function (it divides by *div<sub>D</sub>* = 2*w*<sub>1</sub> + *w*<sub>2</sub> and clips the value to [0, 255]). Factors *w*<sub>1</sub> > *w*<sub>2</sub> are used to increase the luma weights of the neighbors residing next to the generated sub-pixel. The examination of a large number of factors has resulted in the highest PSNR for *w*<sub>1</sub> = 7 and *w*<sub>2</sub> = 2 (given that *div<sub>D</sub>* = 2<sup>4</sup>). The second case refers to the detection of an edge at *rg* (when there is the edge <sup>r</sup>ED<sup>g</sup>). In this case, we use the same idea as above (orientation and weights) but we modify accordingly the luma inputs of (1). In the case of a homogeneous square *ghqr* the technique degenerates to a simple bilinear filter (i.e. *w*<sub>1</sub> = 1, *w*<sub>2</sub> = 0). A sketch of the computation follows.
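Equation (1) maps directly to integer code, as in the sketch below. The rounding offset and the mirrored inputs for the <sup>r</sup>ED<sup>g</sup> case are our assumptions, since the chapter only says the luma inputs are "modified accordingly":

```c
#include <stdint.h>
#include <stdlib.h>   /* abs */

static inline int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* DDT half-pixels for the unit square {g, h, q, r} of Fig. 1, with the
 * highest-PSNR weights w1 = 7, w2 = 2, so div_D = 2*w1 + w2 = 16 and the
 * normalization is a shift by 4. */
static void ddt_half_pixels(int Yg, int Yh, int Yq, int Yr,
                            uint8_t *hh, uint8_t *hv)
{
    const int w1 = 7, w2 = 2;
    if (abs(Yg - Yr) > abs(Yh - Yq)) {   /* edge at hq: triangle {g, h, q}   */
        *hh = (uint8_t)clip255((w1 * Yg + w1 * Yh + w2 * Yq + 8) >> 4);
        *hv = (uint8_t)clip255((w1 * Yg + w1 * Yq + w2 * Yh + 8) >> 4);
    } else {                              /* edge at rg: assumed mirrored use */
        *hh = (uint8_t)clip255((w1 * Yg + w1 * Yh + w2 * Yr + 8) >> 4);
        *hv = (uint8_t)clip255((w1 * Yg + w1 * Yq + w2 * Yr + 8) >> 4);
    }
}
```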


The technique generates the half-diagonal pixel by including a second gradient check, which follows the detection of the edge $^{h}E^{D}_{q}$ or the edge $^{r}E^{D}_{g}$. The idea is to identify the most homogeneous triangle in the enclosed area $A_2$ shown in Fig. 1. Thereby, in the case of $^{h}E^{D}_{q}$, we check $|Y_g - Y_q| + |Y_g - Y_h| < |Y_r - Y_q| + |Y_r - Y_h|$, otherwise we check $|Y_h - Y_g| + |Y_h - Y_r| < |Y_q - Y_g| + |Y_q - Y_r|$, to decide if the HD pixel resides *above* (<) or *below* (>) the edge. Extending our notation with *abv* and *blw* superscripts, we describe the modified DDT (mDDT) computation as:

$$Y_D^{HD} = \begin{cases} \mathrm{Clip}_{div_D}^{R}\{w_1 Y_h + w_1 Y_q + w_2 Y_g\} & \text{if } ^{h}E^{D,abv}_{q} \\ \mathrm{Clip}_{div_D}^{R}\{w_1 Y_h + w_1 Y_q + w_2 Y_r\} & \text{if } ^{h}E^{D,blw}_{q} \end{cases} \tag{2}$$

where the values of $w_1$, $w_2$ and $\mathrm{Clip}_{div_D}^{R}$ are as described in (1).

An alternative approach uses equation (1) to develop a simpler HD generation technique, which we call *mDDT'*; it relies directly on the first DDT check and performs a bilinear operation on the two pixels of the detected edge, i.e., $Y_{D'}^{HD} = \mathrm{Clip}_{2}^{R}(Y_h + Y_q)$ if $^{h}E^{D}_{q}$.

We further improve the *mDDT'* and produce the *mDDT''* technique by modifying the final operation to subtract the remaining two off-diagonal pixels (as a high-pass FIR), i.e., $Y_{D''}^{HD} = \mathrm{Clip}_{D''}^{R}(w_1 Y_h + w_1 Y_q - w_2 Y_g - w_2 Y_r)$ if $^{h}E^{D}_{q}$. Although the latter operation increases the amount of calculations, it results in better PSNR compared to the *mDDT'*.
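A sketch of the *mDDT''* operation, reusing the `clip_div` helper defined earlier: note that the normalization $div_{D''} = 2(w_1 - w_2)$ is our assumption (it keeps the DC gain of the high-pass kernel equal to one), since the text does not state the value of $\mathrm{Clip}_{D''}^{R}$ explicitly.

```c
/* mDDT'' half-diagonal generation: bilinear across the detected edge plus
 * a high-pass correction from the two off-diagonal pixels. We reuse
 * w1 = 7, w2 = 2; div = 2*(w1 - w2) = 10 is an assumed normalization. */
static unsigned char mddt2_hd(int Yg, int Yh, int Yq, int Yr)
{
    const int w1 = 7, w2 = 2, div = 2 * (w1 - w2);
    int acc;

    if (abs(Yg - Yr) > abs(Yh - Yq))          /* edge along h-q */
        acc = w1 * Yh + w1 * Yq - w2 * Yg - w2 * Yr;
    else                                      /* edge along r-g */
        acc = w1 * Yr + w1 * Yg - w2 * Yh - w2 * Yq;
    return clip_div(acc, div);
}
```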

#### **2.5 CrossHD**


The second approach is called CrossHD [13] and is based on an edge-oriented technique. The advantage of CrossHD compared to the DDT mentioned above is that it improves on the locality of the aforementioned DDT detections by comparing the luminance difference of areas instead of single pixels. This technique computes the luma of a small square area by adding the pixels located at its four corners. For instance, for the example given in Fig. 1, we get that $Y_{A_1} = Y_c + Y_d + Y_g + Y_h$. The technique examines the outcome of the $|Y_{A_4} - Y_{A_5}| > |Y_{A_1} - Y_{A_3}|$ operation to decide if there exists a vertical (>) or horizontal (<) edge crossing the area $A_2$. In the case of a vertical edge crossing the area $A_2$, we examine independently the areas $A_1$, $A_2$ and $A_3$ by using the simple DDT check to identify the directions of the edges crossing each of these three areas. The majority of the edge directions found within $A_1$, $A_2$ and $A_3$ refines the assumed edge direction within $A_2$, i.e., we conclude if $^{h}E^{\chi}_{q}$ or $^{r}E^{\chi}_{g}$. Note that, in the case of examining whether there exists a horizontal edge, the technique examines the areas $A_4$, $A_2$ and $A_5$. Finally, the HD pixel is generated by averaging the pixels that reside on the detected edge: $Y_{\chi}^{HD} = \mathrm{Clip}_{2}^{R}(Y_h + Y_q)$ if $^{h}E^{\chi}_{q}$, or $Y_{\chi}^{HD} = \mathrm{Clip}_{2}^{R}(Y_r + Y_g)$ if $^{r}E^{\chi}_{g}$. If the technique does not detect any edge (i.e., in the homogeneous square $A_2$), it averages the pixels {*g*, *h*, *q*, *r*}.
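The following sketch outlines the CrossHD decision logic. It is our own illustration: the exact pixels forming the areas $A_1$ to $A_5$ follow Fig. 1, so the caller is assumed to fill the corner structures accordingly, the homogeneity test is one possible reading of "does not detect any edge", and `clip_div` is the helper defined earlier.

```c
/* Four corner pixels of one small square area of Fig. 1. */
struct area { int tl, tr, bl, br; };

/* Simple DDT check on one area: +1 when the edge runs along the
 * anti-diagonal (tr-bl, the "hq"-like direction), -1 otherwise. */
static int ddt_dir(struct area a)
{
    return abs(a.tl - a.br) > abs(a.tr - a.bl) ? +1 : -1;
}

static int area_luma(struct area a)
{
    return a.tl + a.tr + a.bl + a.br;
}

/* CrossHD half-diagonal generation; A2 is the square {g, h, q, r}
 * enclosing the HD position (tl = g, tr = h, bl = q, br = r). */
static unsigned char crosshd_hd(struct area A1, struct area A2,
                                struct area A3, struct area A4,
                                struct area A5)
{
    int vote;

    /* Homogeneous square (assumed test): average all four pixels. */
    if (abs(A2.tl - A2.br) == abs(A2.tr - A2.bl) &&
        area_luma(A4) == area_luma(A5) && area_luma(A1) == area_luma(A3))
        return clip_div(area_luma(A2), 4);

    /* Vertical (>) or horizontal (<) edge crossing A2, decided on areas. */
    if (abs(area_luma(A4) - area_luma(A5)) >
        abs(area_luma(A1) - area_luma(A3)))
        vote = ddt_dir(A1) + ddt_dir(A2) + ddt_dir(A3);
    else
        vote = ddt_dir(A4) + ddt_dir(A2) + ddt_dir(A5);

    /* Majority refines the edge direction; average along the edge. */
    if (vote > 0)
        return clip_div(A2.tr + A2.bl, 2);   /* edge h-q */
    else
        return clip_div(A2.tl + A2.br, 2);   /* edge r-g */
}
```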

#### **2.6 CxScale**

The third approach extends the aforementioned ideas to develop a technique called CxScale [13], which improves both the edge detection and the subsequent kernel selection. Here, the edge detection mechanism examines the luma gradients over an area of 8 neighboring integer pixels and the half-pixels are generated afterwards via a conditional use of bilinear and bicubic interpolators. The technique includes three steps:

**1.** The detection of a horizontal or vertical edge.

**2.** The possible refinement of its direction to an assumed diagonal.

**3.** The selection of inputs to a bicubic or a bilinear function.


The specifics of these steps depend on the position of the half-pixel to be generated. Beginning with the HH pixel, we examine $|(Y_f + Y_g) - (Y_h + Y_o)| < |(Y_c + Y_d) - (Y_q + Y_r)|$ to detect a horizontal edge, i.e., $^{h}_{g}E_c^{HH}$. When we detect a vertical edge instead (when ">"), we refine its direction by checking:

$$\begin{aligned} &\text{assume } ^{q}_{d}E_c^{HH} \text{ if } |Y_c - Y_r| > |Y_d - Y_q| \text{ (from } q \text{ to } d\text{)} \\ &\text{assume } ^{r}_{c}E_c^{HH} \text{ if } |Y_c - Y_r| < |Y_d - Y_q| \text{ (from } r \text{ to } c\text{)} \\ &\text{assume } ^{A_1}_{A_3}E_c^{HH} \text{ if } |Y_c - Y_r| = |Y_d - Y_q| \text{ (strictly vertical)} \end{aligned}$$

Else, we assume a homogeneous area. Finally, we compute

$$Y_C^{HH} = \begin{cases} \mathrm{Clip}_{32}^{R}\{-3Y_f + 19Y_g + 19Y_h - 3Y_o\} & \text{if } ^{h}_{g}E_c^{HH} \\ \mathrm{Clip}_{32}^{R}\{-3Y_c + 19Y_d + 19Y_q - 3Y_r\} & \text{if } ^{q}_{d}E_c^{HH} \\ \mathrm{Clip}_{32}^{R}\{-3Y_d + 19Y_c + 19Y_r - 3Y_q\} & \text{if } ^{r}_{c}E_c^{HH} \\ \mathrm{Clip}_{2}^{R}\{Y_g + Y_h\} & \text{otherwise} \end{cases} \tag{3}$$


Similarly, the generation of the HV pixel begins by examining $|(Y_c + Y_g) - (Y_q + Y_u)| < |(Y_f + Y_p) - (Y_h + Y_r)|$ to detect a vertical edge, i.e., $^{q}_{g}E_c^{HV}$. If we detect a horizontal edge instead (>), we refine its direction and we compute the pixel $Y_C^{HV}$ as follows:

$$\begin{aligned} &\text{assume } ^{p}_{h}E_c^{HV} \text{ if } |Y_f - Y_r| > |Y_p - Y_h| \text{ (from } p \text{ to } h\text{)} \\ &\text{assume } ^{f}_{r}E_c^{HV} \text{ if } |Y_f - Y_r| < |Y_p - Y_h| \text{ (from } f \text{ to } r\text{)} \\ &\text{assume } ^{A_5}_{A_4}E_c^{HV} \text{ if } |Y_f - Y_r| = |Y_p - Y_h| \text{ (strictly horizontal)} \end{aligned}$$

$$Y_C^{HV} = \begin{cases} \mathrm{Clip}_{32}^{R}\{-3Y_c + 19Y_g + 19Y_q - 3Y_u\} & \text{if } ^{q}_{g}E_c^{HV} \\ \mathrm{Clip}_{32}^{R}\{-3Y_f + 19Y_h + 19Y_p - 3Y_r\} & \text{if } ^{p}_{h}E_c^{HV} \\ \mathrm{Clip}_{32}^{R}\{-3Y_h + 19Y_f + 19Y_r - 3Y_p\} & \text{if } ^{f}_{r}E_c^{HV} \\ \mathrm{Clip}_{2}^{R}\{Y_g + Y_q\} & \text{otherwise} \end{cases} \tag{4}$$

To conclude the CxScale description, we refer to the HD pixel generation, which begins by examining $|(Y_b + Y_g) - (Y_r + Y_w)| > |(Y_e + Y_h) - (Y_q + Y_t)|$ to detect an edge at $^{h}_{q}E_c^{HD}$. Otherwise, we assume $^{r}_{g}E_c^{HD}$. Then

$$Y_C^{HD} = \begin{cases} \mathrm{Clip}_{32}^{R}\{-3Y_e + 19Y_h + 19Y_q - 3Y_t\} & \text{if } ^{h}_{q}E_c^{HD} \\ \mathrm{Clip}_{32}^{R}\{-3Y_b + 19Y_g + 19Y_r - 3Y_w\} & \text{if } ^{r}_{g}E_c^{HD} \end{cases} \tag{5}$$
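Equation (5) reduces to a single 4-tap filtering along the detected diagonal; a sketch in C follows (names are ours, the diagonal layout of *b*, *e*, *g*, *h*, *q*, *r*, *t*, *w* is assumed from the text, and `clip_div` is the helper defined earlier).

```c
/* CxScale half-diagonal pixel, Eq. (5): 4-tap bicubic (-3, 19, 19, -3)/32
 * along the detected diagonal. Pixels e, h, q, t are assumed to lie on the
 * hq diagonal and b, g, r, w on the rg diagonal, following Fig. 1. */
static unsigned char cxscale_hd(int Yb, int Ye, int Yg, int Yh,
                                int Yq, int Yr, int Yt, int Yw)
{
    int acc;

    if (abs((Yb + Yg) - (Yr + Yw)) > abs((Ye + Yh) - (Yq + Yt)))
        acc = -3 * Ye + 19 * Yh + 19 * Yq - 3 * Yt;   /* edge at hq */
    else
        acc = -3 * Yb + 19 * Yg + 19 * Yr - 3 * Yw;   /* edge at rg */
    return clip_div(acc, 32);
}
```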

#### **3. Performance Evaluation**

To evaluate the performance of the interpolation techniques in the considered application, we execute multiple motion estimation procedures and the entire application is completed by including the standard H.264/AVC motion compensation. For the realization of each test, we let the estimation procedure employ one of the six interpolation techniques described in the previous section, which will detect the fractional motion. The compensation procedure is based solely on the resulting motion vectors for constructing the frame-predictors according to the standard 6-tap filter. Hence, we use a setup which ensures that the encoder and the decoder will still be able to use identical reference frames for their predictions, i.e., we avoid the accumulation of errors that would be introduced to the coding process if the encoder and the decoder interpolated with different filters. More specifically, the estimation algorithm computes the Sum of Absolute Differences (SAD) for comparing 4×4 pixel candidates and it operates in two phases:

**1.** A "Diamond Search" matches the block to the best integer position candidate,


**2.** An exhaustive search in the vicinity of the integer match detects fractional motion by examining 8 candidate blocks located at distance ±1/2 pixels.
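The sketch below illustrates this second phase under the assumption that the interpolation technique under test has produced a 2x-upsampled reference plane, so that the 8 half-pel candidates are the immediate neighbors of the integer match (all names and the plane layout are our own illustration).

```c
#include <limits.h>
#include <stdlib.h>

/* SAD between a 4x4 block of the current frame and a candidate in a
 * reference plane sampled with step `step` in both directions (step = 2
 * lets us read candidates from a 2x-upsampled reference plane). */
static int sad4x4(const unsigned char *cur, int cur_stride,
                  const unsigned char *ref, int ref_stride, int step)
{
    int sad = 0, x, y;
    for (y = 0; y < 4; y++)
        for (x = 0; x < 4; x++)
            sad += abs((int)cur[y * cur_stride + x] -
                       (int)ref[y * step * ref_stride + x * step]);
    return sad;
}

/* Phase 2: exhaustive half-pel search around the integer match found by
 * the Diamond Search. `ref2x` points at the integer match inside the
 * 2x-upsampled reference; its 8 neighbors at distance 1 are the +/- 1/2
 * pixel candidates. Returns the best SAD and the half-pel displacement. */
static int best_half_pel(const unsigned char *cur, int cur_stride,
                         const unsigned char *ref2x, int stride2x,
                         int *dx, int *dy)
{
    static const int off[9][2] = {
        {0, 0}, {-1, -1}, {0, -1}, {1, -1}, {-1, 0},
        {1, 0}, {-1, 1}, {0, 1}, {1, 1}
    };
    int best = INT_MAX, i;

    for (i = 0; i < 9; i++) {
        const unsigned char *cand = ref2x + off[i][1] * stride2x + off[i][0];
        int sad = sad4x4(cur, cur_stride, cand, stride2x, 2);
        if (sad < best) {
            best = sad;
            *dx = off[i][0];
            *dy = off[i][1];
        }
    }
    return best;
}
```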

Overall, the only parameter varying in this scheme is the interpolation technique used in the second phase of the algorithm, and thus, the quality variations among the output sequences (predictor frames) depend only on the efficiency of the interpolation. The results are shown in the following test reports, which display the PSNR of the output sequences and, in particular, the DPSNR for each interpolation technique.

We have performed the simulations to measure the quality and the processing time by testing a variety of well-known videos with up to five frame resolutions each. The simulation setup, with *videos*, number of frames and *resolution*, has been: the *car-phone* with 90 frames, the *foreman* with 400 frames and the *container* with 300 frames in *QCIF*; the *coastguard*, *foreman* and *news* with 300 frames each in *CIF*; and finally, the *blue sky*, *pedestrian*, *riverbed* and *rush-hour* with 100 frames each in *SD1*, *720p* and *1080p*.


| Resolution | H.264 | Nearest Neighbor | Bicubic (a=-1) | Bicubic (a=-0.5) | Bicubic (a=-0.75) | Lanczos | CxScale | DDT |
|---|---|---|---|---|---|---|---|---|
| QCIF | 35.0379 | -2.2069 | -0.0142 | -0.0214 | -0.0105 | 0.0009 | -0.3359 | -0.2265 |
| CIF | 34.2930 | -1.4229 | -0.0150 | -0.0340 | -0.0166 | -0.0042 | -0.3697 | -0.1994 |
| SD1 | 33.1775 | -0.5483 | -0.0170 | -0.0192 | -0.0118 | -0.0030 | -0.2071 | -0.1249 |
| 720p | 32.3743 | -0.3316 | -0.0130 | -0.0151 | -0.0096 | -0.0029 | -0.1021 | -0.0866 |
| 1080p | 33.0837 | -0.2084 | -0.0122 | -0.0172 | -0.0123 | -0.0042 | -0.0810 | -0.0697 |
| **total** | **33.4971** | **-0.8843** | **-0.0144** | **-0.0209** | **-0.0120** | **-0.0027** | **-0.2116** | **-0.1372** |

**Table 1.** PSNR of the H.264/AVC filter and DPSNR of other techniques when estimating in HH+HV positions (with H.264 compensation).

Our prediction engine is written in C; it uses one reference frame and is designed to efficiently substitute any filter. We begin by distinguishing between horizontal/vertical and diagonal interpolation. Table 1 reports the PSNR results of the algorithm examining fractional displacements only at the horizontal and vertical directions (4 candidates). The table shows the results of two 6-tap filters (H.264/AVC, Lanczos), three 4-tap filters (Bicubic), and two edge-detection based techniques (DDT, CxScale). Moreover, for the sake of comparison, we include the PSNR results achieved by the Nearest Neighbor (NN) technique [8], which evades interpolation computations by simply forwarding the value of the integer pixel next to the HH/HV position; practically, this technique does not involve fractional motion detection. Table 1 shows the low PSNR results of NN.


The NN results point out that, even with only 4 HH/HV candidates, the algorithm improves its prediction quality by up to 2 dB at low frame resolutions. Using another technique, the Lanczos 6-tap filter, results in almost equivalent quality to the standard H.264 filter. We approximated the Lanczos coefficients by integer values to achieve low complexity operations.

The exact values of the coefficients were set after extensive testing to 3, -17, 78, 78, -17, 3. The performance of the remaining filters lies between the above two extremes of six taps (Lanczos) and zero taps (NN). More precisely, the best quality was achieved with the Bicubic filters. We have examined the performance of several Bicubic kernels with parameters -*a* ∈ {7/8, 6/8, 5/8, 4/8, 3/8, 2/8} and we report the most prominent of these in Table 1. As shown, for most frame resolutions the kernel with coefficients -3, 19, 19, -3 maximizes the quality and limits the expected PSNR degradation to almost 0.01 dB compared to the H.264 filter. That is, although the kernel with coefficients -1, 5, 5, -1 seems, intuitively, a better approximation of the 1, -5, 20, 20, -5, 1 kernel of H.264 (an approximation achieved by merging the marginal taps, i.e., by assuming equal values for the corresponding pixels), the experimental results are in favor of *a* = -0.75. For this reason, CxScale adopts the kernel with coefficients -3, 19, 19, -3 for its Bicubic filtering. Edge-detection based techniques degrade the quality by 0.1 dB, a fact indicating that their induced error surface deviates from the error surface of the 6-tap filters. However, we note that if we omit the H.264 compensation, these edge-detection based techniques prevail in terms of PSNR, as well as subjective criteria, by up to 0.1 dB even when compared to 6-tap filters, especially in high-definition videos. Table 1 also shows that the performance of the DDT and the CxScale techniques improves as the frame resolution increases.
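The quoted 4-tap values follow directly from Keys' cubic convolution kernel [9]; the short derivation below is ours and shows where they come from:

$$W(x) = \begin{cases} (a+2)|x|^3 - (a+3)|x|^2 + 1 & 0 \le |x| \le 1 \\ a|x|^3 - 5a|x|^2 + 8a|x| - 4a & 1 < |x| < 2 \end{cases}$$

At the half-pixel position the four taps are $W(\tfrac{3}{2})$, $W(\tfrac{1}{2})$, $W(\tfrac{1}{2})$, $W(\tfrac{3}{2})$ with

$$W(\tfrac{1}{2}) = \tfrac{1}{2} - \tfrac{a}{8}, \qquad W(\tfrac{3}{2}) = \tfrac{a}{8},$$

so $a = -\tfrac{3}{4}$ gives $\tfrac{19}{32}$ and $-\tfrac{3}{32}$, i.e., the kernel (-3, 19, 19, -3)/32, while $a = -1$ gives the kernel (-1, 5, 5, -1)/8.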


| Resolution | H.264 | Nearest Neighbor | Bicubic (a=-1) | Bicubic (a=-0.5) | Bicubic (a=-0.75) | Lanczos |
|---|---|---|---|---|---|---|
| QCIF | 34.7318 | -1.8864 | -0.0288 | -0.0436 | -0.0143 | 0.0004 |
| CIF | 33.9850 | -1.1102 | -0.0145 | -0.0423 | -0.0148 | -0.0016 |
| SD1 | 33.1292 | -0.4790 | -0.0247 | -0.0241 | -0.0117 | -0.0032 |
| 720p | 32.3766 | -0.3178 | -0.0176 | -0.0188 | -0.0092 | -0.0031 |
| 1080p | 33.0869 | -0.1979 | -0.0146 | -0.0223 | -0.0119 | -0.0045 |
| **total** | **33.3785** | **-0.7512** | **-0.0202** | **-0.0292** | **-0.0121** | **-0.0004** |

**Table 2.** PSNR of the H.264/AVC filter and DPSNR of Nearest Neighbor, Bicubic and Lanczos when estimating in HD positions (with H.264 compensation).

| Resolution | CxScale | mDDT | [11] | CrossHD | mDDT' |
|---|---|---|---|---|---|
| QCIF | -0.1010 | -0.1728 | -0.1095 | -0.1299 | -0.1595 |
| CIF | -0.0740 | -0.1731 | -0.1219 | -0.1414 | -0.1595 |
| SD1 | -0.0396 | -0.1217 | -0.0860 | -0.0972 | -0.1070 |
| 720p | -0.0294 | -0.0816 | -0.0636 | -0.0696 | -0.0760 |
| 1080p | -0.0420 | -0.0746 | -0.05889 | -0.0637 | -0.0725 |
| **total** | **-0.0495** | **-0.1220** | **-0.0864** | **-0.0984** | **-0.1121** |

**Table 3.** DPSNR of CxScale, mDDT, [11], CrossHD and *mDDT'* when estimating in HD positions (with H.264 compensation).

Next, we consider the results regarding the efficiency of the techniques interpolating half-diagonal pixels, which are more computationally demanding than the interpolation of HH/HV pixels. We program the search procedure to examine only 4 HD candidates. Tables 2 and 3 present the resulting PSNR for the techniques of Table 1, plus four edge-detection based techniques: CrossHD, the proposed HD generation based on DDT (mDDT), its first alternative (*mDDT'*), and the technique of [11] using bilinear filtering at its last stage. We can mention here that, when compared to the HH/HV candidates, the HD candidates add slightly less quality to the algorithm, especially in low resolution videos (e.g., as reported in the NN results). Qualitatively, we draw conclusions similar to Table 1, verifying that the Bicubic filtering, especially the kernel with values -3, 19, 19, -3, prevails over the edge-detection based techniques. However, the latter show different behavior when compared to the HH/HV case. More precisely, we deduce that the HD part of CxScale employs an effective gradient check, which is combined with the Bicubic kernel to improve the quality of CxScale. Table 3 shows that it is the prevailing edge-detection based technique among those examined here. In cases where the filters use fewer taps, the CrossHD technique performs better than the DDT techniques.


We complete the evaluation by examining all 8 candidates, taking into account the examination of all pixels at HH, HV, and HD positions. For each technique, Table 4 reports the PSNR results and the time required (as a complexity measure) for generating 16×16 arbitrary half-pixels (averaging over HH, HV, and HD positions), as measured on a Core 2 x86-64 GPP architecture at 3 GHz. Furthermore, we combine distinct HH/HV techniques and HD techniques by adopting the prevailing edge-detection mechanisms given in Tables 1, 2 and 3 (in Table 4, "*A* ⊕ *B*" stands for "use technique A in HH/HV interpolation and technique B in HD"). Overall, Bicubic reduces the 6-tap filtering time by 33% and keeps the PSNR level as close as 0.02 dB to the maximum. DDT techniques reduce time by 65% (primarily due to the fast HD generation) at a cost of 0.1 dB. CxScale and [11] involve the time consuming gradient checks. However, the HD part of CxScale combined with DDT (for HH/HV) results in a hybrid technique featuring the best PSNR among the edge-detection based techniques with almost 40% time improvement.

| Interpolation technique | PSNR (dB) QCIF | PSNR (dB) SD1 | PSNR (dB) 1080p | Time (μsec) per MB |
|---|---|---|---|---|
| H.264/AVC | 35.4263 | 33.3687 | 33.1513 | 46.0 |
| Lanczos | -0.0032 | -0.0050 | -0.0076 | 46.0 |
| Bicubic, a=-0.75 | -0.0215 | -0.0177 | -0.0202 | 30.6 |
| DDT ⊕ mDDT | -0.3513 | -0.1782 | -0.1018 | 21.3 |
| DDT ⊕ mDDT' | -0.3341 | -0.1642 | -0.0980 | 16.0 |
| CxScale | -0.3801 | -0.1798 | -0.0904 | 45.6 |
| DDT ⊕ CrossHD | -0.3192 | -0.1618 | -0.0932 | 32.4 |
| DDT ⊕ CxSc(HD) | -0.3061 | -0.1302 | -0.0839 | 28.3 |
| DDT ⊕ [11](HD) | -0.2913 | -0.1492 | -0.0889 | 53.6 |

**Table 4.** Quality vs. Time when estimating in HH+HV+HD positions.

**Figure 2.** Comparison of objective quality for 5 distinct interpolation procedures. Objective quality is shown both for conventional H.264 and custom motion compensated prediction frames.

In Fig. 2 we show the results of the objective quality both for conventional H.264 and custom motion compensated prediction frames. Custom motion compensation utilizes the interpolation filter used by the estimation procedure, whereas conventional compensation uses the H.264 6-tap filter. Several videos of varying resolution were used (QCIF to 1080p). Moreover, Fig. 3 shows how the aforementioned techniques perform with respect to the execution time. Fig. 2 shows that the best results are achieved by the DDT (in computing HH and HV) with CrossHD (in computing HD). The fastest technique among all presented here is the DDT with CxScale, which also results in the best PSNR when it is used with the H.264 standard compensation.

**Figure 3.** Comparison of execution time for 5 distinct interpolation procedures. Custom motion compensation utilizes the interpolation filter used by the estimation procedure, whereas conventional compensation uses the H.264 6-tap filter. Several videos of varying resolution were used (QCIF to 1080p).


Figures 4-7 show interpolated images of the foreman cif sequence (352x288). We use four distinct interpolation methods at 4x in both directions to subjectively compare the quality of their results. In all four cases, the quarter pixels are calculated with a simple 2-tap bilinear (averaging) filter, which takes as input the two neighboring integer- or half-pixels (computed in a previous iteration by one of the four methods under evaluation).


**Figures 4-5.** Comparison of the H.264 filter (up) to the DDT ⊕ CxScale (down) on the "foreman" sequence. The example shows the two frames at their increased size (1408x1152) after interpolation from cif (352x288). DDT ⊕ CxScale (down) alleviates aliasing effects.


Figures 4 and 5 compare the 6-tap H.264 filter (up) to the combination of DDT and CxScale (down). Clearly, the latter produces much better images in terms of aliasing artifacts: the marquee indents on the wall look much sharper on the image below and the helmet is less jagged. Even though DDT ⊕ CxScale uses fewer taps, it achieves such aliasing reduction due to the employed edge detection mechanism. However, using a small number of taps and a large area as input to the proposed low-complexity comparison-based mechanism could obscure some finer details. Overall, DDT ⊕ CxScale improves the subjective quality of the enlarged image while using less execution time compared to the examined 6-tap filters. Figures 6 and 7 compare the combination of DDT ⊕ CrossHD (up) to the combination of DDT ⊕ [11] (down). Subjectively, the DDT ⊕ CrossHD method uses half the execution time of DDT ⊕ [11] to output images with very similar quality. Both methods reduce the aliasing artifacts compared to the examined 6-tap filters.

**Figures 6-7.** Comparison of the DDT ⊕ CrossHD filter (up) to the DDT ⊕ [11] (down). Frames are shown at their increased size (1408x1152) after interpolation from "foreman" cif (352x288). DDT ⊕ CrossHD produces very similar subjective quality results to DDT ⊕ [11] in considerably less execution time.

#### **4. Conclusion**


Aiming at a significant complexity reduction under negligible video quality degradation, this chapter proposed three novel interpolation techniques for use in the estimation process preceding the standard H.264/AVC motion compensation module of the encoder. Moreover, we evaluated their performance and compared their efficiency to three commonly used techniques. The results showed that the techniques using 4-tap Bicubic kernels constitute the most prominent substitute for the standard 6-tap filter. Further reduction of the estimation time was achieved via combinations of simple edge-detection based techniques. Future work includes parallelized implementations in VLSI/FPGA and cost-performance analysis.

#### **Author details**

Georgios Georgis, George Lentaris and Dionysios Reisis\*

\*Address all correspondence to: dreisis@phys.uoa.gr

Electronics Laboratory, Physics Department, National and Kapodistrian University of Athens (NKUA), Greece

#### **References**

[1] Huang, Yu-Wen, Hsieh, Bing-Yu, Chien, Shao-Yi, Ma, Shyh-Yih, & Chen, Liang-Gee. (2006). Analysis and Complexity Reduction of Multiple Reference Frames Motion Estimation in H.264/AVC. *IEEE Transactions on Circuits and Systems for Video Technology*, 16(4), 507-522, doi: 10.1109/TCSVT.2006.872783.

[2] Chen, Tung-Chien, Huang, Yu-Wen, & Chen, Liang-Gee. (2004). Fully utilized and reusable architecture for fractional motion estimation of H.264/AVC. *IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP)*, 9-12, doi: 10.1109/ICASSP.2004.1327034.

[3] ITU, Telecommunication Standardization Sector. (2010). Advanced Video Coding for Generic Audiovisual Services. *ITU-T*, 167-169, Mar.

[4] Gupta, P. S. S. B. K., & Korada, R. (2004). Novel algorithm to reduce the complexity of quarter-pixel motion estimation. *Proc. of Visual Communications and Image Processing*, 5308, 31-36, Jan, doi: 10.1117/12.532336.

[5] Hyun, C. J., & Sunwoo, M. H. (2009). Low Power Complexity-Reduced ME and Interpolation Algorithms for H.264/AVC. *J. of Signal Processing Systems*, 56(2), 285-293, Sept, doi: 10.1007/s11265-008-0224-4.

[6] Yang, Changqi, Goto, S., & Ikenaga, T. (2006). High performance VLSI architecture of fractional motion estimation in H.264 for HDTV. *IEEE Intl. Symposium on Circuits and Systems (ISCAS)*, September, doi: 10.1109/ISCAS.2006.1693157.

[7] Kao, Chao-Yang, Wu, Cheng-Long, & Lin, Youn-Long. (2010). A high performance three-engine architecture for H.264/AVC fractional motion estimation. *IEEE Trans. on Very Large Scale Integration Systems*, 18(4), 662-666, April, doi: 10.1109/ICME.2008.4607389.

[8] Dodgson, N. A. (1997). Quadratic Interpolation for Image Resampling. *IEEE Trans. on Image Processing*, 6(9), 1322-1326, Sept, doi: 10.1109/83.623195.

[9] Keys, R. G. (1981). Cubic Convolution Interpolation for Digital Image Processing. *IEEE Transactions on Acoustics, Speech and Signal Processing*, 29(6), 1153-1160, Dec, doi: 10.1109/TASSP.1981.1163711.

[10] Burger, W., & Burge, M. (2008). Digital Image Processing, an Algorithmic approach using Java. *1st ed. New York, USA: Springer*.

[11] Su, D., & Willis, P. (2004). Image Interpolation by Pixel-Level Data-Dependent Triangulation. *Computer Graphics Forum*, 23(2), 189-201, doi: 10.1111/j.1467-8659.2004.00752.x.

[12] Chen, Tung-Chien, Huang, Yu-Wen, & Chen, Liang-Gee. (2004). Analysis and design of macroblock pipelining for H.264/AVC VLSI architecture. *IEEE Intl. Symp. on Circuits and Systems (ISCAS)*, 273-276, doi: 10.1109/ISCAS.2004.1329261.

[13] Hyun, C. J., Kim, S. D., & Sunwoo, M. H. (2006). Efficient memory reuse and sub-pixel interpolation algorithms for ME/MC of H.264/AVC. *IEEE Workshop on Signal Processing Systems Design and Implementation*, 377-38, October, doi: 10.1109/SIPS.2006.352612.

[14] Song, Y., Ma, Y., Liu, Z., Ikenaga, T., & Goto, S. (2008). Hardware-oriented direction-based fast fractional motion estimation algorithm in H.264/AVC. *IEEE International Conference on Multimedia and Expo*, 1009-1012, June, doi: 10.1109/ICME.2008.4607608.

[15] Vatis, Y., & Ostermann, J. (2009). Adaptive Interpolation Filter for H.264/AVC. *IEEE Transactions on Circuits and Systems for Video Technology*, 19(2), 179-192, Feb., doi: 10.1109/TCSVT.2008.2009259.

[16] Hang, Hsueh-Ming, Peng, Wen-Hsiao, Chan, Chia-Hsin, & Chen, Chun-Chi. (2010). Towards the Next Video Standard: High Efficiency Video Coding. *Proceedings of the Second APSIPA Annual Summit and Conference*, 609-618, Biopolis, Singapore, 14-17 December.

[17] Rusanovskyy, Dmytro, Ugur, Kemal, Hallapuro, Antti, Lainema, Jani, & Gabbouj, Moncef. (2009). Video Coding With Low-Complexity Directional Adaptive Interpolation Filters. *IEEE Transactions on Circuits and Systems for Video Technology*, 19(8), August, doi: 10.1109/TCSVT.2009.2022708.

[18] Fuldseth, A., Bjontegaard, G., Rusanovskyy, D., Ugur, K., & Lainema, J. (2008). Low complexity directional interpolation filter. Berlin, Germany, ITU-T Q.6/SG16, VCEG-AI12, July.

[19] Zhang, Kai, Guo, Xun, An, Jicheng, Huang, Yu-Wen, Lei, S., & Gao, Wen. (2012). A Single-Pass-Based Localized Adaptive Interpolation Filter for Video Coding. *IEEE Transactions on Circuits and Systems for Video Technology*, 22(1), 43-55, Jan., doi: 10.1109/TCSVT.2011.2157194.

[20] Cho, Jaehyun, Lee, Dong-Bok, Shin, Jeong Cheol, & Song, Byung Cheol. (2011). Block-adaptive interpolation filter for sub-pixel motion compensation. *19th European Signal Processing Conference (EUSIPCO)*, 2156-2160.

[21] Georgis, G., Lentaris, G., & Reisis, D. (2012). Study of Interpolation Filters for Motion Estimation with Application in H.264/AVC Encoders. *IEEE Intl. Conference on Circuits and Systems (ICECS)*, Beirut, 9-12, doi: 10.1109/ICECS.2011.6122201.


**Chapter 7**


> © 2013 Katsigiannis et al.; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


### **A Real-Time Video Encoding Scheme Based on the Contourlet Transform**

Stamos Katsigiannis, Georgios Papaioannou and Dimitris Maroulis

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51735

#### **1. Introduction**

Real-time video communication over the internet and other heterogeneous IP networks has become a significant part of modern communications, underlining the need for highly efficient video coding algorithms. The most desirable characteristic of such an algorithm would be the ability to maintain satisfactory visual quality while achieving good compression. Additional advantageous characteristics would be low computational complexity and real-time performance, allowing the algorithm to be used in a wide variety of less powerful computers. Transmission of video over the network would benefit from the ability to adapt to the network's end-to-end bandwidth and transmitter/receiver resources, as well as from resistance to packet losses that might occur. Additionally, scalability and resistance to noise would be highly advantageous characteristics for a modern video compression algorithm. Most state-of-the-art video compression techniques like H.264, DivX/Xvid and MPEG2 fail to achieve real-time performance without the use of dedicated hardware due to their high computational complexity. Moreover, in order to achieve optimal compression and quality they depend on multipass statistical and structural analysis of the whole video content, which cannot happen in cases of live video stream generation, as in the case of video-conferencing.

In this chapter, a more elaborate analysis of a novel algorithm for high-quality real-time video encoding, originally proposed in [1], is presented. The algorithm is designed for content obtained from low resolution sources like web cameras, surveillance cameras, etc. Critical to the efficiency of video encoding algorithm design is the selection of a suitable image representation method. Texture representation methods proposed in the literature that utilize the Fourier transform, the Discrete Cosine transform, the Wavelet transform as well as other frequency domain methods have been extensively used for image and video encoding. Nevertheless, these methods have some limitations that have been partially addressed by the Contourlet Transform (CT) [2], which our video encoding algorithm is based on. The Contourlet Transform offers multiscale and directional decomposition, providing anisotropy and directionality, features missing from traditional transforms like the Discrete Wavelet Transform [2]. In recent years, the Contourlet Transform has been successfully utilised in a variety of texture analysis applications, including synthetic aperture radar (SAR) [3], medical and natural image classification [4], image denoising [5], despeckling of images, image compression, etc. By harnessing the computational power offered by modern graphics processing units (GPUs), a GPU-based contourlet transform is able to provide an image representation method with advantageous characteristics, while maintaining a fast performance.


The rest of this chapter is organised in four sections. First, some background knowledge and information needed for better understanding the algorithm is presented in section 2. Then, the aforementioned video encoding algorithm is presented in section 3, whereas an experimental study for the evaluation of the algorithm is provided in section 4. Conclusions and future perspectives of this work are presented in section 5.

#### **2. Background**

#### **2.1. The Contourlet Transform**

The Contourlet Transform (CT) is a directional multiscale image representation scheme proposed by Do and Vetterli, which is effective in representing smooth contours in different directions of an image, thus providing directionality and anisotropy [2]. The method utilizes a double filter bank in which, first, the Laplacian Pyramid (LP) [6] detects the point discontinuities of the image and then the Directional Filter Bank (DFB) [7] links those point discontinuities into linear structures. The LP provides a way to obtain multiscale decomposition. In each LP level, a downsampled lowpass version of the original image and a more detailed image with the supplementary high frequencies containing the point discontinuities are obtained. This scheme can be iterated continuously on the lowpass image and is restricted only by the size of the original image due to the downsampling. The DFB is a 2D directional filter bank that can achieve perfect reconstruction, which is an important characteristic for image and video encoding applications. The simplified DFB used for the contourlet transform consists of two stages and leads to 2^*l* subbands with wedge-shaped frequency partitioning [8], with *l* being the level of decomposition. The first stage of the DFB is a two-channel quincunx filter bank [9] with fan filters that divides the 2D spectrum into vertical and horizontal directions, while the second stage is a shearing operator that just reorders the samples. By adding a 45 degrees shearing operator and its inverse before and after a two-channel filter bank, a different directional frequency partition is obtained (diagonal directions), while maintaining the ability to perfectly reconstruct the original image, since the sampling locations coincide with the (integer) pixel grid.
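To make the LP step concrete, here is a minimal one-dimensional sketch (our own illustration, with a 3-tap binomial filter standing in for the Cohen and Daubechies 9-7 filters the algorithm actually uses):

```c
/* One 1-D Laplacian Pyramid level: produce a downsampled lowpass signal
 * and a full-resolution detail (bandpass) signal. The (1, 2, 1)/4 binomial
 * filter is a placeholder for the CDF 9-7 filters; `n` is assumed even and
 * edges are clamped. */
static void lp_level(const float *in, int n,
                     float *low /* n/2 samples */, float *detail /* n */)
{
    int i;

    /* Analysis: lowpass filter, then downsample by 2. */
    for (i = 0; i < n / 2; i++) {
        int c = 2 * i;
        int l = c > 0 ? c - 1 : 0, r = c < n - 1 ? c + 1 : n - 1;
        low[i] = 0.25f * in[l] + 0.5f * in[c] + 0.25f * in[r];
    }
    /* Synthesis: upsample the coarse signal (linear interpolation) and
     * keep the residual high frequencies as the detail signal. */
    for (i = 0; i < n; i++) {
        int j = i / 2;
        float pred = (i & 1) && (j + 1 < n / 2)
                         ? 0.5f * (low[j] + low[j + 1])
                         : low[j];
        detail[i] = in[i] - pred;
    }
    /* Iterating lp_level on `low` yields the multiscale decomposition;
     * the detail signals are what the Directional Filter Bank processes. */
}
```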

The combination of the LP and the DFB is a double filter bank named the Pyramidal Directional Filter Bank (PDFB). In order to capture the directional information, bandpass images from the LP decomposition are fed into a DFB. This scheme can be repeated on the coarser image levels. The combined result is the contourlet filter bank, a double iterated filter bank that decomposes images into directional subbands at multiple scales. The contourlet coefficients have a similarity with wavelet coefficients, since most of them are almost zero and only a few of them, located near the edges of objects, have large magnitudes [10]. In the presented algorithm, the Cohen and Daubechies 9-7 filters [11] have been utilized for the Laplacian Pyramid. For the Directional Filter Bank, these filters were mapped into their corresponding 2D filters using the McClellan transform, as proposed by Do and Vetterli in [2]. It must be noted that these filters are not considered optimal; the creation of optimal filters for the contourlet filter bank remains an open research topic. An outline of the Contourlet Transform is presented in Figure 1, while an example of decomposition is shown in Figure 2.

**Figure 1.** The Contourlet Filter Bank.


**Figure 2.** Example of contourlet transform decomposition of a greyscale image. Three levels of decomposition with the Laplacian Pyramid were applied, each then decomposed into four directional subbands using the Directional Filter Bank.
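The two-stage structure is straightforward to prototype. The following NumPy/SciPy sketch builds a Laplacian pyramid and then splits each bandpass image into directional subbands by masking angular wedges of the 2D spectrum. The wedge mask is only a crude stand-in for the quincunx/shearing DFB described above, and the Gaussian filter stands in for the 9-7 filters, so this illustrates the structure rather than the authors' implementation; the names `laplacian_pyramid` and `wedge_split` are ours.

```python
import numpy as np
from scipy import ndimage

def laplacian_pyramid(img, levels, sigma=1.0):
    """Split an image into a coarse approximation plus one bandpass image per level."""
    bands = []
    cur = np.asarray(img, dtype=np.float64)
    for _ in range(levels):
        low = ndimage.gaussian_filter(cur, sigma)        # lowpass (9-7 filters in the chapter)
        coarse = low[::2, ::2]                           # downsample the lowpass image
        up = np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
        bands.append(cur - up[:cur.shape[0], :cur.shape[1]])  # high frequencies / discontinuities
        cur = coarse
    return cur, bands

def wedge_split(band, n_dirs=4):
    """Crude directional split: partition the 2D spectrum into angular wedges."""
    f = np.fft.fftshift(np.fft.fft2(band))
    h, w = band.shape
    yy, xx = np.mgrid[0:h, 0:w]
    angle = np.mod(np.arctan2(yy - h // 2, xx - w // 2), np.pi)  # orientation in [0, pi)
    subbands = []
    for k in range(n_dirs):
        mask = (angle >= k * np.pi / n_dirs) & (angle < (k + 1) * np.pi / n_dirs)
        subbands.append(np.real(np.fft.ifft2(np.fft.ifftshift(f * mask))))
    return subbands  # the wedges tile [0, pi), so summing the subbands recovers `band`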

#### **2.2. GPU-based contourlet transform**

By analysing the structure of the contourlet transform, it is evident that its most computationally intensive part is the calculation of all the 2D convolutions needed for a complete decomposition or reconstruction. Calculating the convolutions on the CPU using the 2D convolution definition is not feasible for real-time applications, since performance suffers significantly due to the computational complexity. Utilizing the DFT or the FFT in order to achieve better performance provides significantly faster implementations, but still fails to achieve satisfactory real-time performance, especially on mobile platforms such as laptops and tablet PCs. The benefits of the FFT for the calculation of 2D convolution can only be fully exploited by an architecture supporting parallel computations. Modern personal computers are commonly equipped with powerful graphics processors (GPUs), which in the case of live video capture from web or surveillance cameras are underutilized. Intensive, repetitive computations that can be computed in parallel can be accelerated by harnessing this "dormant" computational power. General purpose computing on graphics processing units (GPGPU) is the set of techniques that use a GPU, which is otherwise specialized in handling computations for the display of computer graphics, in order to perform computations traditionally handled by a CPU. The highly parallel structure of GPUs makes them more effective than general-purpose CPUs for algorithms where processing of large blocks of data can be done in parallel.
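The gap between definition-based and FFT-based convolution is easy to verify with SciPy; the snippet below compares the two on a VGA-sized frame (an illustration, not the chapter's benchmark code):

```python
import numpy as np
from scipy import signal

frame = np.random.rand(480, 640).astype(np.float32)
kernel = np.random.rand(9, 9).astype(np.float32)

direct = signal.convolve2d(frame, kernel, mode="same")    # definition-based: O(H*W*K^2)
via_fft = signal.fftconvolve(frame, kernel, mode="same")  # FFT-based: O(H*W*log(H*W))

assert np.allclose(direct, via_fft, atol=1e-3)            # same result, very different cost
```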

For the GPU implementation of the contourlet transform, the NVIDIA Compute Unified Device Architecture (CUDA) has been selected due to the extensive capabilities and specialized API it offers. CUDA is a general purpose parallel computing architecture that allows the parallel compute engine in NVIDIA GPUs to be used in order to solve complex computational problems that are outside the scope of graphics algorithms. In order to compute the contourlet transform, first the image and the filters are transferred from the main memory to the dedicated GPU memory. Then, the contourlet transform of the image is calculated by keeping all the calculations on the GPU, in order to avoid unnecessary transfers to and from the main memory that would introduce delay to the computations. The required 2D convolutions are calculated by means of the FFT. After calculating the contourlet transform of the image, the output is transferred back to the main memory and the GPU memory is freed. Considering that this implementation will be used for video encoding, the filters are loaded once into the GPU memory, since they do not change from frame to frame. In order to evaluate the performance of this approach, various implementations of the contourlet transform were developed, both for the CPU and the GPU. These implementations were based on the FFT (frequency domain) and the 2D convolution definition (spatial domain). Besides the basic GPU implementation using spatial domain convolution, other out-of-core implementations were developed, based on the 2D convolution definition and utilizing memory management schemes in order to support larger frames when the GPU memory is not sufficient [12]. The GPU implementation based on the FFT outperformed all the aforementioned implementations in our tests and was therefore the method of choice for our video encoder.
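A minimal sketch of this pipeline is shown below using CuPy, a NumPy-like GPU array library (an assumption made for illustration; the authors used a custom CUDA implementation). The filter spectra are computed once and kept resident on the device. Note that multiplying spectra of signals zero-padded only to the frame size yields circular convolution, so a real implementation would pad further to avoid wrap-around:

```python
import numpy as np
import cupy as cp  # NumPy-like GPU arrays; an assumption, not the authors' CUDA code

FRAME_SHAPE = (480, 640)

# The filters do not change from frame to frame, so their padded FFTs are
# computed once and kept resident in GPU memory.
kernel = np.random.rand(9, 9).astype(np.float32)
kernel_f = cp.fft.rfft2(cp.asarray(kernel), s=FRAME_SHAPE)

def gpu_filter(frame):
    """One filtering step of the decomposition: upload, multiply spectra, download."""
    f = cp.fft.rfft2(cp.asarray(frame), s=FRAME_SHAPE)   # host -> device, forward FFT
    out = cp.fft.irfft2(f * kernel_f, s=FRAME_SHAPE)     # pointwise product (circular conv)
    return cp.asnumpy(out)                               # device -> host
```

In a full decomposition, the intermediate subbands would stay on the GPU and only the final coefficients would be copied back, mirroring the transfer-avoidance strategy described above.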

#### **2.3. The YCoCg colour space**

Inspired by recent work on real-time RGB frame buffer compression using chrominance subsampling based on the YCoCg colour transform [13], we investigated the use of these techniques in conjunction with the contourlet transform to efficiently encode colour video frames.

The human visual system is significantly more sensitive to variations of luminance than to variations of chrominance. Encoding the luminance channel of an image with higher accuracy than the chrominance channels therefore provides a simple, low-complexity compression scheme, while maintaining satisfactory visual quality. Various image and video compression algorithms take advantage of this fact in order to achieve increased efficiency. First introduced with H.264 compression, the RGB to YCoCg transform decomposes a colour image into luminance (Y), orange chrominance (Co) and green chrominance (Cg) components and has been shown to exhibit better decorrelation properties than YCbCr and similar transforms [14]. It was developed primarily to address some limitations of the different YCbCr colour spaces [15]. The transform and its reverse are calculated by the following equations:


$$Y = R/4 + G/2 + B/4 \tag{1}$$

$$Co = R/2 - B/2 \tag{2}$$

$$Cg = -R/4 + G/2 - B/4 \tag{3}$$

$$R = Y + Co - Cg \tag{4}$$

$$G = Y + Cg \tag{5}$$

$$B = Y - Co - Cg \tag{6}$$


| Image set | Number of images | Average PSNR (dB) |
|---|---|---|
| Kodak | 23 | 59.27 |
| Canon | 18 | 59.05 |
| Outdoor scene images | 963 | 58.87 |

**Table 1.** Average PSNR obtained for each image set after transforming from RGB to YCoCg and back using the same precision for the RGB and YCoCg components.

In order for the reverse transform to be perfect and to avoid rounding errors, the Co and Cg components should be stored with higher precision than the RGB components. Experiments using 23 images from the Kodak image set and 18 images from the Canon image set, all obtained from [16], as well as 963 outdoor scene images obtained from [17], showed that using the same precision for the YCoCg and RGB data when transforming from RGB to YCoCg and back results in an average PSNR of more than 58.87 dB for all the image sets, as shown in Table 1. This loss of quality cannot be perceived by the human visual system, resulting in no visible alteration of the image. Nevertheless, it indicates the highest quality attainable when the transform is used for image compression.
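Equations (1)-(6) and the precision experiment translate directly into NumPy. In this sketch, the `psnr` helper and the channel layout (last axis ordered R, G, B) are our own conventions:

```python
import numpy as np

def rgb_to_ycocg(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([r / 4 + g / 2 + b / 4,      # Y,  Eq. (1)
                     r / 2 - b / 2,              # Co, Eq. (2)
                     -r / 4 + g / 2 - b / 4],    # Cg, Eq. (3)
                    axis=-1)

def ycocg_to_rgb(ycocg):
    y, co, cg = ycocg[..., 0], ycocg[..., 1], ycocg[..., 2]
    return np.stack([y + co - cg, y + cg, y - co - cg], axis=-1)  # Eqs. (4)-(6)

def psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return np.inf if mse == 0 else 10 * np.log10(peak ** 2 / mse)

rgb = np.random.randint(0, 256, (64, 64, 3)).astype(np.float64)
print(psnr(rgb, ycocg_to_rgb(rgb_to_ycocg(rgb))))   # exact (inf) in floating point
# Storing YCoCg at the same (integer) precision as RGB introduces the small,
# imperceptible rounding loss behind Table 1:
print(psnr(rgb, np.round(ycocg_to_rgb(np.round(rgb_to_ycocg(rgb))))))
```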


#### **3. The presented algorithm**

Listings 1 and 2 depict the presented algorithm for encoding and decoding, respectively. Input frames are considered to be in the RGB format. The first step of the algorithm is the conversion from RGB to the YCoCg colour space, for further manipulation of the luminance and chrominance channels. The luminance channel is decomposed using the contourlet transform, while the chrominance channels are subsampled by a user-defined factor *N*. The levels and filters for the contourlet transform decomposition are also defined by the user. From the contourlet coefficients obtained by decomposing the luminance channel, only a user-specified percentage of the most significant ones are retained. Then, the precision allocated for storing the contourlet coefficients is reduced. All computations up to this stage are performed on the GPU, avoiding unnecessary memory transfers from the main memory to the GPU memory and vice versa. After reducing the precision of the retained contourlet coefficients of the luminance channel, the directional subbands are encoded using a run length encoding scheme that encodes only zero-valued elements. The long sequences of zero-valued contourlet coefficients that occur after the insignificant-coefficient truncation make run length encoding ideal for their encoding.


**The encoding algorithm**

1: Start
2: Input RGB frame
**3: Convert to YCoCg**
**4: Downsample Co and Cg by** *N*
**5: Decompose Y with the Contourlet Transform**
**6: Keep the M% most significant CT coefficients**
**7: Round the CT coefficients to the *n*-th decimal**
8: IF the frame is an internal frame
**9: Calculate the frame as the difference between the frame and the previous keyframe**
**10: Run-length encoding of Co, Cg and the lowpass CT component of Y**
11: END IF
12: Run-length encoding of the directional subbands of Y
13: Adjust precision of all components
14: IF frame is NOT the last frame
15: GOTO Start
16: END IF
17: Finish

**Listing 1.** Steps of the encoding algorithm. Highlighted steps refer to calculations performed on the GPU, while the other steps refer to calculations performed on the CPU.
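A compact NumPy sketch of Listing 1 follows. The contourlet transform is abstracted behind a `decompose` callable and the helper names are ours; the run-length coding and final precision adjustment (steps 10-13) are omitted here:

```python
import numpy as np

def encode_frame(rgb, decompose, prev_key=None, N=4, M=5.0, n=1):
    """One pass of Listing 1 (sketch). `decompose` stands in for the contourlet
    transform and returns a list of coefficient arrays."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y, co, cg = r/4 + g/2 + b/4, r/2 - b/2, -r/4 + g/2 - b/4   # step 3
    co, cg = co[::N, ::N], cg[::N, ::N]                        # step 4
    coeffs = []
    for c in decompose(y):                                     # step 5
        thr = np.percentile(np.abs(c), 100.0 - M)              # step 6: keep the M% largest
        coeffs.append(np.round(np.where(np.abs(c) >= thr, c, 0.0), n))  # step 7
    frame = {"coeffs": coeffs, "co": co, "cg": cg}
    if prev_key is not None:                                   # steps 8-11: internal frame
        frame = {"coeffs": [a - b for a, b in zip(coeffs, prev_key["coeffs"])],
                 "co": co - prev_key["co"], "cg": cg - prev_key["cg"]}
    return frame

# Demo with an identity "transform" so the sketch runs stand-alone:
rgb = np.random.rand(64, 64, 3)
key = encode_frame(rgb, decompose=lambda y: [y])
internal = encode_frame(rgb, decompose=lambda y: [y], prev_key=key)
```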

The algorithm divides the video frames into two categories: keyframes and internal frames. Keyframes are frames that are encoded using the steps described in the previous paragraph, while internal frames are the frames between two keyframes. The interval between two keyframes is a user-defined parameter. At the step before the run-length encoding, when a frame is identified as an internal frame, all its components are calculated as the difference between the respective components of the frame and those of the previous keyframe. This step is processed on the GPU, while all the remaining steps of the algorithm are performed on the CPU unless otherwise stated. Then, run length encoding is applied to the chromatic channels, the low frequency contourlet component of the luminance channel, as well as the directional subbands of the luminance channel. It must be noted that the steps executed on the CPU are inherently serial and cannot be efficiently mapped to a GPU.
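The zero-run scheme applied to the subband rows can be sketched in a few lines of plain Python. The token format is illustrative: non-zero values pass through verbatim and a run of k zeros becomes the pair (0, k):

```python
def rle_zeros_encode(row):
    """Run-length code only the zero runs, as in the chapter's scheme."""
    out, i = [], 0
    while i < len(row):
        if row[i] == 0:
            j = i
            while j < len(row) and row[j] == 0:
                j += 1
            out.append((0, j - i))   # a run of (j - i) zeros
            i = j
        else:
            out.append(row[i])       # non-zero values are emitted verbatim
            i += 1
    return out

def rle_zeros_decode(tokens):
    out = []
    for t in tokens:
        out.extend([0] * t[1] if isinstance(t, tuple) else [t])
    return out

row = [3, 0, 0, 0, 0, 0, 7, 0, 0, 1]
assert rle_zeros_decode(rle_zeros_encode(row)) == row
```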

The last stage of the algorithm consists of the selection of the optimal precision for each video component. The user can select between a lossless or lossy change of precision, directly affecting the output's visual quality.


**The decoding algorithm**

1: Start
2: Input encoded frame
3: IF the frame is a keyframe
4: Decode the run-length encoded directional subbands of Y
**5: Keep keyframe in memory and discard old keyframe**
6: ELSE IF the frame is an internal frame
7: Decode the run-length encoded Co, Cg, lowpass CT component of Y
8: Decode the run-length encoded directional subbands of Y
**9: Calculate the frame as the sum of the frame and the previous keyframe**
10: END IF
**11: Upsample Co and Cg by** *N*
**12: Reconstruct Y**
**13: Convert to RGB**
14: IF frame is NOT the last frame
15: GOTO Start
16: END IF
17: Finish

**Listing 2.** Steps of the decoding algorithm. Highlighted steps refer to calculations performed on the GPU, while the other steps refer to calculations performed on the CPU.

#### **3.1. Chrominance channel subsampling**


Exploiting the fact that the human visual system is relatively insensitive to chrominance variations, in order to achieve compression the chrominance channels Co and Cg are subsampled by a user-defined factor *N* that directly affects the output's visual quality and the compression achieved. The chrominance channels are stored at a lower resolution, thus providing compression. For the reconstruction of the chrominance channels at the decoding stage, the missing chrominance values are replaced with the nearest available subsampled chrominance values. This approach is simple and naïve, but has been selected due to the significantly smaller number of (costly) memory fetches and the minimal computation cost, compared to other methods like bilinear interpolation. Utilizing the nearest neighbour reconstruction approach can introduce artifacts in the form of mosaic patterns in regions with strong chrominance transitions, depending on the subsampling factor. In order to address this problem, given adequate computational resources, the receiver can choose to use the bilinear interpolation approach. Figure 3 shows an example of subsampling the Co and Cg chrominance channels by various factors, using the nearest neighbour and the bilinear interpolation approaches for reconstruction. Only a small, magnified part of the "baboon" image used is shown, for clarity. As demonstrated in Figure 3, subsampling by a factor of 2 or 4 does not have a drastic effect on visual quality. Further subsampling leads to visible artifacts, indicating the need for an optimal trade-off between quality and compression.
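A minimal NumPy sketch of the subsampling and the nearest neighbour reconstruction, with the bilinear option left as a comment (the function names are ours):

```python
import numpy as np

def subsample(chan, N):
    """Keep every N-th chrominance sample in each direction."""
    return chan[::N, ::N]

def reconstruct_nearest(sub, N, shape):
    """Replicate each stored sample over an N x N block: one memory fetch per
    output pixel and no arithmetic, at the cost of possible mosaic artifacts."""
    up = np.repeat(np.repeat(sub, N, axis=0), N, axis=1)
    return up[:shape[0], :shape[1]]

co = np.random.rand(480, 640)
co_rec = reconstruct_nearest(subsample(co, 4), 4, co.shape)

# Smoother but costlier alternative for capable receivers (assumes SciPy):
# from scipy import ndimage
# co_rec = ndimage.zoom(subsample(co, 4), 4, order=1)[:480, :640]  # bilinear
```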


**Figure 3.** Example of chroma subsampling by factor *N* of the Co and Cg channels of the "baboon" image. Row (a) depicts images reconstructed using the nearest neighbour method, while (b) those reconstructed using bilinear interpolation.

#### **3.2. Contourlet Transform decomposition of luminance channel and quality selection**

The luminance channel of the frame is decomposed using the contourlet transform. The decomposition levels, as well as the filters used, are user-defined and directly affect the quality of the output. Decomposition at multiple scales offers better compression while providing scalability, i.e. multiple resolutions inside the same video stream. This characteristic allows video coding algorithms to adapt to the network's end-to-end bandwidth and the transmitter/receiver resources. The quality for each receiver can be adjusted without re-encoding the video frames at the source, by just dropping the encoded information referring to a higher resolution than needed.

In order to achieve compression, after the decomposition of the luminance channel with the contourlet transform, a user-defined amount of the contourlet coefficients from the directional subbands is dropped by keeping only the most significant coefficients. The amount of coefficients dropped drastically affects the output's visual quality as well as the compression ratio. Contourlet coefficients with large magnitudes are considered more significant than coefficients with smaller magnitudes. Exploiting this fact, a common method for selecting the most significant contourlet coefficients is to keep the *M* most significant coefficients, or a respective percentage, while dropping all the others [2] (coefficient truncation). This procedure leads to a large number of zero-valued sequences inside the elements of the directional subbands, a fact exploited by using run length encoding in order to achieve even higher compression. Considering the values and the distribution of the contourlet coefficients in the directional subbands, only the zero-valued coefficients are run length encoded, along the horizontal direction. The compression gained by run length encoding all the different values is minimal and does not justify the increased computational cost. It is worth mentioning that dropping all the contourlet coefficients is similar to lowering the luminance channel's resolution while applying a lowpass filter and then, at the decoding stage, upscaling it without reincorporating the high frequency content.
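Coefficient truncation reduces to zeroing all but the largest-magnitude entries of each subband. A NumPy sketch keeping exactly m coefficients (converting a percentage to a count is left to the caller; the function name is ours):

```python
import numpy as np

def keep_m_largest(band, m):
    """Zero everything except the m largest-magnitude coefficients of a subband."""
    flat = band.ravel()
    if m >= flat.size:
        return band.copy()
    idx = np.argpartition(np.abs(flat), -m)[-m:]   # indices of the m largest magnitudes
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(band.shape)

band = np.random.randn(32, 32)
sparse = keep_m_largest(band, int(0.05 * band.size))  # retain 5% of the coefficients
```

The long zero runs this produces are exactly what makes the zero-only run length encoding of the previous section effective.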


**Figure 4.** Example of smoothing due to the dropping of contourlet coefficients. The caption indicates the percentage of the contourlet coefficients retained. Images are cropped and scaled to 200% of their original size.

Keeping only the most significant contourlet coefficients also provides a means to suppress the noise induced by the low-quality sensors usually encountered in web cameras. Random noise is largely unstructured and therefore not likely to generate significant contourlet coefficients [2]. As a result, keeping only the most significant contourlet coefficients provides enhanced visual quality, which is a highly desirable characteristic since no additional filtering of the video stream is required in order to reduce the noise level. In Figure 4, an example of smoothing due to the dropping of contourlet coefficients is shown. Mosaicing artifacts and noise introduced due to the low quality of the web camera's sensor are suppressed and replaced by a fuzzier texture, resulting in a smoother and more perceptually acceptable image.


At the contourlet transform decomposition stage, 32-bit single precision floating point elements are used in order to avoid rounding errors and precision loss. Experiments with the precision allocated for the contourlet coefficients showed that the contourlet transform exhibits resistance to quality loss due to arithmetic precision reduction. This fact is exploited in order to achieve better compression, by reducing the precision of the contourlet coefficients through rounding to a specific decimal point. Visual quality is not affected at all when one decimal or more is kept. Rounding to the integer provides a PSNR of more than 60 dB when only the directional subbands' coefficients are rounded. Additionally, also rounding the lowpass content provides a PSNR of more than 55 dB. In both cases, the loss of quality cannot be perceived by the human visual system and is considered insignificant. For these experiments, the images were first transformed into the YCoCg colour space. Then the luminance channel was decomposed using the contourlet transform and the contourlet coefficients were rounded. No alteration was done to the chrominance channels. After the manipulation of the contourlet coefficients, the luminance channel was reconstructed and the image was transformed back into the RGB colour space.
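The rounding step itself is a one-liner per subband. In the sketch below, the assumption that `coeffs[0]` holds the lowpass band is our own layout convention:

```python
import numpy as np

def round_coeffs(coeffs, n=0, include_lowpass=False):
    """Round directional subbands to n decimals; optionally also round the
    lowpass band (coeffs[0] by our convention), trading PSNR for compression."""
    head = np.round(coeffs[0], n) if include_lowpass else coeffs[0]
    return [head] + [np.round(c, n) for c in coeffs[1:]]
```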

#### **3.3. Frame types**

As mentioned before, frames are divided into keyframes and internal frames, with an internal frame being the difference between the current frame and the respective keyframe. Consecutive frames tend to have small variations, with many identical regions. This fact can be exploited by calculating the difference between a frame and the keyframe. This procedure provides components with large sequences of zero values, leading to improved compression through the run length encoding stage. Especially in the case of video-conferencing or surveillance video, the background tends to be static, with slight or no variations at all. The static background leads to many parts of consecutive frames being identical. As a result, calculating the difference of each frame from its respective keyframe provides large sequences of zero values, leading to improved compression when run length encoding is applied. Run length encoding of the difference of contourlet-transformed images is even more efficient, since static noise is drastically suppressed by the coefficient truncation. Experiments showed that optimal compression is achieved for a relatively small interval between keyframes, in the region of 5-7 internal frames, providing small groups of pictures (GOP) that depend on a keyframe. This characteristic makes the algorithm more resistant to packet losses when transmitting over a network. In the case of a scene change, consecutive frames differ drastically from each other and the compression achieved for the internal frames until the next keyframe is similar to that of a keyframe. If this scenario occurs, having small intervals between consecutive keyframes reduces the number of non-optimally encoded frames. Nevertheless, in cases where the video is expected to be mostly static, like surveillance video for example, a larger interval between keyframes will provide considerably better compression.

#### **3.4. Other supported colour spaces**


Besides the YCoCg colour space, the algorithm supports the YCbCr and Greyscale colour spaces without the need to alter its core functionality. The process of encoding greyscale videos consists of handling the video as a colour video with only the luminance channel. All the steps of the algorithm are performed except for those referring to the chromatic channels. On the other hand, due to the similarity of the YCbCr colour space to the YCoCg colour space, the algorithm remains the same. The only difference is the RGB-to-YCbCr conversion at the encoder and the YCbCr-to-RGB conversion at the decoder. The luminance channel is handled identically as in the YCoCg-based algorithm, and the same holds for the replacement of the CoCg channels with the CbCr ones. Nevertheless, the CbCr channels have a different range of values compared to the CoCg channels. As a consequence, the optimal precision for the CbCr channels differs from that of the CoCg channels and has to be taken into consideration.

#### **4. Quality and performance analysis**

For evaluating the presented algorithm, two videos were captured using a VGA web camera that supported a maximum resolution of 640x480 pixels. Low resolution web cameras are very common on everyday personal computer systems, showcasing the need to design video encoding algorithms that take into consideration the problems arising from low-quality sensors. The videos captured were a typical video-conference sequence with a static background, showing the upper part of the human body and containing some motion, and a surveillance video with almost no motion, depicting the entrance of a building.

The captured videos were encoded using the YCoCg, YCbCr and Greyscale colour spaces. The chrominance channels of the colour videos were subsampled by a factor of 4, and the video stream contained two resolutions: the original VGA (640x480) as well as the lower QVGA (320x240). The method utilized for the reconstruction of the chrominance channels was the nearest neighbour method. The percentage of the most significant contourlet coefficients of the luminance channel retained was adjusted for each encoded video, providing results at various quality and compression levels. Furthermore, at each scale, the luminance channel's high frequency content was decomposed into four directional subbands. In order to test the algorithm using the YCbCr colour space, the RGB to YCbCr conversion formula for SDTV found in [18] was utilised. For the Greyscale colour space, the aforementioned videos were converted from RGB to greyscale using the standard NTSC conversion formula [18] that is used for calculating the effective luminance of a pixel:

$$Y(i,j) = 0.2989 \cdot R(i,j) + 0.5870 \cdot G(i,j) + 0.1140 \cdot B(i,j) \tag{7}$$
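As a sketch, Eq. (7) applied to a whole frame at once (assuming an HxWx3 array with the last axis ordered R, G, B):

```python
import numpy as np

def effective_luminance(rgb):
    """Eq. (7): NTSC greyscale conversion of an HxWx3 RGB array."""
    return 0.2989 * rgb[..., 0] + 0.5870 * rgb[..., 1] + 0.1140 * rgb[..., 2]
```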


The sample videos were encoded using a variety of parameters. The mean PSNR value for each video was calculated for a set of different percentages of retained contourlet coefficients. The compression ratios achieved when using the scheme that incorporates both keyframes and internal frames, and when compressing all the frames as keyframes, were also calculated. The interval between the keyframes was set to five frames for the video-conference sample video and to twenty frames for the surveillance video. Detailed results are shown in Tables 2, 3 and 4, while sample frames of the encoded videos utilizing the YCoCg, YCbCr and Greyscale colour spaces for a set of settings are shown in Figures 5-10.

Examining the compression ratios achieved, it is shown that utilizing the keyframe and internal frame scheme outperforms the naive method of encoding all the frames the same way, as expected. However, the selection of an efficient entropy encoding algorithm that will further enhance the compression ability of our algorithm is still an open issue. Another interesting observation is that the contourlet transform exhibits substantial resistance to the loss of contourlet coefficients. Even when only 5% of the original coefficients are retained, the visual quality of the image is not seriously affected. This fact underlines the efficiency of the contourlet transform in approximating natural images using a small number of descriptors and justifies its utilization in this algorithm. The slightly lower PSNR achieved for the surveillance video sample can be explained by the higher complexity of the scene compared to the video conference sample. More complex scenes contain higher frequency content, a portion of which is then discarded by dropping the contourlet coefficients.


**(a) Video conference sample**

| Contourlet coefficients retained (%) | YCoCg | YCbCr | Greyscale |
|---|---|---|---|
| 10 | 45.11 | 44.77 | 52.04 |
| 5 | 44.53 | 44.29 | 49.71 |
| 3 | 43.88 | 43.70 | 47.72 |
| 1 | 42.30 | 42.23 | 44.28 |
| 0.5 | 41.62 | 41.56 | 43.10 |
| 0.2 | 41.30 | 41.25 | 42.60 |
| 0 | 39.15 | 39.13 | 39.82 |

**(b) Video surveillance sample**

| Contourlet coefficients retained (%) | YCoCg | YCbCr | Greyscale |
|---|---|---|---|
| 10 | 44.18 | 44.03 | 50.11 |
| 5 | 43.54 | 43.45 | 47.88 |
| 3 | 42.96 | 42.89 | 46.33 |
| 1 | 41.57 | 41.50 | 43.45 |
| 0.5 | 40.80 | 40.76 | 42.17 |
| 0.2 | 40.17 | 40.14 | 41.21 |
| 0 | 39.59 | 39.55 | 40.39 |

**Table 2.** PSNRs (in dB) achieved for the (a) video conference and (b) video surveillance samples, retaining various percentages of contourlet coefficients and utilizing the YCoCg, YCbCr and Greyscale colour spaces.

| Contourlet coefficients retained (%) | YCoCg (only keyframes) | YCoCg (keyframes & internal frames) | YCbCr (only keyframes) | YCbCr (keyframes & internal frames) | Greyscale (only keyframes) | Greyscale (keyframes & internal frames) |
|---|---|---|---|---|---|---|
| 10 | 4.96:1 | 11.06:1 | 4.93:1 | 12.05:1 | 2.09:1 | 3.49:1 |
| 5 | 6.44:1 | 14.39:1 | 6.44:1 | 16.31:1 | 2.94:1 | 4.44:1 |
| 3 | 7.36:1 | 16.39:1 | 7.37:1 | 18.98:1 | 3.56:1 | 5.02:1 |
| 1 | 8.71:1 | 19.46:1 | 8.70:1 | 23.15:1 | 4.57:1 | 5.85:1 |
| 0.5 | 9.07:1 | 20.24:1 | 9.06:1 | 24.33:1 | 4.87:1 | 6.07:1 |
| 0.2 | 9.22:1 | 20.62:1 | 9.21:1 | 24.81:1 | 5.00:1 | 6.17:1 |
| 0 | 11.71:1 | 25.84:1 | 11.71:1 | 32.89:1 | 7.65:1 | 7.53:1 |

**Table 3.** Compression ratios achieved for the video conference sample, retaining various percentages of contourlet coefficients and utilizing the YCoCg, YCbCr and Greyscale colour spaces.

| Contourlet coefficients retained (%) | YCoCg (only keyframes) | YCoCg (keyframes & internal frames) | YCbCr (only keyframes) | YCbCr (keyframes & internal frames) | Greyscale (only keyframes) | Greyscale (keyframes & internal frames) |
|---|---|---|---|---|---|---|
| 10 | 4.35:1 | 21.55:1 | 4.35:1 | 22.73:1 | 1.78:1 | 7.26:1 |
| 5 | 5.89:1 | 28.74:1 | 5.92:1 | 31.06:1 | 2.63:1 | 9.68:1 |
| 3 | 6.85:1 | 32.89:1 | 6.89:1 | 35.97:1 | 3.23:1 | 11.09:1 |
| 1 | 8.14:1 | 38.02:1 | 8.18:1 | 42.19:1 | 4.15:1 | 12.79:1 |
| 0.5 | 8.58:1 | 39.53:1 | 8.61:1 | 44.05:1 | 4.48:1 | 13.30:1 |
| 0.2 | 8.89:1 | 40.65:1 | 8.92:1 | 45.45:1 | 4.74:1 | 13.70:1 |
| 0 | 11.71:1 | 49.26:1 | 11.71:1 | 56.50:1 | 7.65:1 | 16.58:1 |

**Table 4.** Compression ratios achieved for the video surveillance sample, retaining various percentages of contourlet coefficients and utilizing the YCoCg, YCbCr and Greyscale colour spaces.


**Figure 5.** Sample frame of the encoded video-conference video for each setting using the *YCoCg* colour space. The frame has been resized and cropped to fit the figure.


**Figure 6.** Sample frame of the encoded video-conference video for each setting using the *YCbCr* colour space. The frame has been resized and cropped to fit the figure.


**Figure 7.** Sample frame of the encoded video-conference video for each setting using the *Greyscale* colour space. The frame has been resized and cropped to fit the figure.

**Figure 8.** Sample frame of the encoded video surveillance video for each setting using the *YCoCg* colour space. The frame has been resized and cropped to fit the figure.


**Figure 9.** Sample frame of the encoded video surveillance video for each setting using the *YCbCr* colour space. The frame has been resized and cropped to fit the figure.

**Figure 10.** Sample frame of the encoded video surveillance video for each setting using the *Greyscale* colour space. The frame has been resized and cropped to fit the figure.


Considering the YCoCg and the YCbCr colour spaces, for the two video samples tested, it is shown in Tables 2-4 and Figures 11 and 12 that the YCoCg colour space achieves slightly better visual quality (higher PSNR), while the YCbCr colour space provides better compression (a higher compression ratio). The Greyscale examples cannot be directly compared to the colour samples, since the calculated PSNR compares the original and encoded greyscale samples. Nevertheless, it is clear that in the case of the Greyscale colour space, compression suffers greatly compared to the other colour spaces.


**Figure 11.** Compression ratio vs. percentage of contourlet coefficients retained, for the video conference sample, utilizing the YCoCg, YCbCr, and Greyscale colour spaces. K refers to using only keyframes, while K&I refers to using both keyframes and internal frames.

**Figure 12.** Compression ratio vs. percentage of contourlet coefficients retained, for the video surveillance sample, utilizing the YCoCg, YCbCr, and Greyscale colour spaces. K refers to using only keyframes, while K&I refers to using both keyframes and internal frames.

Average execution times for the basic operations of the encoding and decoding algorithm for a frame of the video conference sample are presented in Table 5. Parameters were kept the same as in the previous examples, and the computer utilised for the performance tests was equipped with an Intel Core i3 CPU, 4 GB of memory, and an NVIDIA GeForce 430 graphics card with 1 GB of memory.


| Operation | Time (ms) |
|---|---|
| Transfer of RGB frame to GPU memory | 1.385 |
| Transfer of encoded frame to main memory | 1.050 |
| Conversion from RGB to YCoCg | 1.067 |
| Conversion from YCoCg to RGB | 0.402 |
| Contourlet transform decomposition | 59.040 |
| Contourlet transform reconstruction | 57.102 |
| Run-length encoding of directional subbands | 2.424 |
| Run-length decoding of directional subbands | 7.008 |
| Contourlet coefficients dropping | 0.492 |

**Table 5.** Average execution times (in milliseconds) for the basic operations of the algorithm for a 640x480 video frame. The chrominance channels were subsampled by a factor of 4 and the video stream contained the original VGA (640x480) as well as the lower QVGA (320x240) resolution.

#### **5. Conclusions**


In this chapter, a low complexity algorithm for real-time video encoding based on the contourlet transform, optimized for video conferencing applications and surveillance cameras, has been presented and evaluated. The algorithm provides a scalable video compression scheme ideal for video conferencing content, as it achieves high quality encoding and increased compression efficiency for static regions of the image, while maintaining low complexity and adaptability to the receiver's resources. A video stream can contain various resolutions, avoiding the need for re-encoding at the source. The receiver can select the desired quality by dropping the components referring to higher quality than needed.

Furthermore, the algorithm has the inherent ability to suppress the noise induced by low-quality sensors, without the need for an extra denoising or image enhancement stage, due to the manipulation of the structural characteristics of the video through the rejection of insignificant contourlet transform coefficients. In the case of long recordings for surveillance systems, where higher compression is needed, the visual quality degradation is much more eye-friendly than with other well-established video compression methods, as it introduces fuzziness and blurring instead of artificial block artifacts, providing smoother images and facilitating image rectification/recognition procedures. Additionally, due to the relatively small GOPs, the algorithm is more resistant to frame losses that can occur during transmission over IP networks.

Another advantageous characteristic of the presented algorithm is that its most computationally intensive parts are calculated on the GPU. The utilization of the usually "dormant" GPU computational power allows the CPU to be used for other tasks, further enhancing the multitasking capacity of the system and enabling users to take full advantage of their computational capabilities. The experimental evaluation of the presented algorithm provided promising results. Nevertheless, in order to compete in compression efficiency with state-of-the-art video compression algorithms, a highly efficient entropy encoding scheme has to be incorporated into the algorithm. Modern entropy encoding methods tend to be complex and computationally intensive. As a result, the optimal trade-off between compression rate and complexity has to be found in order to retain the low complexity and real-time characteristics of our algorithm.


#### **Acknowledgements**

The authors would like to thank Pavlos Mavridis for his fruitful advice concerning the YCoCg colour space.

#### **Author details**

Stamos Katsigiannis1\*, Georgios Papaioannou2 and Dimitris Maroulis1

\*Address all correspondence to: stamos@di.uoa.gr

1 Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens, Greece

2 Department of Informatics, Athens University of Economics and Business, Athens, Greece

#### **References**


[1] Katsigiannis, S., Papaioannou, G., & Maroulis, D. (2012). A contourlet transform based algorithm for real-time video encoding. *Proceedings of SPIE*, 8437, 843704, doi: 10.1117/12.924327.

[2] Do, M. N., & Vetterli, M. (2005). The contourlet transform: an efficient directional multiresolution image representation. *IEEE Transactions on Image Processing*, 14(12), 2091-2106, doi: 10.1109/TIP.2005.859376.

[3] Liu, Z. (2008). Minimum Distance Texture Classification of SAR Images in Contourlet Domain. *Proceedings of the 2008 International Conference on Computer Science and Software Engineering, CSSE*.

[4] Katsigiannis, S., Keramidas, E., & Maroulis, D. (2010). A Contourlet Transform Feature Extraction Scheme for Ultrasound Thyroid Texture Classification. *Engineering Intelligent Systems*, 18, 3/4.

[5] Liu, Z., & Xu, H. (2010). Image denoising using Contourlet and two-dimensional Principle Component Analysis. *Proceedings of the 2010 International Conference on Image Analysis and Signal Processing, IASP*, Xi'an, China, 309-313, doi: 10.1109/IASP.2010.5476106.

[6] Burt, P. J., & Adelson, E. H. (1983). The Laplacian Pyramid as a compact image code. *IEEE Transactions on Communications*, 31(4), 532-540, doi: 10.1109/TCOM.1983.1095851.

[7] Bamberger, R. H., & Smith, M. J. T. (1992). A filter bank for the directional decomposition of images: Theory and design. *IEEE Transactions on Signal Processing*, 40(4), 882-893.

[8] Shapiro, J. M. (1993). Embedded image coding using zerotrees of wavelet coefficients. *IEEE Transactions on Signal Processing*, 41(12), 3445-3462, doi: 10.1109/78.258085.

[9] Vetterli, M. (1984). Multidimensional subband coding: Some theory and algorithms. *Signal Processing*, 6(2), 97-112, doi: 10.1016/0165-1684(84)90012-4.

[10] Yifan, Z., & Liangzheng, X. (2008). Contourlet-based feature extraction on texture images. *Proceedings of the 2008 International Conference on Computer Science and Software Engineering, CSSE*, 221-224.

[11] Cohen, A., Daubechies, I., & Feauveau, J. C. (1992). Biorthogonal bases of compactly supported wavelets. *Communications on Pure and Applied Mathematics*, 45(5), 485-560, doi: 10.1002/cpa.3160450502.

[12] Katsigiannis, S. (2011). Acceleration of the Contourlet Transform. *M.Sc. thesis*, Athens University of Economics and Business.

[13] Mavridis, P., & Papaioannou, G. (2013). The Compact YCoCg Frame Buffer. *GPU Pro 4*, CRC Press.

[14] Malvar, H., & Sullivan, G. (2003). YCoCg-R: A Color Space with RGB Reversibility and Low Dynamic Range. *Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG*, Document No. JVTI014r3.

[15] Van Rijsselbergen, D. (2005). YCoCg(-R) Color space conversion on the GPU. *6th FirW PhD Symposium*, Faculty of Engineering, Ghent University, paper no. 102.

[16] *Center for Image Processing Research (CIPR), Rensselaer Polytechnic Institute*, http://www.cipr.rpi.edu/resource/stills/index.html, accessed 1 July 2012.

[17] *Object and Concept Recognition for Content-Based Image Retrieval Groundtruth Database*, University of Washington, http://www.cs.washington.edu/research/imagedatabase/groundtruth/, accessed 1 July 2012.

[18] International Telecommunication Union. (2011). Studio encoding parameters of digital television for standard 4:3 and wide screen 16:9 aspect ratios. *Recommendation BT.601-7 (03/11)*.

**Chapter 8**

**Algorithms for Efficient Computation of Convolution**

Pavel Karas and David Svoboda

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51942

> © 2013 Karas and Svoboda; licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **1. Introduction**

Convolution is an important mathematical tool in both the signal and the image processing fields. It is employed in filtering [1, 2], denoising [3], edge detection [4, 5], correlation [6], compression [7, 8], deconvolution [9, 10], simulation [11, 12], and in many other applications. Although the concept of convolution is not new, its efficient computation is still an open topic. As the amount of processed data constantly increases, there is considerable demand for fast manipulation of huge data sets. Moreover, there is demand for fast algorithms which can exploit the computational power of modern parallel architectures.

The basic convolution algorithm evaluates the inner product of a flipped kernel and a neighbourhood of each individual sample of an input signal. Although the time complexity of the algorithms based on this approach is quadratic, i.e. *O*(*N*<sup>2</sup>) [13, 14], the practical implementation is very slow. This is true especially for higher-dimensional tasks, where each new dimension worsens the complexity by increasing the degree of the polynomial, i.e. *O*(*N*<sup>2*k*</sup>). Thanks to their simplicity, naïve algorithms are popular on parallel architectures [15–17], yet their use is generally limited to small kernel sizes. Under some circumstances, the convolution can be computed faster than indicated above.

In the case the higher dimensional convolution kernel is *separable* [18, 19], it can be decomposed into several lower dimensional kernels. In this sense, a 2-D separable kernel can be split into two 1-D kernels, for example. Due to the associativity of convolution, the input signal can be convolved step by step: first with one 1-D kernel, then with the second. The result equals the convolution of the input signal with the original 2-D kernel. Gaussian, Difference of Gaussian, and Sobel kernels are representatives of the separable kernels commonly used in signal and image processing. With respect to time complexity, this approach keeps the higher dimensional convolution a polynomial of lower degree, i.e. *O*(*kN*<sup>*k*+1</sup>).


On the other hand, there is a nontrivial group of algorithms that use general kernels. For example, deconvolution or template matching algorithms based on correlation methods typically use kernels which cannot be characterized by special properties like separability. In this case, other convolution methods have to be used.


There also exist algorithms that can perform convolution in *O*(*N*) time. In this concept, the repetitive application of the convolution kernel is reduced due to the fact that neighbouring positions overlap. Hence, the convolution at each individual sample is obtained as a weighted sum of both input samples and previously computed output samples. The design of so-called *recursive filters* [18] allows them to be implemented efficiently on streaming architectures such as FPGAs. Mostly, recursive filters are not designed from scratch. Rather, well-known 1-D filters (Gaussian, Difference of Gaussian, . . . ) are converted into their recursive form. The extension to higher dimensions is straightforward due to their separability. This method also has its drawbacks: the conversion of a general convolution kernel into its recursive version is a nontrivial task. Moreover, recursive filtering often suffers from inaccuracy and instability [2].

While the convolution in the time domain performs an inner product at each sample, in the Fourier domain [20] it can be computed as a simple point-wise multiplication. Due to this convolution property and the fast Fourier transform, the convolution can be performed in *O*(*N* log *N*) time. This approach is known as *fast convolution* [1]. The main advantage of this method stems from the fact that no restrictions are imposed on the kernel. On the other hand, the excessive memory requirements make this approach not very popular. Fortunately, there exists a workaround: if a direct computation of the fast convolution of larger signals or images is not realizable on common computers, one can reduce the whole problem to several subtasks. In practice, this leads to splitting the signal and kernel into smaller pieces. The signal and kernel decomposition can be performed in two ways:

• Data can be decomposed in the Fourier domain using the so-called decimation-in-frequency (DIF) algorithm [1, 21]. The division of a signal and a kernel into smaller parts also offers a straightforward way of parallelizing the whole task.

• Data can be split in the time domain according to the overlap-save and overlap-add schemes [22, 23], respectively. Combining these two schemes with fast convolution, one obtains a quasi-optimal solution that can be efficiently computed on any computer. Again, the solution naturally leads to a possible parallelization (a sketch of the overlap-add scheme follows this list).
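To make the second scheme concrete, the following C sketch (ours, not taken from the cited works) applies the overlap-add idea with a naïve per-block convolution; in practice each block would typically be convolved via the FFT instead:

```c
#include <stddef.h>
#include <string.h>

/* Overlap-add: convolve f with g block by block and accumulate the
 * overlapping tails. Each block is independent, so blocks can be
 * distributed across processing units; any per-block convolution
 * method may be used (here a naive one, for clarity). */
void conv1d_overlap_add(const double *f, size_t nf,
                        const double *g, size_t ng,
                        double *h, size_t block)
{
    memset(h, 0, sizeof(double) * (nf + ng - 1));
    for (size_t start = 0; start < nf; start += block) {
        size_t len = (start + block <= nf) ? block : nf - start;
        /* The block's partial result is len + ng - 1 samples long and
         * overlaps the next block's result by ng - 1 samples. */
        for (size_t n = 0; n < len + ng - 1; ++n) {
            double acc = 0.0;
            for (size_t i = 0; i < ng; ++i)
                if (n >= i && n - i < len)
                    acc += f[start + n - i] * g[i];
            h[start + n] += acc;
        }
    }
}
```

By linearity of convolution, summing the overlapping partial results reproduces the full-length output exactly.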


The aim of this chapter is to review the algorithms and approaches for the computation of convolution with regard to various properties such as signal and kernel size or kernel separability (when processing *k*-dimensional signals). Target architectures include superscalar and parallel processing units (namely CPU, DSP, and GPU), programmable architectures (e.g. FPGA), and distributed systems (such as grids). The structure of the chapter is designed to cover various applications with respect to the signal size, from small to large scales.

In the first part, the state-of-the-art algorithms are revised, namely (i) the naïve approach, (ii) convolution with a separable kernel, (iii) recursive filtering, and (iv) convolution in the frequency domain. In the second part, convolution decomposition in both the spatial and the frequency domain, and its implementation on a parallel architecture, are described.


#### **1.1. Shortcuts and symbols**


In the following list you will find the most commonly used symbols in this chapter. We recommend going through it first to avoid misunderstandings while reading the text.

• ∗ . . . symbol for convolution
• e . . . Euler number (e ≈ 2.718)
• *j* . . . complex unit (*j*<sup>2</sup> = −1)
• *f*, *g* . . . input signal and convolution kernel, respectively
• *h* . . . convolved signal
• F[·], F<sup>−1</sup>[·] . . . Fourier transform and inverse Fourier transform of a signal, respectively
• *F*, *G* . . . Fourier transforms of input signal *f* and convolution kernel *g*, respectively
• *N<sub>f</sub>*, *N<sub>g</sub>* . . . length of input signal and convolution kernel, respectively (number of samples)
• *n*, *k* . . . index of a signal in the spatial and the frequency domain, respectively
• *n*′, *k*′ . . . index of a signal of half length in the spatial and the frequency domain, respectively
• *W<sub>k</sub><sup>i</sup>*, *W<sub>k</sub><sup>−i</sup>* . . . *k*-th sample of the *i*-th Fourier transform base function and inverse Fourier transform base function, respectively
• *P* . . . number of processing units in use
• Φ . . . computational complexity function
• ||*s*|| . . . number of samples of a discrete signal (sequence) *s*
• *z*<sup>∗</sup> . . . complex conjugate of complex number *z*


#### **2. Naïve approach**

First of all, let us recall the basic definition of convolution:

$$h(t) = (f \ast g)(t) = \int\_{-\infty}^{\infty} f(t - \tau)g(\tau)d\tau. \tag{1}$$

Since Eq. (1) is used mainly in fields of research other than image and signal processing, we will focus on the alternative definition that the reader is likely to be more familiar with—discrete signals:

$$h(n) = (f \* g)(n) = \sum\_{i = -\infty}^{\infty} f(n - i)g(i). \tag{2}$$

The basic (or *naïve*) approach visits the individual time samples *n* in the input signal *f*. At each position, it computes the inner product of the current sample's neighbourhood and the flipped kernel *g*, where the size of the neighbourhood is practically equal to the size of the convolution kernel. The result of this inner product is a number which is simply stored at position *n* in the output signal *h*. It is noteworthy that, according to definition (2), the size of the output signal *h* is always equal to or greater than the size of the input signal *f*. This fact is related to the boundary conditions. Let *f*(*n*) = 0 for all *n* < 0 ∨ *n* > *N<sub>f</sub>* and also *g*(*n*) = 0 for all *n* < 0 ∨ *n* > *N<sub>g</sub>*. Then computing expression (2) at position *n* = −1 likely gives a non-zero value, i.e. the output signal becomes larger. It can be derived that the size of the output signal *h* equals *N<sub>f</sub>* + *N<sub>g</sub>* − 1.


#### 2.0.0.1. Analysis of time complexity.

For the computation of *f* ∗ *g* we need to perform *N<sub>f</sub>* *N<sub>g</sub>* multiplications. The computational complexity of this algorithm is polynomial [13], but we must keep in mind what happens when *N<sub>f</sub>* and *N<sub>g</sub>* become larger and, in particular, what happens when we extend the computation into higher dimensions. In the 3-D case, for example, expression (2) changes slightly:

$$\begin{split} h^{\text{3d}}(n\_{\text{x}}, n\_{\text{y}}, n\_{\text{z}}) &= \left( f^{\text{3d}} \ast g^{\text{3d}} \right)(n\_{\text{x}}, n\_{\text{y}}, n\_{\text{z}}) \\ &= \sum\_{i = -\infty}^{\infty} \sum\_{j = -\infty}^{\infty} \sum\_{k = -\infty}^{\infty} f^{\text{3d}}(n\_{\text{x}} - i, n\_{\text{y}} - j, n\_{\text{z}} - k) g^{\text{3d}}(i, j, k) \end{split} \tag{3}$$

Here, *f*<sup>3*d*</sup>, *g*<sup>3*d*</sup> and *h*<sup>3*d*</sup> have a similar meaning as in (2). If we assume ||*f*<sup>3*d*</sup>|| = *N<sub>x</sub><sup>f</sup>* × *N<sub>y</sub><sup>f</sup>* × *N<sub>z</sub><sup>f</sup>* and ||*g*<sup>3*d*</sup>|| = *N<sub>x</sub><sup>g</sup>* × *N<sub>y</sub><sup>g</sup>* × *N<sub>z</sub><sup>g</sup>*, the complexity of our filtering rises from *N<sub>f</sub>* *N<sub>g</sub>* in the 1-D case to *N<sub>x</sub><sup>f</sup> N<sub>y</sub><sup>f</sup> N<sub>z</sub><sup>f</sup> N<sub>x</sub><sup>g</sup> N<sub>y</sub><sup>g</sup> N<sub>z</sub><sup>g</sup>*, which is unusable for larger signals or kernels. Hence, for higher dimensional tasks the use of this approach becomes impractical, as each dimension increases the degree of this polynomial. Although the time complexity of this algorithm is polynomial, this solution is advantageous only if we deal with kernels with a small support. Examples of such kernels are well-known filters from signal/image processing:

$$
\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \qquad\qquad \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}
$$

$$\text{Sobel} \qquad\qquad\qquad \text{Gaussian}$$

For better insight, let us consider the convolution of two relatively small 3-D signals of 1024×1024×100 voxels and 128×128×100 voxels—the example is shown in Fig. 1. This requires roughly 1.05×10<sup>8</sup> × 1.64×10<sup>6</sup> ≈ 1.7×10<sup>14</sup> multiplications. When this convolution was performed in double precision on an Intel Xeon QuadCore 2.83 GHz computer, it took circa 7 days using the basic approach.

#### 2.0.0.2. Parallelization.

Due to its simplicity and lack of specific restrictions, the naïve convolution is still the most popular approach. Its computation is usually sped up by employing large computer clusters that significantly decrease the computation time per machine. This approach [15–17] assumes the availability of a computer cluster, however.
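Since every output sample is independent, the outer loop of the naïve scheme parallelizes trivially. As a small shared-memory illustration of this structure (a hedged sketch, not the cluster codes cited above), the same loop can be split across threads with OpenMP, compiled with `-fopenmp`:

```c
#include <stddef.h>
#include <omp.h>

/* Each output sample h(n) is independent of the others, so the outer
 * loop of the naive convolution can be distributed across P threads. */
void conv1d_naive_omp(const double *f, size_t nf,
                      const double *g, size_t ng,
                      double *h)
{
    const long total = (long)(nf + ng - 1);
    #pragma omp parallel for schedule(static)
    for (long n = 0; n < total; ++n) {
        double acc = 0.0;
        for (size_t i = 0; i < ng; ++i)
            if ((size_t)n >= i && (size_t)n - i < nf)
                acc += f[n - i] * g[i];
        h[n] = acc;
    }
}
```

On a distributed system, the same independence allows splitting the output range across nodes instead of threads.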


**Figure 1.** Example of a 3-D convolution. The images show (a) an artificial (phantom) image of a tissue (1024 × 1024 × 100 pixels), (b) a PSF of an optical microscope (128 × 128 × 100 pixels), and (c) the blurred image, computed by the convolution of the two. Each 3-D image is represented by three 2-D views (XY, YZ, and XZ).

#### **2.1. Convolution on a custom hardware**


Dedicated and configurable hardware, namely digital signal processor (DSP) or field-programmable gate array (FPGA) units, is very popular in the field of signal processing for its promising computational power at both low cost and low power consumption. Although the approach based on the Fourier transform is more popular in digital signal processing for its ability to process enormously long signals, the naïve convolution with a small convolution kernel has also been well studied in the literature on various architectures, especially in the context of 2-D and multi-dimensional convolution.

Shoup [24] proposed techniques for the automatic generation of convolution pipelines for small kernels, such as 3×3 pixels. Benedetti et al. [25] proposed a multi-FPGA solution, using an external memory to store a FIFO buffer and partitioning data among several FPGA units, allowing the size of the convolution kernel to be increased. Perri et al. [26] followed this work by designing a fully reconfigurable FPGA-based 2-D convolution processor. The core of this processor contains four 16-bit SIMD 3×3 convolvers, allowing real-time convolution of an 8-bit or 16-bit image with a 3×3 or 5×5 convolution kernel. Recently, convolution on custom specialized hardware, e.g. FPGA, ASIC, and DSP, has been used to detect objects [27], edges [28], and other features in various real-time applications.

#### **2.2. GPU-based convolution**

From the beginning, graphics processing units (GPU) were designed for visualisation purposes. Since the beginning of the 21st century, they have started to play a role in general computations. This phenomenon is often referred to as general-purpose computing on graphics processing units (GPGPU) [29]. At first, there were no high-level programming languages specifically designed for general computation purposes. Programmers instead had to use shading languages such as Cg, High Level Shading Language (HLSL) or OpenGL Shading Language (GLSL) [29–31] to utilize the texture units. Recently, two programming frameworks have become widely used in the GPGPU community, namely CUDA [32] and OpenCL [33].



For their ability to efficiently process 2-D and 3-D images and videos, GPUs have been utilized in various image processing applications, including those based on convolution. Several convolution algorithms, including the naïve one, are included in the CUDA Computing SDK [34]. The naïve convolution on graphics hardware has also been described in [35] and included in the Nvidia Performance Primitives library [36]. Specific applications, namely Canny edge detection [37, 38] and real-time object detection [39], have been studied in the literature. It can be noted that the problem of computing a rank filter such as the median filter has a naïve solution similar to that of the convolution. Examples can be found in the aforementioned CUDA SDK or in [40, 41].

Basically, convolution is a *memory-bound* problem [42], i.e. the ratio between arithmetic operations and memory accesses is low. Adjacent threads process adjacent signal samples, including a common neighbourhood. Hence, they should share the data via a faster memory space, e.g. *shared memory* [35]. To store input data, programmers can also use *texture memory*, which is read-only but cached. Furthermore, the texture cache exhibits 2-D locality, which makes it naturally suitable especially for 2-D convolutions.

#### **3. Separable convolution**

#### **3.1. Separable convolution**

The naïve algorithm is of polynomial complexity. Furthermore, with each added dimension the polynomial degree rises linearly, which leads to very expensive computation of convolution in higher dimensions. Fortunately, some kernels are so-called *separable* [18, 19]. The convolution with these kernels can be simply decomposed into several lower dimensional (let us say "cheaper") convolutions. Gaussian and Sobel [4] kernels are representatives of this group.

A separable convolution kernel must fulfil the condition that its matrix has rank equal to one. In other words, all the rows must be linearly dependent. Why? Let us construct such a kernel. Given one row vector

$$\vec{u} = (u\_1, u\_2, u\_3, \dots, u\_m)$$

and one column vector

$$\vec{v}^T = (v\_1, v\_2, v\_3, \dots, v\_n)$$

let us convolve them together:

$$
\vec{u} \ast \vec{v} = (u\_1, u\_2, u\_3, \dots, u\_m) \ast \begin{pmatrix} v\_1 \\ v\_2 \\ v\_3 \\ \vdots \\ v\_n \end{pmatrix} = \begin{pmatrix} u\_1 v\_1 & u\_2 v\_1 & u\_3 v\_1 & \dots & u\_m v\_1 \\ u\_1 v\_2 & u\_2 v\_2 & u\_3 v\_2 & \dots & u\_m v\_2 \\ u\_1 v\_3 & u\_2 v\_3 & u\_3 v\_3 & \dots & u\_m v\_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ u\_1 v\_n & u\_2 v\_n & u\_3 v\_n & \dots & u\_m v\_n \end{pmatrix} = A \tag{4}
$$


It is clear that *rank*(*A*) = 1. Here, *A* is a matrix representing some separable convolution kernel, while *u* and *v* are the previously mentioned lower dimensional (cheaper) convolution kernels.
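To illustrate Eq. (4), the following small C program (ours) builds the 3×3 Gaussian kernel shown in Section 2 as the outer product of the 1-2-1 vector with itself:

```c
#include <stdio.h>

/* Build the n-by-m rank-1 kernel A of Eq. (4): A[r][c] = u[c] * v[r].
 * For simplicity the demo fixes the row length to m == 3. */
void outer_product(const double *u, int m, const double *v, int n,
                   double A[][3])
{
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < m; ++c)
            A[r][c] = u[c] * v[r];
}

int main(void)
{
    /* The 1-2-1 smoothing vector; its outer product with itself is
     * the 3x3 Gaussian kernel: rows 1 2 1 / 2 4 2 / 1 2 1. */
    const double u[3] = { 1.0, 2.0, 1.0 };
    double A[3][3];
    outer_product(u, 3, u, 3, A);
    for (int r = 0; r < 3; ++r)
        printf("%g %g %g\n", A[r][0], A[r][1], A[r][2]);
    return 0;
}
```

Every row of the printed matrix is a multiple of *u*, confirming that the kernel has rank one.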

#### 3.1.0.3. Analysis of Time Complexity.


In the previous section, we derived the complexity of the naïve approach. We also explained how the complexity worsens when we increase the dimensionality of the processed data. In case the convolution kernel is separable, we can split the hard problem into a sequence of several simpler problems. Let us recall the 3-D naïve convolution from (3). Assume that *g*<sup>3*d*</sup> is separable, i.e. *g*<sup>3*d*</sup> = *g<sub>x</sub>* ∗ *g<sub>y</sub>* ∗ *g<sub>z</sub>*. Then the expression is simplified in the following way:

$$h^{3d}(n\_x, n\_y, n\_z) = \left(f^{3d} \ast g^{3d}\right)(n\_x, n\_y, n\_z) \tag{5}$$

$$= \left(f^{3d} \ast \left(g\_x \ast g\_y \ast g\_z\right)\right)(n\_x, n\_y, n\_z) \quad \text{/associativity/} \tag{6}$$

$$= \left(\left(\left(f^{3d} \ast g\_x\right) \ast g\_y\right) \ast g\_z\right)(n\_x, n\_y, n\_z) \tag{7}$$

$$= \sum\_{i=-\infty}^{\infty}\left(\sum\_{j=-\infty}^{\infty}\left(\sum\_{k=-\infty}^{\infty} f^{3d}(n\_x - i, n\_y - j, n\_z - k)\, g\_z(k)\right) g\_y(j)\right) g\_x(i) \tag{8}$$

The complexity of such an algorithm is then reduced from *N<sub>x</sub><sup>f</sup> N<sub>y</sub><sup>f</sup> N<sub>z</sub><sup>f</sup> N<sub>x</sub><sup>g</sup> N<sub>y</sub><sup>g</sup> N<sub>z</sub><sup>g</sup>* to *N<sub>x</sub><sup>f</sup> N<sub>y</sub><sup>f</sup> N<sub>z</sub><sup>f</sup>* (*N<sub>x</sub><sup>g</sup>* + *N<sub>y</sub><sup>g</sup>* + *N<sub>z</sub><sup>g</sup>*).

One should keep in mind that the kernel decomposition is usually the only decomposition that can be performed in this task. It is based on the fact that many well-known kernels (Gaussian, Sobel) have some special properties. The input signal, by contrast, is typically unpredictable, and in higher dimensional cases it is unlikely that one could separate it into individual lower-dimensional signals.
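A C sketch of Eqs. (5)-(8) for the 2-D case follows, assuming a zero boundary and centred odd-length kernels (these conventions, and all names, are our choice, not the chapter's):

```c
#include <stddef.h>

/* Separable 2-D convolution: one 1-D pass along rows with gx, then
 * one along columns with gy. Images are w*h doubles in row-major
 * order; the output has the same size and out-of-range pixels are
 * taken as zero. Cost per pixel: nx + ny instead of nx * ny. */
void conv2d_separable(const double *in, double *tmp, double *out,
                      int w, int h,
                      const double *gx, int nx,
                      const double *gy, int ny)
{
    int rx = nx / 2, ry = ny / 2;

    for (int y = 0; y < h; ++y)            /* horizontal pass: in -> tmp */
        for (int x = 0; x < w; ++x) {
            double acc = 0.0;
            for (int i = 0; i < nx; ++i) {
                int xx = x - (i - rx);
                if (xx >= 0 && xx < w)
                    acc += in[y * w + xx] * gx[i];
            }
            tmp[y * w + x] = acc;
        }

    for (int y = 0; y < h; ++y)            /* vertical pass: tmp -> out */
        for (int x = 0; x < w; ++x) {
            double acc = 0.0;
            for (int i = 0; i < ny; ++i) {
                int yy = y - (i - ry);
                if (yy >= 0 && yy < h)
                    acc += tmp[yy * w + x] * gy[i];
            }
            out[y * w + x] = acc;
        }
}
```

The intermediate buffer `tmp` makes the associativity of Eq. (7) explicit: the result of the first 1-D pass is itself convolved by the second.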

#### **3.2. Separable convolution on various architectures**

As separable filters are very popular in many applications, a number of implementations on various architectures can be found in the literature. Among the most favoured filters, the Gaussian filter is often used for pre-processing, for example in optical flow applications [43, 44]. Fialka et al. [45] compared the separable and the fast convolution on graphics hardware and proved both the kernel size and separability to be essential properties that have to be considered when choosing an appropriate implementation. They proved the separable convolution to be more efficient for kernel sizes up to tens of pixels in each dimension, which is usually sufficient if the convolution is used for pre-processing.

The implementation usually does not require particular optimizations as the separable convolution is intrinsically a sequence of 1-D basic convolutions. Programmers should nevertheless consider some tuning steps regarding the memory accesses, as mentioned in Section 2.2. For the case of a GPU implementation, this issue is discussed in [35]. The GPU implementation described in the document is also included in the CUDA SDK [34].


#### **4. Recursive filtering**

Convolution is a process where an inner product, whose size corresponds to the kernel size, is computed again and again at each individual sample. One of the vectors entering this operation (the kernel) is always the same. Hence, we could compute the whole inner product only at one position, while the neighbouring position can be computed as a slightly modified difference with respect to the first one. Analogously, the same is valid for all the following positions. The computation of the convolution using this difference-based approach is called *recursive filtering* [2, 18].

#### 4.0.0.4. Example.

The well-known pure averaging filter in 1-D is defined as follows:

$$h(n) = \sum\_{i=0}^{N-1} f(n-i) \tag{9}$$

where *N* is the width of the filter support.

The performance of this filter worsens with the width of its support. Fortunately, there exists a recursive version of this filter with constant complexity per sample, regardless of the size of its support. Such a filter is no longer defined via the standard convolution but by the recursive formula:

$$h(n) = h(n-1) + f(n) - f(n-N) \tag{10}$$

As for the parallel architectures, Robelley et al. [52] presented a mathematical formulation for computing time-invariant recursive filters on general SIMD DSP architectures. Authors also discuss the speed-up factor regarding to the level of parallelism and the filter order. Among the GPU implementations, we can mention the work of Trebien and Oliveira who implemented recursive filters in CUDA for the purpose of the realistic sound synthesis and processing [53]. In this case, recursive filters were computed in the frequency domain.

In the previous sections, we have introduced the common approaches to compute the convolution in the time (spatial) domain. We mentioned that in some applications, one has to cope with signals of millions of samples where the computation of the convolution requires too much time. Hence, for long or multi-dimensional input signals, the popular approach is to compute the convolution in the frequency domain which is sometimes referred to as the *fast convolution*. As shown in [45], the fast convolution can be even more efficient than the separable version if the number of kernel samples is large enough. Although the concept of the fast Fourier transform [54] and the frequency-based convolution [55] is several decades old, with new architectures upcoming, one has to deal with new problems. For example, the efficient access to the memory was an important issue in 1970s [56] just as it is today [21, 23].

In the following text, we will first recall the Fourier transform along with some of its important properties and the convolution theorem which provides us with a powerful tool for the convolution computation. Subsequently, we will describe the algorithm of the so-called fast Fourier transform, often simply denoted as FFT, and mention some notable implementations of the FFT. Finally, we will summarize the benefits and drawbacks of the

The Fourier transform *F* = F [ *f* ] of a function *f* and the inverse Fourier transform *f* =

2*π*

*N*

 <sup>+</sup><sup>∞</sup> −∞

> *N*−1 ∑ *k*=0

*<sup>F</sup>*(*ω*)<sup>e</sup> *<sup>j</sup>ω<sup>t</sup>*

d*ω*. (11)

*<sup>N</sup>* , respectively,

*<sup>F</sup>*(*k*)<sup>e</sup> *<sup>j</sup>*(2*π*/*N*)*kn* (12)

<sup>2</sup>*<sup>π</sup>* and <sup>1</sup>

*<sup>f</sup>*(*t*)e−*jtω*d*t*, *<sup>f</sup>*(*t*) <sup>≡</sup> <sup>1</sup>

The discrete finite equivalents of the aforementioned transforms are defined as follows:

*<sup>f</sup>*(*n*)e−*j*(2*π*/*N*)*nk*, *<sup>f</sup>*(*n*) <sup>≡</sup> <sup>1</sup>

guarantee that the identity *f* = F−<sup>1</sup> [F [ *f* ]] is maintained. The exponential function <sup>e</sup>−*j*(2*π*/*N*) is called the base function. For the sake of simplicity, we will refer to it as *WK*.

where *k*, *n* = 0, 1, . . . , *N* − 1. The so-called normalization factors <sup>1</sup>

Another problem to be considered is the numerical precision [57].

Parallel architectures.

**5. Fast convolution**

fast convolution.

**5.1. Fourier transform**

F−<sup>1</sup> [*F*] are defined as follows:

*F*(*ω*) ≡

*F*(*k*) ≡

 <sup>+</sup><sup>∞</sup> −∞

*N*−1 ∑ *n*=0

10.5772/51942

187

The performance of this filter worsens with the width *N* of its support. Fortunately, there exists a recursive version of this filter with constant complexity regardless of the size of its support. Such a filter is no longer defined via the standard convolution but by the recursive formula:

$$h(n) = h(n-1) + f(n) - f(n-N) \tag{10}$$
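A minimal Python sketch (our illustration; $N$ denotes the size of the filter support) contrasting the direct evaluation of Eq. (9) with the recursive update of Eq. (10):

```python
import numpy as np

def box_direct(f, N):
    """Running sum over the last N samples via convolution: O(N) work per sample."""
    return np.convolve(f, np.ones(N))[:len(f)]

def box_recursive(f, N):
    """The same filter via Eq. (10): O(1) work per sample, independent of N."""
    h = np.zeros(len(f))
    for n in range(len(f)):
        h[n] = (h[n - 1] if n > 0 else 0.0) + f[n] - (f[n - N] if n >= N else 0.0)
    return h

f = np.random.default_rng(1).standard_normal(1000)
assert np.allclose(box_direct(f, 31), box_recursive(f, 31))
```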

The transformation of a standard convolution into recursive filtering is not a simple task. There are three main issues that should be solved:

1. replication – given a slow (but correctly working) non-recursive filter, find its recursive version
2. stability – the recursive formula may cause the computation to diverge
3. accuracy – the recursion may cause the accumulation of small errors

The transformation is a rather complex task, and the so-called *Z*-transform [22] is typically employed in this process. Each recursive filter may be designed from scratch like any other filter. In practice, however, the standard well-known filters are used as bases and their recursive counterparts are subsequently derived. There are two principal approaches to do so:

• analytically – the filter is constructed step by step via mathematical formulas [46]
• numerically – the filter is derived using numerical methods [47, 48]
#### **4.1. Recursive filters on various architectures**

Streaming architectures.

Recursive filtering is a popular approach especially on streaming architectures such as FPGAs. The data can be processed in a stream, keeping the memory requirements at a minimum. This allows moving the computation to relatively small and cheap embedded systems. Recursive filters are thus used in various real-time applications such as edge detection [49], video filtering [50], and optical flow [51].


Parallel architectures.


As for parallel architectures, Robelley et al. [52] presented a mathematical formulation for computing time-invariant recursive filters on general SIMD DSP architectures. The authors also discuss the speed-up factor with respect to the level of parallelism and the filter order. Among the GPU implementations, we can mention the work of Trebien and Oliveira, who implemented recursive filters in CUDA for realistic sound synthesis and processing [53]. In this case, the recursive filters were computed in the frequency domain.

#### **5. Fast convolution**

In the previous sections, we introduced the common approaches to computing the convolution in the time (spatial) domain. We mentioned that in some applications one has to cope with signals of millions of samples, where the computation of the convolution requires too much time. Hence, for long or multi-dimensional input signals, the popular approach is to compute the convolution in the frequency domain, which is sometimes referred to as the *fast convolution*. As shown in [45], the fast convolution can be even more efficient than the separable version if the number of kernel samples is large enough. Although the concepts of the fast Fourier transform [54] and the frequency-based convolution [55] are several decades old, new problems arise as new architectures emerge. For example, efficient access to memory was an important issue in the 1970s [56], just as it is today [21, 23]. Another problem to be considered is the numerical precision [57].

In the following text, we will first recall the Fourier transform along with some of its important properties and the convolution theorem which provides us with a powerful tool for the convolution computation. Subsequently, we will describe the algorithm of the so-called fast Fourier transform, often simply denoted as FFT, and mention some notable implementations of the FFT. Finally, we will summarize the benefits and drawbacks of the fast convolution.

#### **5.1. Fourier transform**

The Fourier transform $F = \mathcal{F}[f]$ of a function $f$ and the inverse Fourier transform $f = \mathcal{F}^{-1}[F]$ are defined as follows:

$$F(\omega) \equiv \int\_{-\infty}^{+\infty} f(t) \mathbf{e}^{-j t \omega} \, \mathrm{d}t, \qquad f(t) \equiv \frac{1}{2\pi} \int\_{-\infty}^{+\infty} F(\omega) \mathbf{e}^{j \omega t} \, \mathrm{d}\omega. \tag{11}$$

The discrete finite equivalents of the aforementioned transforms are defined as follows:

$$F(k) \equiv \sum\_{n=0}^{N-1} f(n) \mathbf{e}^{-j(2\pi/N)nk}, \qquad f(n) \equiv \frac{1}{N} \sum\_{k=0}^{N-1} F(k) \mathbf{e}^{j(2\pi/N)kn} \tag{12}$$

where $k, n = 0, 1, \ldots, N-1$. The so-called normalization factors $\frac{1}{2\pi}$ and $\frac{1}{N}$, respectively, guarantee that the identity $f = \mathcal{F}^{-1}[\mathcal{F}[f]]$ is maintained. The exponential function $e^{-j(2\pi/N)}$ is called the base function. For the sake of simplicity, we will refer to it as $W_N$.


**Figure 2.** Example of the so-called windowing effect produced by signal *f* (a) and kernel *g* (b). The circular convolution causes border effects as seen in (c). The properly computed basic convolution is shown in (d).

If the sequence *f*(*n*), *n* = 0, 1, . . . , *N* − 1, is real, the discrete Fourier transform *F*(*k*) exhibits some specific properties, in particular:

$$F(k) = F(N-k)^{\*}.\tag{13}$$


This means that in the output signal *F*, only half of the samples are useful; the rest is redundant. As real signals are typical in many practical applications, most popular FT and FFT implementations therefore provide users with special functions for handling real signals, in order to save time and memory.
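The following short NumPy check (our illustration) evaluates Eq. (12) literally, confirms the agreement with a library FFT, and verifies the symmetry of Eq. (13) for a real input signal:

```python
import numpy as np

N = 8
f = np.random.default_rng(2).standard_normal(N)     # a real test signal
n = np.arange(N)
W = np.exp(-2j * np.pi / N)                         # the base function W_N

F = np.array([(f * W ** (k * n)).sum() for k in range(N)])  # Eq. (12), forward DFT
assert np.allclose(F, np.fft.fft(f))                        # agrees with the library FFT
assert np.allclose(F[1:], np.conj(F[1:][::-1]))             # Eq. (13): F(k) = F(N-k)*
```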

#### **5.2. Convolution theorem**

According to the convolution theorem, the Fourier transform of the convolution of two signals $f$ and $g$ is equal to the product of their Fourier transforms $\mathcal{F}[f]$ and $\mathcal{F}[g]$ [58]:

$$\mathcal{F}\left[f \ast g\right] = \mathcal{F}\left[f\right]\mathcal{F}\left[g\right]. \tag{14}$$

In the following text, we will sometimes refer to the convolution computed by applying Eq. (14) as the "classical" fast convolution algorithm.

In the discrete case, the same holds for periodic signals (sequences) and is sometimes referred to as the circular or cyclic convolution [22]. However, in practical applications one usually deals with non-periodic finite signals. This results in the so-called windowing problem [59], causing undesirable artefacts in the output signals (see Fig. 2). In practice, the problem is usually solved by imposing periodicity on the kernel, applying a so-called windowing function, or padding the kernel with zero values. One also has to consider the sizes of the input signal and the convolution kernel, which have to be equal; generally, this is solved by padding both the signal and the kernel with zero values. The size of both padded signals entering the convolution is hence $N = N^f + N^g - 1$, where $N^f$ and $N^g$ are the numbers of signal and kernel samples, respectively. The equivalent property holds for the multi-dimensional case. The most time-demanding operation of the fast convolution approach is the Fourier transform, which can be computed by the fast Fourier transform algorithm.
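A minimal NumPy sketch of the "classical" fast convolution (our illustration): both signals are zero-padded to $N = N^f + N^g - 1$ before applying Eq. (14), which avoids the circular border effects shown in Fig. 2:

```python
import numpy as np

def fast_convolve(f, g):
    """Fast convolution via Eq. (14): zero-pad to Nf + Ng - 1, multiply the
    spectra point-wise, and transform back."""
    N = len(f) + len(g) - 1
    H = np.fft.fft(f, N) * np.fft.fft(g, N)   # fft(x, N) zero-pads x to length N
    return np.real(np.fft.ifft(H))

rng = np.random.default_rng(3)
f, g = rng.standard_normal(100), rng.standard_normal(30)
assert np.allclose(fast_convolve(f, g), np.convolve(f, g))
```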


**Figure 3.** The basic two radix-2 FFT algorithms: decimation-in-time and decimation-in-frequency. Demonstration on an input signal of 8 samples.

The time complexity of the fast convolution is hence equal to the complexity of the FFT, that is, $O(N \log N)$. A detailed discussion of the complexity is provided in Section 6.

#### **5.3. Fast Fourier transform**


In 1965, Cooley and Tukey [60] proposed an algorithm for the fast computation of the Fourier transform. This widely known algorithm has been improved over the years and optimized for various signal lengths, but the basic idea has remained the same. The problem is handled in a divide-and-conquer manner by splitting the input signal into *N* parts<sup>1</sup> and processing the individual parts recursively. Without loss of generality, we will recall the idea of the FFT for *N* = 2, which is the simplest situation. There are two fundamental approaches to splitting the signal, called *decimation in time (DIT)* and *decimation in frequency (DIF)* [58].

Decimation in time (DIT).

Assuming that *N* is even, the radix-2 decimation-in-time algorithm splits the input signal *f*(*n*), *n* = 0, 1, . . . , *N* − 1 into the parts *fe*(*n*′) and *fo*(*n*′), *n*′ = 0, 1, . . . , *N*/2 − 1 of even and odd samples, respectively. By recursive application of this approach, the discrete Fourier transforms *Fe* and *Fo* of the two parts are computed. Finally, the resulting Fourier transform *F* can be computed as follows:

$$F(k) = F_e(k) + W_N^k\, F_o(k) \tag{15}$$

where *k* = 0, 1, . . . , *N* − 1. The signals *Fe* and *Fo* are of half length; however, they are periodic, hence

$$F_e(k' + N/2) = F_e(k'), \qquad F_o(k' + N/2) = F_o(k') \tag{16}$$

for any *k*′ = 0, 1, . . . , *N*/2 − 1. The algorithm is shown in Fig. 3(a).

<sup>1</sup> The individual variants of the algorithm for a particular *N* are called radix-*N* algorithms.
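A compact recursive rendering of the radix-2 DIT scheme in Python (our sketch; it assumes the signal length is a power of two):

```python
import numpy as np

def fft_dit(f):
    """Radix-2 decimation-in-time FFT, implementing Eq. (15) together with the
    periodicity of Eq. (16) (note that W_N^{k+N/2} = -W_N^k)."""
    N = len(f)
    if N == 1:
        return np.asarray(f, dtype=complex)
    Fe, Fo = fft_dit(f[0::2]), fft_dit(f[1::2])        # even / odd samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)    # twiddle factors W_N^k
    return np.concatenate([Fe + W * Fo, Fe - W * Fo])  # F(k) and F(k + N/2)

x = np.random.default_rng(4).standard_normal(16)
assert np.allclose(fft_dit(x), np.fft.fft(x))
```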


Decimation in frequency (DIF).

Given a signal *f* of even length *N*, the sequences *fr* and *fs* of half length are created as follows:

$$f_r(n') = f(n') + f(n' + N/2), \qquad f_s(n') = \left[f(n') - f(n' + N/2)\right] W_N^{n'}. \tag{17}$$


Then the Fourier transforms *Fr* and *Fs* fulfill the following property: *Fr*(*k*′) = *F*(2*k*′) and *Fs*(*k*′) = *F*(2*k*′ + 1) for any *k*′ = 0, 1, . . . , *N*/2 − 1. Hence, the sequences *fr* and *fs* are processed recursively, as shown in Fig. 3(b). It is easy to deduce the inverse equation from Eq. (17):

$$f(n') = \frac{1}{2}\left[f_r(n') + f_s(n')\, W_N^{-n'}\right], \qquad f(n' + N/2) = \frac{1}{2}\left[f_r(n') - f_s(n')\, W_N^{-n'}\right]. \tag{18}$$
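The following NumPy fragment (our illustration) verifies numerically that the split of Eq. (17) yields exactly the even and odd spectral samples, and that the composition of Eq. (18) recovers the original signal:

```python
import numpy as np

N = 16
f = np.random.default_rng(5).standard_normal(N).astype(complex)
W = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # W_N^{n'}

# Eq. (17): one decimation-in-frequency split
fr = f[:N // 2] + f[N // 2:]
fs = (f[:N // 2] - f[N // 2:]) * W

F = np.fft.fft(f)
assert np.allclose(np.fft.fft(fr), F[0::2])   # Fr(k') = F(2k')
assert np.allclose(np.fft.fft(fs), F[1::2])   # Fs(k') = F(2k' + 1)

# Eq. (18): the composition step recovers f from the two halves
f_back = np.concatenate([0.5 * (fr + fs / W), 0.5 * (fr - fs / W)])
assert np.allclose(f_back, f)
```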

#### **5.4. The most popular FFT implementations**

#### On CPU.

One of the most popular FFT implementations ever is the so-called Fastest Fourier Transform in the West (FFTW) [61]. It is kept up to date and is available for download at http://www.fftw.org/. According to the authors' comprehensive benchmark [62], it is still one of the fastest CPU implementations available. The top performance is achieved by using multiple CPU threads, the extended instruction sets of modern processors such as SSE/SSE2, optimized radix-*N* algorithms for *N* up to 7, optimized functions for purely real input data, etc. Other popular CPU implementations can be found, e.g., in the Intel libraries called Intel Integrated Performance Primitives (IPP) [63] and Intel Math Kernel Library (MKL) [64]. In terms of performance, they are comparable with the FFTW.

#### On other architectures.

For graphics hardware, there exist several implementations in the literature [65–67]. Probably the most widely used one is the CUFFT library by Nvidia. Although it is dedicated to Nvidia graphics cards, it is popular due to its good performance and ease of use. It also contains optimized functions for real input data. The FFT has also been implemented on various other architectures, including DSPs [68] and FPGAs [69].

#### **5.5. Benefits and drawbacks of the fast convolution**

To summarize this section, fast convolution is the most efficient approach if both signal and kernel contain thousands of samples or more, or if the kernel is slightly smaller but non-separable. Thanks to numerous implementations, it is accessible to a wide range of users on various architectures. The main drawbacks are the windowing problem, the relatively lower numerical precision, and considerable memory requirements due to the signal padding. In the following, we will examine the memory usage in detail and propose several approaches to optimize it on modern parallel architectures.


#### **6. Decomposition in the time domain**

In this section, we will focus on the decomposition of the fast convolution in the time domain. We will provide an analysis of the time and space complexity. Regarding the former, we will focus on the number of additions and multiplications needed for the computation of the studied algorithms.

Utilizing the convolution theorem and the fast Fourier transform, the 1-D convolution of a signal *f* and a kernel *g* requires

$$\left(N^f + N^g\right)\left[\frac{9}{2}\log_2\left(N^f + N^g\right) + 1\right] \tag{19}$$

steps [8]. Here, the term $(N^f + N^g)$ means that the processed signal *f* was zero padded<sup>2</sup> to prevent the overlap effect caused by the circular convolution; the kernel was modified in the same way. Another advantage of using the Fourier transform stems from its separability. Convolving two 3-D signals $f^{3d}$ and $g^{3d}$, where $\|f^{3d}\| = N^f_x \times N^f_y \times N^f_z$ and $\|g^{3d}\| = N^g_x \times N^g_y \times N^g_z$, we need only

$$\left(N_x^f + N_x^g\right)\left(N_y^f + N_y^g\right)\left(N_z^f + N_z^g\right)\left[\frac{9}{2}\log_2\left(\left(N_x^f + N_x^g\right)\left(N_y^f + N_y^g\right)\left(N_z^f + N_z^g\right)\right) + 1\right] \tag{20}$$

steps in total.
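As a rough numeric reading of Eq. (19) (our example with hypothetical sizes), the following snippet compares the step count of the fast approach with the $N^f N^g$ steps of the basic convolution:

```python
import numpy as np

Nf, Ng = 1 << 20, 1 << 12                          # hypothetical sample counts
fast = (Nf + Ng) * (4.5 * np.log2(Nf + Ng) + 1)    # Eq. (19)
basic = Nf * Ng                                    # basic (direct) convolution
print(f"fast: {fast:.2e} steps, basic: {basic:.2e} steps, ratio: {basic / fast:.0f}x")
```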


Up to now, this method seems to be optimal. Before we proceed, let us look at the space complexity of this approach. If we do not take into account the buffers for the input/output signals, and if we serialize both Fourier transforms, we need space for two equally aligned Fourier signals and some negligible Fourier transform workspace. In total, this is

$$\left(N^f + N^g\right) \cdot C \tag{21}$$

bytes, where $(N^f + N^g)$ is the size of one padded signal and $C$ is a constant dependent on the required precision (single, double, or long double). If double precision is required, for example, then $C = 2 \cdot \texttt{sizeof(double)}$, which corresponds to the two Fourier signals used by a real-valued FFT. In the 3-D case, when $\|f^{3d}\| = N^f_x \times N^f_y \times N^f_z$ and $\|g^{3d}\| = N^g_x \times N^g_y \times N^g_z$, the space needed by the aligned signals is proportionally higher: $(N^f_x + N^g_x)(N^f_y + N^g_y)(N^f_z + N^g_z) \cdot C$ bytes.
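For intuition, the following snippet (our example with hypothetical 3-D sizes) evaluates this memory bound for double precision:

```python
# Memory occupied by the padded Fourier signals in 3-D, per the formula above,
# assuming double precision: C = 2 * sizeof(double) = 16 bytes.
Nf = (1024, 1024, 128)    # hypothetical 3-D image stack
Ng = (128, 128, 128)      # hypothetical 3-D kernel
C = 16
size = C * (Nf[0] + Ng[0]) * (Nf[1] + Ng[1]) * (Nf[2] + Ng[2])
print(f"{size / 2**30:.1f} GiB")   # ~5.1 GiB -- already problematic on common hardware
```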

Keeping in mind that, due to the lack of available memory, the direct computation of the fast convolution is not feasible on common computers, we will try to split the whole task into several subtasks. This means that the input signal and the kernel will be split into smaller pieces, so-called *tiles*, which need not be of the same size. In this way, we will try to reduce the memory requirements while keeping the efficiency of the whole convolution process, as proposed in [23].

<sup>2</sup> The size of the padded signal should be exactly $(N^f + N^g - 1)$. For the sake of simplicity, we reduced this term to $(N^f + N^g)$, as we suppose $N^f \gg 1$ and $N^g \gg 1$.


**Figure 4.** Using the overlap-save and overlap-add methods, the input data can be segmented into smaller blocks and convolved separately. Finally, the sub-parts are concatenated (a) or summed (b) together.

#### **6.1. Signal tiling**

Splitting the input signal *f* into smaller disjoint tiles *f*1, *f*2, . . . , *fm*, then performing *m* smaller convolutions *fi* ∗ *g*, *i* = 1, 2, . . . , *m*, and finally concatenating the results together while discarding the overlaps is a well-known algorithm in digital signal processing. The implementation is commonly known as the *overlap-save method* [22].

#### 6.1.0.5. Method.

Without loss of generality, we will focus on the manipulation of just one *tile fi*; the other tiles are processed in the same way. The tile *fi* is uniquely determined by its size and its shift with respect to the origin of *f*. Its size and shift also uniquely determine the area in the output signal *h* where the expected result of *fi* ∗ *g* is going to be stored. In order to guarantee that the convolution *fi* ∗ *g* correctly computes the appropriate part of the output signal *h*, the tile *fi* must be equipped with some overlap with its neighbours. The size of this overlap is equal to the size of the whole kernel *g*. Hence, the tile *fi* is extended equally on both sides, and we get *f*′*i*. If the tile *fi* is a boundary one, it is padded with zero values. As the fast convolution requires the signal and the kernel to be of the same size, the kernel *g* must also be extended; it is simply padded with zeros, which produces *g*′. As soon as *f*′*i* and *g*′ are prepared, the convolution *f*′*i* ∗ *g*′ can be performed, and the result is cropped to the size ||*fi*||. Then all the convolutions *f*′*i* ∗ *g*′, *i* = 1, 2, . . . , *m*, are successively performed, and the output signal *h* is obtained by *concatenating* the individual results together. A general form of the method is shown in Fig. 4(a).
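A minimal 1-D rendering of the overlap-save method in NumPy (our sketch; the tile size is a free parameter): each tile is extended by an overlap of $N^g - 1$ samples, convolved circularly via the FFT, and the alias-free part is concatenated into the output:

```python
import numpy as np

def overlap_save(f, g, tile=256):
    """Overlap-save (Fig. 4(a)): FFT size is tile + len(g) - 1 per block; the
    first len(g) - 1 output samples of each block are circularly wrapped and
    therefore discarded."""
    Ng = len(g)
    L = len(f) + Ng - 1                  # length of the full convolution
    N = tile + Ng - 1                    # FFT size per tile
    G = np.fft.fft(g, N)                 # kernel spectrum, reused for every tile
    fpad = np.concatenate([np.zeros(Ng - 1), f, np.zeros(N)])
    h = np.empty(L)
    for start in range(0, L, tile):
        block = fpad[start:start + N]    # tile plus its left overlap
        conv = np.real(np.fft.ifft(np.fft.fft(block) * G))
        valid = conv[Ng - 1:]            # keep only the alias-free samples
        h[start:start + tile] = valid[:min(tile, L - start)]
    return h

rng = np.random.default_rng(6)
f, g = rng.standard_normal(1000), rng.standard_normal(65)
assert np.allclose(overlap_save(f, g), np.convolve(f, g))
```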


#### 6.1.0.6. Analysis of time complexity.


Let us inspect the memory requirements for this approach. As the filtered signal *f* is split into *m* pieces, the respective memory requirements are lowered to

$$\left(\frac{N^f}{m} + N^g\right) \cdot C \tag{22}$$

bytes. Concerning the time complexity, after splitting the signal *f* into *m* tiles, we need to perform

$$(N^f + mN^g) \left[ \frac{9}{2} \log\_2 \left( \frac{N^f}{m} + N^g \right) + 1 \right] \tag{23}$$

multiplications in total. If there is no division (*m* = 1), we get the time complexity of the fast approach. If the division is total (*m* = $N^f$), we get an even worse complexity than that of the basic convolution. The higher the required level of splitting, the worse the complexity. Therefore, we can conclude that splitting only the input signal into tiles does not help.

#### **6.2. Kernel tiling**

From the previous text, we see that splitting only the input signal *f* might be inefficient. It may even happen that the kernel *g* is so large that splitting only the signal *f* does not reduce the memory requirements sufficiently. As convolution is a commutative operator, one could recommend swapping the input signal and the kernel. This may help, namely when the signal *f* is small and the kernel *g* is very large: as soon as the signal and the kernel are swapped, we can simply apply the overlap-save method. However, this approach fails when both the signal and the kernel are too large. Let us therefore decompose the kernel *g* as well.

#### 6.2.0.7. Method.

Keeping in mind that the input signal *f* has already been decomposed into *m* tiles using the overlap-save method, we can focus on the manipulation of just one tile *fi*, *i* = 1, 2, . . . , *m*. For the computation of the convolution of the selected tile *fi* with the large kernel *g*, we will employ the so-called *overlap-add method* [22]. This method splits the kernel *g* into *n* disjoint (non-overlapping) pieces *gj*, *j* = 1, 2, . . . , *n*. Then it performs *n* cheaper convolutions *fi* ∗ *gj* and finally adds the results together, preserving the appropriate overruns.

Without loss of generality, we will focus on the manipulation of just one *kernel tile gj*. Prior to the computation, the selected tile *gj* has to be aligned to the size ||*fi*|| + ||*gj*||. This is done simply by padding *gj* with zeros equally on both sides; in this way, we get the tile *g*′*j*. The signal tile *fi* is also aligned to the size ||*fi*|| + ||*gj*||. However, *f*′*i* is not padded with zeros: it is created from *fi* by extending its support equally on both sides.

Each kernel tile *gj* has its positive shift *sj* with respect to the origin of *g*. This shift is very important for the further computation and cannot be omitted. Before we perform the convolution *f*′*i* ∗ *g*′*j*, we must shift the tile *f*′*i* within *f* by *sj* samples to the left. The reason originates from the idea of the kernel decomposition and the minus sign in Eq. (2), which causes the whole kernel to be flipped. As soon as the convolution *f*′*i* ∗ *g*′*j* is performed, its result is cropped to the size ||*fi*|| and *added* to the output signal *h* at the position defined by the overlap-save method. Finally, all the convolutions *f*′*i* ∗ *g*′*j*, *j* = 1, 2, . . . , *n*, are performed to get the complete result for the one given tile *fi*. A general form of the method is shown in Fig. 4(b).
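A minimal 1-D sketch of the kernel tiling via the overlap-add method (our illustration; in 1-D, the bookkeeping of the shifts $s_j$ reduces to a slice offset):

```python
import numpy as np

def overlap_add_kernel(f, g, tile_g=64):
    """Overlap-add over kernel tiles (Fig. 4(b)): split g into disjoint pieces,
    convolve each piece with f, and add the partial results at the shift s_j."""
    h = np.zeros(len(f) + len(g) - 1)
    for s in range(0, len(g), tile_g):   # s is the shift s_j of tile g_j within g
        gj = g[s:s + tile_g]
        hj = np.convolve(f, gj)          # a cheaper convolution f * g_j
        h[s:s + len(hj)] += hj           # add, preserving the appropriate overruns
    return h

rng = np.random.default_rng(7)
f, g = rng.standard_normal(500), rng.standard_normal(200)
assert np.allclose(overlap_add_kernel(f, g), np.convolve(f, g))
```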


The complete computation of the convolution across all signal and kernel tiles is sketched in Algorithm 1.

**Algorithm 1.** Divide-and-conquer approach applied to the convolution over large data.

(*f*, *g*) ← (input signal, kernel)
*f* → *f*1, *f*2, . . . , *fm* {*split 'f' into tiles according to the overlap-save scheme*}
*g* → *g*1, *g*2, . . . , *gn* {*split 'g' into tiles according to the overlap-add scheme*}
*h* ← 0 {*create the output signal 'h' and fill it with zeros*}
**for** *i* = 1 to *m* **do**
  **for** *j* = 1 to *n* **do**
    *hij* ← convolve(*fi*, *gj*) {*use fast convolution*}
    *hij* ← discard\_overruns(*hij*) {*discard hij overruns following overlap-save output rules*}
    *h* ← *h* + shift(*hij*) {*add hij to h following overlap-add output rules*}
  **end for**
**end for**
Output ← *h*
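A compact 1-D analogue of Algorithm 1 in Python (our sketch, exploiting the fact that with disjoint tiles the convolution distributes over both decompositions, so that $f \ast g = \sum_{i,j} (f_i \ast g_j)$ placed at the combined tile offsets; each small product could itself use the fast convolution):

```python
import numpy as np

def convolve_tiled(f, g, tf=128, tg=64):
    """Divide-and-conquer convolution over m x n tile pairs (cf. Algorithm 1)."""
    h = np.zeros(len(f) + len(g) - 1)             # output signal filled with zeros
    for si in range(0, len(f), tf):               # signal tiles f_i
        fi = f[si:si + tf]
        for sj in range(0, len(g), tg):           # kernel tiles g_j
            gj = g[sj:sj + tg]
            hij = np.convolve(fi, gj)             # small f_i * g_j (FFT-able)
            h[si + sj:si + sj + len(hij)] += hij  # place at the combined shift
    return h

rng = np.random.default_rng(8)
f, g = rng.standard_normal(700), rng.standard_normal(300)
assert np.allclose(convolve_tiled(f, g), np.convolve(f, g))
```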

#### 6.2.0.8. Analysis of time complexity.

Let us suppose the signal *f* is split into *m* tiles and kernel *g* is decomposed into *n* tiles. The time complexity of the fast convolution *fi*∗*gj* is

$$\left(\frac{N^f}{m} + \frac{N^g}{n}\right)\left[\frac{9}{2}\log_2\left(\frac{N^f}{m} + \frac{N^g}{n}\right) + 1\right]. \tag{24}$$

We have *m* signal tiles and *n* kernel tiles. In order to perform the complete convolution *f* ∗*g* we have to perform *m*×*n* convolutions (see the nested loops in Algorithm 1) of the individual signal and kernel tiles. In total, we have to complete

$$\left(nN^f + mN^g\right)\left[\frac{9}{2}\log_2\left(\frac{N^f}{m} + \frac{N^g}{n}\right) + 1\right] \tag{25}$$

steps. One can clearly see that without any division (*m* = *n* = 1), we get the complexity of the fast convolution, i.e. the class $O((N^f + N^g)\log(N^f + N^g))$. For total division (*m* = $N^f$ and *n* = $N^g$), we obtain the basic convolution, i.e. the complexity class $O(N^f N^g)$.


**Figure 5.** A graph of a function Φ(*x*, *y*) that represents the time complexity of tiled convolution. The *x*-axis and *y*-axis correspond to number of samples in signal and kernel tile, respectively. The evident minimum of function Φ(*x*, *y*) occurs in the location, where both variables (sizes of tiles) are maximized and equal at the same time.

Concerning the space occupied by our convolution algorithm, we need

$$\left(\frac{N^f}{m} + \frac{N^g}{n}\right) \cdot C \tag{26}$$

bytes, where *C* is again the precision-dependent constant and *m*, *n* are the levels of division of the signal *f* and the kernel *g*, respectively.

#### 6.2.0.9. Algorithm optimality.


We have now designed an algorithm that splits the signal *f* into *m* tiles and the kernel *g* into *n* tiles. Now we will answer the question regarding the optimal way of splitting the input signal and the kernel. As the relationship between *m* and *n* is hard to express, and $N^f$ and $N^g$ are constants, let us define the following substitution: $x = N^f/m$ and $y = N^g/n$. Here *x* and *y* stand for the sizes of the signal and kernel tiles, respectively. Applying this substitution to Eq. (25) and simplifying, we get the function

$$\Phi(x, y) = N^f N^g \left(\frac{1}{x} + \frac{1}{y}\right)\left[\frac{9}{2}\log_2(x + y) + 1\right] \tag{27}$$

The plot of this function is depicted in Figure 5. The minimum of this function is reached if and only if *x* = *y* and both variables *x* and *y* are maximized, i.e. the input signal and kernel tiles should be of the same size (equal number of samples), and they should be as large as possible. In order to reach the optimal solution, the size of the tile should also be a power of small primes [70]. In this sense, it is recommended to fulfill both criteria put on the tile size: the maximality (as stated above) and the capability of a simple decomposition into small primes.
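A quick numerical confirmation (our sketch; the constant factor $N^f N^g$ is dropped, since it only scales the surface): evaluating $\Phi$ on a grid shows the minimum on the diagonal $x = y$ at the largest admissible tile sizes:

```python
import numpy as np

x = np.linspace(1, 512, 512)
X, Y = np.meshgrid(x, x)
Phi = (1 / X + 1 / Y) * (4.5 * np.log2(X + Y) + 1)   # Eq. (27) without N^f N^g
i, j = np.unravel_index(np.argmin(Phi), Phi.shape)
print(X[i, j], Y[i, j])   # -> 512.0 512.0: equal and maximal tile sizes
```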

#### **6.3. Extension to higher dimensions**


All the previous statements relate only to a 1-D signal. Provided both the signal and the kernel are 3-dimensional and the tiling process is identical in all the axes, we can combine Eq. (20) and Eq. (25) in order to get:

$$\left(nN_x^f + mN_x^g\right)\left(nN_y^f + mN_y^g\right)\left(nN_z^f + mN_z^g\right)\left[\frac{9}{2}\log_2\left(\left(\frac{N_x^f}{m} + \frac{N_x^g}{n}\right)\left(\frac{N_y^f}{m} + \frac{N_y^g}{n}\right)\left(\frac{N_z^f}{m} + \frac{N_z^g}{n}\right)\right) + 1\right] \tag{28}$$


• *p* = *P* . . . The space complexity is the same as in the original approach. The time complexity is slightly better but practically it brings no advantage due to lots of memory accesses. The efficiency of this approach would be brought to evidence only if *P* ≫ 1. As the standard multi-core processors are typically equipped with only 2, 4 or 8 cores,

Regarding computer clusters the problem with one shared memory is solved as each computer has its private memory. Therefore, the total number of multiplications (see

constant representing the overheads and the cost of data transmission among the individual computers. The computation becomes effective only if *P* > *B*. The memory requirements for each node remain the same as in the non-parallelized case as each computer takes care of its

Just as the concept of the decomposition in the spatial (time) domain, the decomposition in the frequency domain can be used for the fast convolution algorithm, in order to (i) decrease the required amount of memory available per processing unit, (ii) employ multiple processing units without need of extensive data transfers between the processors. In the following text, we introduce the concept of the decomposition [21] along with optimization steps suitable for purely real data [71]. Subsequently, we present the results on achieved on a current graphics hardware. Finally, we conclude the applications and architectures where

In Section 5.3, the decimation-in-frequency algorithm was recalled. The DIF can be used not only to compute FFT itself but also to decompose the fast convolution. This algorithm can be divided into several phases, namely (i) so-called *decomposition* into parts using Eq. (17), (ii) the Fourier transforms of the parts, (iii) the convolution by pointwise multiplication itself, (iv) the inverse Fourier transforms, and (v) so-called *composition* using Eq. (18). In the following paragraph, we provide the mathematical background for the individual phases. The scheme

By employing Eq. (17), both the input signal *f* and *g* can be divided into sub-parts *fr*, *fs*

In the first and the last phase, it is inevitable to store the whole input signals in the memory. Here, the memory requirements are equal to those in the classical fast convolution algorithm. However, in the phases (ii)–(iv) which are by far the most computationally extensive, the

) = *F*(2*k*′ + 1) and the equivalent property is held for *Gr* and *Gs*, by applying FFT on *Fr Fs*, *Gr*, and *Gs*, individually, we obtain two separate parts of both the signal and the kernel. Subsequently, by computing the point-wise multiplication *Hr* = *FrGr* and *Hs* = *FsGs*, respectively, we obtain two separate parts of the Fourier transform of the convolution *h* = *f* ∗ *g*. Finally, the result *h* is obtained by applying Eq. (18) to the inverse Fourier transforms

and *gr*, *gs*, respectively. As the Fourier transforms *Fr* and *Fs* satisfy *Fr*(*k*′

*<sup>P</sup>* , where *P* is the number of available computers and *B* is a

neither this approach was found to be very useful.

**7. Decomposition in the frequency domain**

**7.1. Decomposition using the DIF algorithm**

description of the algorithm is shown in Fig. 6(a).

6.4.0.11. On computer clusters.

Eq. (25)) is modified by factor *<sup>B</sup>*

own private memory space.

the approach can be used.

*Fs*(*k*′

*hr* and *hs*.

10.5772/51942

197

) = *F*(2*k*′

) and

This statement can be further generalized to higher dimensions or to an irregular tiling process. The proof can be simply derived from the separability of the multidimensional Fourier transform, which guarantees that the time complexity of the higher-dimensional Fourier transform depends only on the number of processed samples. There is no difference in the time complexity whether the higher-dimensional signal is elongated or in the shape of a cube.
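As a quick sanity check on these counts, the following sketch transcribes Eq. (28) so the effect of the tile counts *m*, *n* can be explored numerically. The function name, argument layout, and sample sizes are illustrative choices of ours, not part of the chapter:

```python
import numpy as np

def mults_3d_tiling(Nf, Ng, m, n):
    """Multiplication count of the tiled 3-D fast convolution, Eq. (28).
    Nf, Ng are (x, y, z) size triples of the two inputs f and g."""
    outer = np.prod([n * Nf[a] + m * Ng[a] for a in range(3)])
    inner = np.prod([Nf[a] / m + Ng[a] / n for a in range(3)])
    return outer * (4.5 * np.log2(inner) + 1)

# compare a few tilings of hypothetical 64^3 and 512^3 inputs
for m, n in [(1, 1), (2, 2), (4, 4)]:
    print(m, n, f"{mults_3d_tiling((64,) * 3, (512,) * 3, m, n):.3e}")
```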

#### **6.4. Parallelization**

#### 6.4.0.10. On multicore CPU.

As the majority of recent computers are equipped with multi-core CPUs, the following text is devoted to the parallelization of our approach on this architecture. Each such computer is equipped with two or more cores that share one memory. This means that executing two or more huge convolutions concurrently may simply fail due to lack of available memory. A possible workaround is to perform one more division, i.e. the signal and kernel tiles are further split into even smaller pieces. Let *p* be the number of sub-pieces the signal and the kernel tiles are split into, and let *P* be the number of available processors. If we execute the individual convolutions in parallel, we get the overall number of multiplications

$$\frac{npN^f + mpN^g}{P} \left[ \frac{9}{2} \log_2 \left( \frac{N^f}{mp} + \frac{N^g}{np} \right) + 1 \right] \tag{29}$$

and the space requirements

$$\left(\frac{N^f}{mp} + \frac{N^g}{np}\right) \cdot C \cdot P \tag{30}$$

Let us study the relationship *p* versus *P*:

• *p* < *P* . . . The space complexity becomes worse than in the original non-parallelized version (26). Hence, there is no advantage of using this approach.

• *p* = *P* . . . The space complexity is the same as in the original approach. The time complexity is slightly better, but in practice it brings no advantage due to the large number of memory accesses. The efficiency of this approach would become evident only if *P* ≫ 1. As standard multi-core processors are typically equipped with only 2, 4 or 8 cores, this approach was not found to be very useful either.

• *p* > *P* . . . There are no additional memory requirements. However, the signal and kernel are split into too small pieces. We have to handle a large number of overlapping tiles, which causes the time complexity (29) to become worse than in the non-parallelized case (25).

#### 6.4.0.11. On computer clusters.

Regarding computer clusters, the problem with one shared memory is solved as each computer has its private memory. Therefore, the total number of multiplications (see Eq. (25)) is modified by the factor *B*/*P*, where *P* is the number of available computers and *B* is a constant representing the overheads and the cost of data transmission among the individual computers. The computation becomes effective only if *P* > *B*. The memory requirements for each node remain the same as in the non-parallelized case, as each computer takes care of its own private memory space.
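As a concrete, minimal sketch of this division of labour, independent tile convolutions can be distributed over *P* worker processes so that each worker holds only one tile at a time. The helper names below are ours, and for brevity only the signal is tiled (plain overlap-add), whereas the chapter's scheme tiles signal and kernel alike:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def convolve_tile(args):
    """Fast (FFT-based) convolution of one signal tile with the kernel.
    Zero-padding to the full linear length avoids circular wrap-around."""
    sig_tile, ker = args
    n = sig_tile.size + ker.size - 1
    return np.fft.irfft(np.fft.rfft(sig_tile, n) * np.fft.rfft(ker, n), n)

def tiled_convolution(signal, kernel, n_tiles, workers):
    """Overlap-add: convolve each signal tile independently, sum at offsets."""
    tiles = np.array_split(signal, n_tiles)
    offsets = np.cumsum([0] + [t.size for t in tiles[:-1]])
    out = np.zeros(signal.size + kernel.size - 1)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(convolve_tile, [(t, kernel) for t in tiles])
        for off, part in zip(offsets, parts):
            out[off:off + part.size] += part
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sig, ker = rng.standard_normal(4096), rng.standard_normal(257)
    assert np.allclose(tiled_convolution(sig, ker, 8, 4), np.convolve(sig, ker))
```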

#### **7. Decomposition in the frequency domain**

Just as with the decomposition in the spatial (time) domain, a decomposition in the frequency domain can be used for the fast convolution algorithm in order to (i) decrease the required amount of memory available per processing unit, and (ii) employ multiple processing units without the need for extensive data transfers between the processors. In the following text, we introduce the concept of the decomposition [21] along with optimization steps suitable for purely real data [71]. Subsequently, we present the results achieved on current graphics hardware. Finally, we summarize the applications and architectures where the approach can be used.

#### **7.1. Decomposition using the DIF algorithm**

In Section 5.3, the decimation-in-frequency algorithm was recalled. The DIF can be used not only to compute the FFT itself but also to decompose the fast convolution. This algorithm can be divided into several phases, namely (i) the so-called *decomposition* into parts using Eq. (17), (ii) the Fourier transforms of the parts, (iii) the convolution by pointwise multiplication itself, (iv) the inverse Fourier transforms, and (v) the so-called *composition* using Eq. (18). In the following paragraph, we provide the mathematical background for the individual phases. The scheme description of the algorithm is shown in Fig. 6(a).

By employing Eq. (17), both the input signal *f* and the kernel *g* can be divided into sub-parts *fr*, *fs* and *gr*, *gs*, respectively. As the Fourier transforms *Fr* and *Fs* satisfy *Fr*(*k*′) = *F*(2*k*′) and *Fs*(*k*′) = *F*(2*k*′ + 1), and the equivalent property holds for *Gr* and *Gs*, by applying the FFT to *fr*, *fs*, *gr*, and *gs* individually, we obtain two separate parts of the Fourier transforms of both the signal and the kernel. Subsequently, by computing the point-wise multiplications *Hr* = *FrGr* and *Hs* = *FsGs*, we obtain two separate parts of the Fourier transform of the convolution *h* = *f* ∗ *g*. Finally, the result *h* is obtained by applying Eq. (18) to the inverse Fourier transforms *hr* and *hs*.
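Eqs. (17) and (18) themselves are not reproduced in this excerpt, so the sketch below substitutes the standard radix-2 DIF split and its inverse in their place; it is a minimal single-process illustration of phases (i)–(v) for circular convolution in numpy, not the authors' distributed implementation:

```python
import numpy as np

def dif_decomposed_convolution(f, g):
    """Circular convolution of f and g (even length N) via the radix-2 DIF
    decomposition; each (fr, gr)/(fs, gs) half could live on its own node."""
    N = f.size
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors W_N^n

    # (i) decomposition: FFTs of these sub-parts are the even/odd bins
    fr, fs = f[:N//2] + f[N//2:], (f[:N//2] - f[N//2:]) * W
    gr, gs = g[:N//2] + g[N//2:], (g[:N//2] - g[N//2:]) * W

    # (ii)-(iv) fully independent per part: FFT, point-wise product, inverse FFT
    hr = np.fft.ifft(np.fft.fft(fr) * np.fft.fft(gr))
    hs = np.fft.ifft(np.fft.fft(fs) * np.fft.fft(gs))

    # (v) composition: recover the two halves of h (multiply by W_N^{-n})
    hs = hs / W
    return np.concatenate(((hr + hs) / 2, (hr - hs) / 2))

rng = np.random.default_rng(0)
f, g = rng.standard_normal(16), rng.standard_normal(16)
ref = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g))   # classical fast convolution
assert np.allclose(dif_decomposed_convolution(f, g), ref)
```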

In the first and the last phase, it is inevitable to store the whole input signals in the memory. Here, the memory requirements are equal to those of the classical fast convolution algorithm. However, in the phases (ii)–(iv), which are by far the most computationally extensive, the memory requirements are inversely proportional to the number of parts *d* the signals are divided into. The algorithm is hence suitable for architectures with a star topology, where the central node is relatively slow but has large memory, and the end nodes are fast but have small memory. A powerful desktop PC with one or several GPU cards is a typical example of such an architecture.

**Figure 6.** A scheme description of the convolution algorithm with the decomposition in the frequency domain [71]: (a) DIF decomposition; (b) DIF decomposition with the optimization for real data. An input signal is decomposed into 2 parts by the decimation-in-frequency (DIF) algorithm. The parts are subsequently processed independently using the discrete Fourier transform (DFT).


It can be noted that the decimation-in-time (DIT) algorithm can also be used for the purpose of decomposing the convolution problem. However, its properties make it sub-efficient for practical use. Firstly, its time complexity is comparable with that of the DIF. Secondly, and most importantly, it requires significantly more data transfers between the central and end nodes. In Section 7.5, the complexity of the individual algorithms is analysed in detail.

#### **7.2. Optimization for purely real signals**

In most practical applications, users work with purely real input signals. As described in Section 5.1, the Fourier transform is complex but satisfies specific properties when applied to such data. Therefore, it is reasonable to optimize the fast convolution algorithm in order to reduce both the time and the memory complexity. In the following paragraphs, we will describe three fundamental approaches to optimizing the fast convolution of real signals.

Real-to-complex FFT.

As described in Section 5.4, most popular FFT implementations offer specialized functions for the FFT of purely real input data. With the classical fast convolution, users are advised to use the specific functions of their preferred FFT library. With the DIF decomposition, however, it is no longer possible to use such functions, as the decomposed signals are no longer real.
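As one concrete example of such specialized functions, numpy's real-to-complex pair stores only about half the spectrum of a real signal; a minimal sketch:

```python
import numpy as np

f = np.random.default_rng(3).standard_normal(1024)
F = np.fft.rfft(f)                # real-to-complex FFT: N/2 + 1 bins, not N
g = np.fft.irfft(F, n=f.size)     # complex-to-real inverse
assert np.allclose(f, g)
```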


Combination of signal and kernel.

It is possible to combine the two real input signals *f*(*n*) and *g*(*n*), *n* = 0, 1, . . . , *N* − 1, into one complex signal *f*(*n*) + *jg*(*n*) of the same length. However, this operation requires an additional buffer of length at least *N*. This poses significantly higher demands on the memory available at the central node.
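A minimal sketch of this combination trick; the symmetry-based unpacking below is the standard identity for real inputs, not code from the chapter. One complex FFT of *f* + *jg* yields both spectra:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 64
f, g = rng.standard_normal(N), rng.standard_normal(N)

Z = np.fft.fft(f + 1j * g)        # one FFT for two real signals
k = np.arange(N)
Zr = np.conj(Z[(-k) % N])         # Z*(N - k), index taken modulo N

F, G = (Z + Zr) / 2, (Z - Zr) / (2j)
assert np.allclose(F, np.fft.fft(f)) and np.allclose(G, np.fft.fft(g))
```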

"Complexification" of input signals.

Provided that the length *N* of a real input signal *f* is even, we can introduce a complex signal *f̂*(*n*′) ≡ *f*(2*n*′) + *j f*(2*n*′ + 1) for any *n*′ = 0, 1, . . . , *N*/2 − 1. As the most common way of storing complex signals is to store the real and imaginary components alternately, a real signal can be turned into a complex one by simply over-casting the data type, avoiding any computations or data transfers. The relationship between the Fourier transforms *F* and *F̂* is given by the following:

$$F(k') = \frac{1}{2}\left(\alpha_+(k') - jW_N^{k'}\alpha_-(k')\right), \qquad F(k' + N/2) = \frac{1}{2}\left(\alpha_+(k') + jW_N^{k'}\alpha_-(k')\right), \tag{31}$$

where


$$\alpha_\pm(k') \equiv \hat{F}(k') \pm \hat{F}^*(N/2 - k'). \tag{32}$$

As the third approach yields the best performance, it is used in the final version of the algorithm. The computation of Eqs. (31) and (32) will be further referred to as the *recombination* phase. The scheme description of the algorithm is shown in Fig. 6(b).
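A minimal numpy transcription of the complexification and recombination steps; the function name is ours, and the modular index `(-k) % (N // 2)` realizes the term F̂*(N/2 − k′) from Eq. (32):

```python
import numpy as np

def rfft_via_complexification(f):
    """FFT of a real signal f (even length N) from one complex FFT of
    length N/2, via 'complexification' and the recombination Eq. (31)-(32)."""
    N = f.size
    fh = f[0::2] + 1j * f[1::2]              # over-cast even/odd pairs: f_hat
    Fh = np.fft.fft(fh)                       # single length-N/2 complex FFT
    k = np.arange(N // 2)
    Fh_rev = np.conj(Fh[(-k) % (N // 2)])     # F_hat*(N/2 - k'), Eq. (32)
    a_plus, a_minus = Fh + Fh_rev, Fh - Fh_rev
    W = np.exp(-2j * np.pi * k / N)           # twiddle W_N^{k'}
    F_lo = 0.5 * (a_plus - 1j * W * a_minus)  # Eq. (31): F(k')
    F_hi = 0.5 * (a_plus + 1j * W * a_minus)  # Eq. (31): F(k' + N/2)
    return np.concatenate((F_lo, F_hi))

f = np.random.default_rng(1).standard_normal(32)
assert np.allclose(rfft_via_complexification(f), np.fft.fft(f))
```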

#### **7.3. Getting further**

The algorithm can be used not only in 1D but generally for any *n*-dimensional input signals. To achieve maximum data transfer efficiency, it is advisable to perform the decomposition in the first (*y* in 2D or *z* in 3D) axis so that the individual sub-parts form undivided memory blocks, as explained in [21].

Furthermore, the input data can be decomposed into generally *d* parts using an appropriate radix-*d* algorithm in both the decomposition and the composition phase. It should be noted, however, that due to the recombination phase, the algorithm requires twice as much memory space per end node for *d* > 2. This is due to the fact that some of the parts need to be recombined with others (refer to Fig. 6(b)). To be more precise, the memory requirements are 2(*N<sup>f</sup>* + *N<sup>g</sup>*)/*d* for *d* = 2 and 4(*N<sup>f</sup>* + *N<sup>g</sup>*)/*d* for *d* > 2.

#### **7.4. GPU and multi-GPU implementation**

As Nvidia provides users with the CUFFT library [32] for the efficient computation of the fast Fourier transform, the GPU implementation of the aforementioned algorithm is quite straightforward. The scheme description of the implementation is shown in Fig. 7. It should be noted that a significant part of the computation time is spent on the data transfers between the computing nodes (the CPU and the GPU, in this case). The algorithm is designed to keep the number of data transfers as low as possible. Nevertheless, it is highly recommendable to overlap the data transfers with some computation phases in order to keep the implementation as efficient as possible.


**Figure 7.** A scheme description of the proposed algorithm for the convolution with the decomposition in the frequency domain, implemented on GPU [21]. The example shows the decomposition into 4 parts.

**Figure 8.** A model timeline of the algorithm workflow [21]. The dark boxes denote data transfers between CPU and GPU, while the light boxes represent convolution computations. The first row shows the single-GPU implementation. The second row depicts the parallel usage of two GPUs. The data transfers are performed concurrently but through a common bus; therefore they last twice as long. For the third row, the data transfers are synchronized so that only one transfer is made at a time. In the last row, the data transfers are overlapped with the convolution execution.


To prove the importance of the overlapping, we provide a detailed analysis of the algorithm workflow. The overall computation time *T* required by the algorithm can be expressed as follows:

$$T = \max(t_p + t_d, t_a) + t_{h \to d} + \frac{t_{\mathrm{conv}}}{P} + t_{d \to h} + t_c, \tag{33}$$

where *t<sub>p</sub>* is the time required for the initial signal padding, *t<sub>d</sub>* for the decomposition, *t<sub>a</sub>* for allocating memory and setting up FFT plans on the GPU, *t<sub>h→d</sub>* for data transfers from the CPU


to the GPU (host to device), *t*<sub>conv</sub> for the convolution including the FFT, the recombination phase, the point-wise multiplication, and the inverse FFT, *t<sub>d→h</sub>* for data transfers from GPU to CPU (device to host), and finally *t<sub>c</sub>* for the composition. The number of end nodes (GPU cards) is denoted by *P*. It is evident that, in accordance with the famous Amdahl's law [72], the speed-up achieved by multiple end nodes is limited by the only parallel phase of the algorithm, which is the convolution itself. Now, if the data are decomposed into *d* parts and sent to *P* end units and if *d* > *P* > 1, the data transfers can be overlapped with the convolution phase. This means that the real computation time is shorter than *T* in Eq. (33); Eq. (33) can hence be viewed as an upper limit. The model example is shown in Fig. 8.
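A small helper makes the upper bound of Eq. (33) easy to explore; the timing values below are hypothetical, purely for illustration:

```python
def total_time(t_p, t_d, t_a, t_h2d, t_conv, t_d2h, t_c, P):
    """Upper bound on the decomposed-convolution runtime, Eq. (33).
    Only the convolution phase scales with the number of end nodes P."""
    return max(t_p + t_d, t_a) + t_h2d + t_conv / P + t_d2h + t_c

# hypothetical timings in seconds; only t_conv/P shrinks with more GPUs
for P in (1, 2, 4):
    print(P, total_time(0.1, 0.2, 0.15, 0.5, 4.0, 0.5, 0.2, P))
```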

#### **7.5. Algorithm comparison**


In the previous text, we mentioned three approaches to the decomposition of the fast convolution: tiling (decomposition in the time domain), the DIF-based, and the DIT-based algorithm. For a fair comparison of the three, we compute the number of arithmetic operations, the number of data transfers, and the memory requirements per end node, with respect to the input signal length and the parameter *d*, i.e. the number of parts the data are divided into. For the tiling method, the computation is based on Eq. (27) while setting *d* = *m* = *n* (the optimum case). The results are shown in Table 1.


| Method | # of operations | # of data transfers required | Memory per node |
|---|---|---|---|
| **DIF** | (*N<sup>f</sup>* + *N<sup>g</sup>*) [9/2 log<sub>2</sub>(*N<sup>f</sup>* + *N<sup>g</sup>*) + 1] | 3(*N<sup>f</sup>* + *N<sup>g</sup>*) | 4(*N<sup>f</sup>* + *N<sup>g</sup>*)/*d* |
| **DIT** | (*N<sup>f</sup>* + *N<sup>g</sup>*) [9/2 log<sub>2</sub>(*N<sup>f</sup>* + *N<sup>g</sup>*) + 2] | (*d* + 1)(*N<sup>f</sup>* + *N<sup>g</sup>*) | 4(*N<sup>f</sup>* + *N<sup>g</sup>*)/*d* |
| **Tiling** | *d*(*N<sup>f</sup>* + *N<sup>g</sup>*) [9/2 log<sub>2</sub>((*N<sup>f</sup>* + *N<sup>g</sup>*)/*d*) + 1] | (*d* + 1)(*N<sup>f</sup>* + *N<sup>g</sup>*) | (*N<sup>f</sup>* + *N<sup>g</sup>*)/*d* |

**Table 1.** Methods for decomposition of the fast convolution and their requirements

To conclude the results, it can be noted that the tiling method is the best one in terms of memory demands: it requires 4× less memory per end node than the DIF-based and the DIT-based algorithms. On the other hand, both its number of operations and its number of data transfers depend on the parameter *d*, which is not the case for the DIF-based method. By dividing the data into more sub-parts, the memory requirements of the DIF-based algorithm decrease while the number of operations and memory transactions remain constant. Hence, the DIF-based algorithm can generally be more efficient than tiling.
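For experimenting with these trade-offs, the Table 1 entries (as reconstructed above) can be transcribed into simple cost functions; the names and sample sizes are illustrative:

```python
import numpy as np

# Operation counts and per-node memory from Table 1.
# Nf, Ng: padded input sizes; d: number of parts.
ops = {
    "DIF":    lambda Nf, Ng, d: (Nf + Ng) * (4.5 * np.log2(Nf + Ng) + 1),
    "DIT":    lambda Nf, Ng, d: (Nf + Ng) * (4.5 * np.log2(Nf + Ng) + 2),
    "Tiling": lambda Nf, Ng, d: d * (Nf + Ng) * (4.5 * np.log2((Nf + Ng) / d) + 1),
}
mem = {
    "DIF":    lambda Nf, Ng, d: 4 * (Nf + Ng) / d,
    "DIT":    lambda Nf, Ng, d: 4 * (Nf + Ng) / d,
    "Tiling": lambda Nf, Ng, d: (Nf + Ng) / d,
}

Nf = Ng = 2 ** 24
for d in (2, 4, 8):
    for method in ("DIF", "DIT", "Tiling"):
        print(d, method,
              f"ops={ops[method](Nf, Ng, d):.3e}",
              f"mem={mem[method](Nf, Ng, d):.3e}")
```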

#### **7.6. Applications and architectures**

Both the tiling and the DIF-based algorithm can be used to allow the computation of the fast convolution in applications where the convolving signals are multi-dimensional and/or contain too many samples to be handled efficiently on a single computer. We have already mentioned the application to optical microscopy data, where the convolution is used to simulate the image degradation introduced by an optical system. Using the decomposition methods, the computation can be distributed over (a) a computer grid, or (b) multiple CPU and GPU units, where the CPU is usually provided with more memory and hence is used as a central node for the (de)composition of the data.



#### **8. Conclusions**

In this text, we introduce the convolution as an important tool in both signal and image processing. In the first part, we mention some of the most popular applications it is employed in and recall its mathematical definition. Subsequently, we present a number of common algorithms for the efficient computation of the convolution on various architectures. The simplest approach, the so-called naïve convolution, is to perform the convolution directly from the definition. Although it is less efficient than the other algorithms, it is the most general one and is popular in some specific applications where small convolution kernels are used, such as edge or object detection. If the convolution kernel is multi-dimensional and can be expressed as a convolution of several 1-D kernels, the naïve convolution is usually replaced by its alternative, the so-called separable convolution. The lowest time complexity can be achieved by using recursive filtering. Here, the result of the convolution at each position is obtained by applying a few arithmetic operations to the previous result. Besides the efficiency, the advantage is that these filters are suitable for streaming architectures such as FPGAs. On the other hand, this method is generally not suitable for all convolution kernels, as the recursive filters are often numerically unstable and inaccurate. The last algorithm presented in the chapter is the fast convolution. According to the so-called convolution theorem, the convolution can be computed in the frequency domain by a simple point-wise multiplication of the Fourier transforms of the input signals. This approach is the most suitable for long signals and kernels as it generally yields the best time complexity. However, it has non-trivial memory demands caused by the fact that the input data need to be padded.
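As a small illustration of the separable case (scipy's 1-D convolution is used here for convenience; this is not the chapter's implementation), a 2-D binomial kernel applied as two cheap 1-D passes:

```python
import numpy as np
from scipy.ndimage import convolve1d

# A separable 2-D kernel: outer product of two identical 1-D binomial kernels.
k1 = np.array([1., 4., 6., 4., 1.])
k1 /= k1.sum()

img = np.random.default_rng(2).standard_normal((256, 256))
tmp = convolve1d(img, k1, axis=0)   # first 1-D pass, along columns
out = convolve1d(tmp, k1, axis=1)   # second 1-D pass, along rows
# equivalent (up to boundary handling) to one 2-D pass with np.outer(k1, k1),
# at O(2k) instead of O(k^2) operations per pixel
```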

Therefore, in the second part of the chapter, we describe two approaches to reducing the memory requirements of the fast convolution. The first one, so-called tiling, is performed in the spatial (time) domain. It is the most efficient with respect to the memory requirements. However, with a higher number of sub-parts the input data are divided into, both the number of arithmetic operations and the number of potential data transfers increase. Hence, in some applications or on some architectures (such as a desktop PC with one or multiple graphics cards) where the overhead of data transfers is critical, one can use a different approach, based on the decimation-in-frequency (DIF) algorithm, which is widely known from the concept of the fast Fourier transform. We also mention a third method, based on the decimation-in-time (DIT) algorithm. However, the DIT-based algorithm is sub-efficient from every point of view, so there is no reason for it to be used instead of the DIF-based one. At the end of the chapter, we also provide a detailed analysis of (i) the number of arithmetic operations, (ii) the number of data transfers, and (iii) the memory requirements for each of the three methods.

As the convolution is one of the most extensively studied operations in signal processing, the list of the algorithms and implementations mentioned in this chapter is not and cannot be complete. Nevertheless, we tried to include those that we consider to be the most popular and widely used. We also believe that the decomposition tricks which are described in the second part of the chapter, and which are the subject of the authors' original research, can help readers to improve their own applications, regardless of the target architecture.


#### **Acknowledgments**


This work has been supported by the Grant Agency of the Czech Republic (Grant No. P302/12/G157).

#### **Author details**

Pavel Karas<sup>⋆</sup> and David Svoboda

<sup>⋆</sup> Address all correspondence to: xkaras1@fi.muni.cz

Centre for Biomedical Image Analysis, Faculty of Informatics, Masaryk University, Brno, Czech Republic

#### **References**


[1] J. Jan. *Digital Signal Filtering, Analysis and Restoration (Telecommunications Series)*. INSPEC, Inc., 2000.

[2] S. W. Smith. *Digital Signal Processing*. Newnes, 2003.

[3] A. Foi. Noise estimation and removal in MR imaging: The variance stabilization approach. In *IEEE International Symposium on Biomedical Imaging: from Nano to Macro*, pages 1809–1814, 2011.

[4] J. R. Parker. *Algorithms for Image Processing and Computer Vision*. Wiley Publishing, 2nd edition, 2010.

[5] J. Canny. A computational approach to edge detection. *IEEE T-PAMI*, 8:679–698, 1986.

[6] D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. *Pattern Recognition*, 13(2):111–122, 1981.

[7] D. Salomon. *Data Compression: The Complete Reference*. Springer-Verlag, 2007.

[8] R. C. Gonzalez and R. E. Woods. *Digital Image Processing*. Prentice Hall, 2002. ISBN: 0-201-18075-8.

[9] K. R. Castleman. *Digital Image Processing*. Prentice Hall, 1996.

[10] P. J. Verveer. Computational and optical methods for improving resolution and signal quality in fluorescence microscopy. PhD thesis, 1998.

[11] A. Lehmussola, J. Selinummi, P. Ruusuvuori, A. Niemistö, and O. Yli-Harja. Simulating fluorescent microscope images of cell populations. In *Proceedings of the 27th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'05)*, pages 3153–3156, 2005.

[12] D. Svoboda, M. Kozubek, and S. Stejskal. Generation of digital phantoms of cell nuclei and simulation of image formation in 3D image cytometry. *Cytometry Part A*, 75A(6):494–509, Jun 2009.

[13] W. K. Pratt. *Digital Image Processing*. Wiley, 3rd edition, 2001.

[14] T. Bräunl. *Parallel Image Processing*. Springer, 2001.

[15] H.-M. Yip, I. Ahmad, and T.-C. Pong. An efficient parallel algorithm for computing the Gaussian convolution of multi-dimensional image data. *The Journal of Supercomputing*, 14(3):233–255, 1999. ISSN: 0920-8542.

[16] O. Schwarzkopf. Computing convolutions on mesh-like structures. In *Proceedings of the Seventh International Parallel Processing Symposium*, pages 695–699, 1993.

[17] S. Kadam. Parallelization of low-level computer vision algorithms on clusters. In *AMS '08: Proceedings of the 2008 Second Asia International Conference on Modelling & Simulation*, pages 113–118, Washington, DC, USA, 2008. IEEE Computer Society. ISBN: 978-0-7695-3136-6.

[18] B. Jähne. *Digital Image Processing*. Springer, 5th edition, 2002.

[19] R. Hummel and D. Loew. Computing large-kernel convolutions of images. Technical report, New York University, Courant Institute of Mathematical Sciences, 1986.

[20] R. N. Bracewell. *Fourier Analysis and Imaging*. Springer, 2006.

[21] P. Karas and D. Svoboda. Convolution of large 3D images on GPU and its decomposition. *EURASIP Journal on Advances in Signal Processing*, 2011(1):120, 2011.

[22] A. V. Oppenheim, R. W. Schafer, J. R. Buck, et al. *Discrete-Time Signal Processing*, volume 2. Prentice Hall, Upper Saddle River, NJ, 1989.

[23] D. Svoboda. Efficient computation of convolution of huge images. In *Image Analysis and Processing – ICIAP 2011*, pages 453–462, 2011.

[24] R. G. Shoup. Parameterized convolution filtering in an FPGA. In *Selected Papers from the Oxford 1993 International Workshop on Field Programmable Logic and Applications on More FPGAs*, pages 274–280, Oxford, UK, 1994. Abingdon EE&CS Books.

[25] A. Benedetti, A. Prati, and N. Scarabottolo. Image convolution on FPGAs: the implementation of a multi-FPGA FIFO structure. In *Euromicro Conference, 1998. Proceedings. 24th*, volume 1, pages 123–130, Aug 1998.

[26] S. Perri, M. Lanuzza, P. Corsonello, and G. Cocorullo. A high-performance fully reconfigurable FPGA-based 2D convolution processor. *Microprocessors and Microsystems*, 29(8–9):381–391, 2005. Special issue on FPGAs: Case Studies in Computer Vision and Image Processing.

[27] A. Herout, P. Zemcik, M. Hradis, R. Juranek, J. Havel, R. Josth, and L. Polok. *Low-Level Image Features for Real-Time Object Detection*. InTech, 2010.

[28] H. Shan and N. A. Hazanchuk. Adaptive edge detection for real-time video processing using FPGAs. Application notes, Altera Corporation, 2005.

[29] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. pages 21–51, August 2005.

[30] D. Castaño-Díez, D. Moser, A. Schoenegger, S. Pruggnaller, and A. S. Frangakis. Performance evaluation of image processing algorithms on the GPU. *Journal of Structural Biology*, 164(1):153–160, 2008.

[31] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In *PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming*, pages 73–82, New York, NY, USA, 2008. ACM.

[32] NVIDIA Developer Zone. http://developer.nvidia.com/category/zone/cuda-zone, Apr 2012.

[33] Khronos Group. OpenCL. http://www.khronos.org/opencl/, 2011.

[34] CUDA Downloads. http://developer.nvidia.com/cuda-downloads, Apr 2012.

[35] V. Podlozhnyuk. Image convolution with CUDA. http://developer.download.nvidia.com/assets/cuda/files/convolutionSeparable.pdf, Jun 2007.

[36] NVIDIA Performance Primitives. http://developer.nvidia.com/npp, Feb 2012.

[37] Y. Luo and R. Duraiswami. Canny edge detection on NVIDIA CUDA. In *Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on*, pages 1–8, Jun 2008.

[38] K. Ogawa, Y. Ito, and K. Nakano. Efficient Canny edge detection using a GPU. In *Networking and Computing (ICNC), 2010 First International Conference on*, pages 279–280, Nov 2010.

[39] A. Herout, R. Jošth, R. Juránek, J. Havel, M. Hradiš, and P. Zemčík. Real-time object detection on CUDA. *Journal of Real-Time Image Processing*, 6:159–170, 2011. DOI: 10.1007/s11554-010-0179-0.

[40] K. Zhang, J. Lu, G. Lafruit, R. Lauwereins, and L. Van Gool. Real-time accurate stereo with bitwise fast voting on CUDA. In *IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops)*, pages 794–800, Oct 2009.

[41] W. Chen, M. Beister, Y. Kyriakou, and M. Kachelries. High performance median filtering using commodity graphics hardware. In *Nuclear Science Symposium Conference Record (NSS/MIC), 2009 IEEE*, pages 4142–4147, Nov 2009.

[42] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A performance study of general-purpose applications on graphics processors using CUDA. *Journal of Parallel and Distributed Computing*, 68(10):1370–1380, 2008.

[43] Z. Wei, D.-J. Lee, B. E. Nelson, J. K. Archibald, and B. B. Edwards. FPGA-based embedded motion estimation sensor. 2008.

[44] X. Wang and B. E. Shi. GPU implementation of fast Gabor filters. In *Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on*, pages 373–376, Jun 2010.

[45] O. Fialka and M. Čadík. FFT and convolution performance in image filtering on GPU. In *Information Visualization, 2006. IV 2006. Tenth International Conference on*, pages 609–614, 2006.

[46] J. S. Jin and Y. Gao. Recursive implementation of LoG filtering. *Real-Time Imaging*, 3(1):59–65, February 1997.

[47] R. Deriche. Using Canny's criteria to derive a recursively implemented optimal edge detector. *The International Journal of Computer Vision*, 1(2):167–187, May 1987.

[48] I. T. Young and L. J. van Vliet. Recursive implementation of the Gaussian filter. *Signal Processing*, 44(2):139–151, 1995.

[49] F. G. Lorca, L. Kessal, and D. Demigny. Efficient ASIC and FPGA implementations of IIR filters for real time edge detection. In *Image Processing, 1997. Proceedings., International Conference on*, volume 2, pages 406–409, Oct 1997.

[50] R. D. Turney, A. M. Reza, and J. G. R. Delva. FPGA implementation of adaptive temporal Kalman filter for real time video filtering. In *Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on*, volume 4, pages 2231–2234, Mar 1999.

[51] J. Diaz, E. Ros, F. Pelayo, E. M. Ortigosa, and S. Mota. FPGA-based real-time optical-flow system. *Circuits and Systems for Video Technology, IEEE Transactions on*, 16(2):274–279, Feb 2006.

[52] J. Robelly, G. Cichon, H. Seidel, and G. Fettweis. Implementation of recursive digital filters into vector SIMD DSP architectures. In *Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on*, volume 5, pages V-165–V-168, May 2004.

[53] F. Trebien and M. Oliveira. Realistic real-time sound re-synthesis and processing for interactive virtual worlds. *The Visual Computer*, 25:469–477, 2009. DOI: 10.1007/s00371-009-0341-5.

[54] E. O. Brigham and R. E. Morrow. The fast Fourier transform. *Spectrum, IEEE*, 4(12):63–70, 1967.

[55] H. J. Nussbaumer. *Fast Fourier Transform and Convolution Algorithms*. Springer Series in Information Sciences, volume 2. Springer-Verlag, Berlin and New York, 1982.

[56] D. Fraser. Array permutation by index-digit permutation. *J. ACM*, 23(2):298–309, April 1976.

[57] G. U. Ramos. Roundoff error analysis of the fast Fourier transform. *Math. Comp.*, 25:757–768, 1971.

[58] R. N. Bracewell. *The Fourier Transform and Its Applications*. McGraw-Hill, 3rd edition, 2000.

[59] F. J. Harris. On the use of windows for harmonic analysis with the discrete Fourier transform. *Proceedings of the IEEE*, 66(1):51–83, 1978.

[60] J. W. Cooley and J. W. Tukey. An algorithm for the machine calculation of complex Fourier series. *Math. Comput.*, 19(90):297–301, 1965.

[61] M. Frigo and S. G. Johnson. The Fastest Fourier Transform in the West. 1997.

[62] M. Frigo and S. G. Johnson. benchFFT. http://www.fftw.org/benchfft/, 2012.

[63] Intel Integrated Performance Primitives. http://software.intel.com/en-us/articles/intel-ipp/, 2012.

[64] Intel Integrated Performance Primitives. http://software.intel.com/en-us/articles/intel-mkl/, 2012.

[65] A. Nukada, Y. Ogata, T. Endo, and S. Matsuoka. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In *SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing*, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.

[66] N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In *SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing*, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press.

[67] R. Tsuchiyama, T. Nakamura, T. Iizuka, A. Asahara, and S. Miki. *The OpenCL Programming Book*. Group, 2009.

[68] Z. Li, H. Sorensen, and C. Burrus. FFT and convolution algorithms on DSP microprocessors. In *Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '86*, volume 11, pages 289–292. IEEE, 1986.

[69] I. S. Uzun, A. Amira, and A. Bouridane. FPGA implementations of fast Fourier transforms for real-time signal and image processing. In *Vision, Image and Signal Processing, IEE Proceedings*, volume 152, pages 283–296. IET, 2005.

[70] M. Heideman, D. Johnson, and C. Burrus. Gauss and the history of the fast Fourier transform. *ASSP Magazine, IEEE*, 1(4):14–21, Oct 1984. ISSN: 0740-7467.

[71] P. Karas, D. Svoboda, and P. Zemčík. GPU optimization of convolution for large 3-D real images. In *Advanced Concepts for Intelligent Vision Systems (ACIVS), 2012*. Springer, 2012. Accepted.

[72] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In *Proceedings of the April 18-20, 1967, Spring Joint Computer Conference*, pages 483–485. ACM, 1967.

**Section 4**

**Advanced Architectures and Implementations**



**Chapter 9**

## **Self-Organizing Architectures for Digital Signal Processing**

Daniele Peri and Salvatore Gaglio

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/53334

#### **1. Introduction**

Technological bounds in digital circuit integration in the last decades have been fostering the development of massively parallel architectures for tasks that had not been touched before by traditional parallel paradigms. Even in personal computers, as well as in consumer and mobile devices, it is common to find powerful processing units composed of processing elements in the range of the hundreds to the thousands.

The demand for mobile devices that are self-powered, almost permanently switched on, and connected through wireless networks, as well as environmental friendliness constraints, obviously urges a reduction in the energy consumption of processing units.

On the other hand, applications continuously keep pushing computing power needs forward. A number of such applications are actually performed on application-specific or on almost general-purpose parallel multi-core units, as in the case of 3D graphics, sound processing, and the like, in the multimedia arena.

The current industrial trend aims to increase computing power and energy efficiency by adding cores to both main processors and specialized units. A number of experimental architectures have been proposed that try to achieve the same goal by exploiting different designs. Coarse and fine grained architectures and, more in general, reconfigurable architectures have been proposed to make the hardware adapt to the required tasks instead of using specialized software running on general purpose processing elements. This has especially been the case in computer vision, and intelligent systems in general.

More interestingly, in these fields, the quest for massively parallel and energy efficient hardware implementations, coupled with biological models of reference, may spur interest in reviewing both well and lesser studied approaches that are centered on self-organizing processing structures. Indeed, current research on pattern recognition shows significant interest in highly structured models built on large numbers of processing nodes trained by demanding algorithms that can, at least partially, be implemented in a parallel fashion.


In this chapter we provide a review of self-organization as it may appear at the various abstraction levels of computational architectures, including applications to real-world problems.

We start by outlining the properties related to complexity and self-organization in natural and artificial systems. Then we describe the computational models that are better suited to study self-organizing systems. We then discuss self-organization at the hardware level. Finally, we look at networked systems, paying particular attention to distributed sensing networks.

#### **2. Self-organization and self-organizing systems**

Human speculation on the visible order of things, either living or not, is so ancient as to be considered one of the fundamental questions of mankind. Science has always been exploring the complex structure of Nature, adding piece after piece to its infinite puzzle.

Meanwhile, technologies evolve benefiting from new findings, sometimes trying, either successfully or ingenuously, to duplicate Nature's work. Improvements in technologies then reflect on further scientific advancements, closing the loop.

Order, self-organization, adaptation, evolution, emergence and several other terms remind us that, as artificial systems advance and gain complexity, it is expected for them to be compared to natural ones in both structure and function.

At the time of the introduction of the vacuum-tube digital computer in the 1940s, McCulloch and Pitts had already proposed their neuron model, while cyberneticists were starting to recognize their interdisciplinary studies on natural and artificial systems as a brand new field.

With their simple and primitive circuits, made of a few thousand discrete components, digital computers were certainly "complex" with respect to the available technology, but orders of magnitude simpler and less structured than their biological computational counterparts, made of billions of neurons arranged by some "self-organization" process.

Still, it did not take long for Von Neumann to start exploring the theory of Cellular Automata and self-reproducing machines, attempting to bridge natural and artificial computational models.

Rosenblatt's perceptron was another attempt to propose a biologically inspired computational framework. As a confirmation of the difficulties in reverse-engineering Nature, it took a few decades for Artificial Neural Networks built with perceptrons to become viable means to tackle useful computational tasks.

Of all the connotations given to perceptron networks, "adaptive" has certainly been one of the most widely adopted; however, many of the terms cited above have found some kind of use with ANNs [1].


As for the term "self-organization" itself, it has been used so widely in the computer and information processing literature that the effort to determine its introduction is quite pointless.

One of the most well-known uses of the term can be traced back to Kohonen's "Self-Organizing Maps" (SOMs) [2]. Differently from perceptron-based Artificial Neural Networks, trained with supervised algorithms, Kohonen proposed an unsupervised method whose geometrical representation is that of a continuous rearrangement of points of the feature space around auto-determined cluster centers, represented as cells in a two-dimensional array. In the topological representation of the evolving map during learning, centers can be visualized as they move, forcing the two-dimensional map to stretch in the effort to cover the feature space. Even if SOMs are not derived from any biological model, some parallelism with the visual neocortex, both in terms of function and structure, has been drawn [3].
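To make the update rule concrete, here is a minimal sketch in Python (NumPy is assumed; the grid size, decay schedules and function names are our own illustrative choices, not Kohonen's original formulation). Each sample pulls the best-matching cell and its grid neighborhood towards itself, stretching the map over the feature space:

```python
import numpy as np

def train_som(data, rows=10, cols=10, epochs=20, lr0=0.5, sigma0=3.0):
    """Minimal Kohonen SOM: a rows x cols grid of weight vectors is
    stretched over the feature space by unsupervised, competitive updates."""
    rng = np.random.default_rng(0)
    weights = rng.random((rows, cols, data.shape[1]))   # random initial map
    # grid coordinates, used to compute the neighborhood of the winner
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            lr = lr0 * (1 - t / steps)                  # decaying learning rate
            sigma = sigma0 * (1 - t / steps) + 0.5      # shrinking neighborhood
            # winner (best matching unit): the cell closest to the sample
            d = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighborhood on the 2-D grid around the winner
            g = np.exp(-np.sum((grid - bmu) ** 2, axis=2) / (2 * sigma ** 2))
            # pull every cell towards the sample, weighted by the neighborhood
            weights += lr * g[..., None] * (x - weights)
            t += 1
    return weights

# toy usage: the map stretches to cover three Gaussian blobs in 2-D
rng = np.random.default_rng(1)
blobs = np.concatenate([rng.normal(c, 0.1, (100, 2))
                        for c in [(0, 0), (1, 0), (0.5, 1)]])
som = train_som(blobs)
```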

Given the impact of SOMs in machine learning and the excitement produced by a simple and effective unsupervised learning algorithm, it is not surprising that a large number of papers followed Kohonen's, and that research on SOMs is still actively carried on. Fritzke proposed structures that grow from a small number of cells [4], modulating the network structure according to the unknown probability distribution of the input.

Other research on SOMs, similarly to the evolution of multi-layered supervised neural networks, introduced some hierarchical organization, as Choi and Park did with their "Self-Creating and Organizing Neural Network" [5], or Rauber et al. with their "Growing Hierarchical SOM" [6]; or it followed the path of hardware implementation of SOMs, either in analog [7] or digital form [8]. Chacon-Murguia and Gonzalez-Duarte proposed a surveillance application that mixes SOMs with neuro-fuzzy networks to detect objects against dynamic backgrounds [9].

At some point, Dingle and Jones proposed the Chaotic Self-Organizing Feature Map [10], based on recurrent functions leading to chaotic behavior. Continuing this research, more recently Silva et al. proposed a self-organizing recursive architecture for continuous learning [11]. The importance of chaotic dynamics in self-organization will re-emerge in the discussion about computational frameworks.

The relevance of the previously cited work notwithstanding, "self-organization" capabilities, in the biological sense, should rather be attributed to systems capable of self-assembling from simple elementary units, finding their coordination through direct interactions governed by simple mathematical rules, intrinsic to Nature, it could be stated. Such systems should rather find their biological model in the "prebiotic soup", in which chemical interactions between atoms and then compounds led to the organization of cells, tissues and complex organisms.

Random Boolean Networks (RBNs) were originally introduced by Kauffman to model gene regulation mechanisms [12]. In Kauffman's Biology-centered view, evolution results from Darwinian selection mechanisms coupled with self-organization, adding the concept of "anti-chaos" to the already large and unsettled realm of complex systems.


#### **3. Computational models**

Cells in biological systems share the same genetic information; nevertheless, they differentiate on the basis of specific activation patterns of that very same information. From a computer engineering point of view, this is an upside-down perspective, as changing the software has been the way to make machines adapt to problems.

Indeed, the dichotomy of hardware and software has been the key to the evolution of early digital computers, making it possible to get rid of the hand-wired logic of primordial calculators. Incidentally, if we set aside for a moment most of the involved technological considerations, Von Neumann's pioneering work on self-reproducing automata was a fifty-year forward leap to meet biological research at some common point.

More or less in the same years as Kauffman, Wolfram meticulously described the complex behavior of one-dimensional cellular automata (Figure 1), pointing out the emergence of self-organization [13, 14].

**Figure 1.** Simple one-dimensional binary cellular automata with a two-cell neighborhood are named after the code introduced by Wolfram. The eight possible states for each cell and its neighborhood are arranged from right to left according to the decimal interpretation of the three current bit values. The eight bits describing the future states are then interpreted as a decimal number. The code may be generalized to elementary cellular automata with any number of states, dimensionality, and neighborhood size.
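The coding scheme in the caption translates directly into a lookup: the three current bits form an index into the binary expansion of the rule number. The sketch below is an illustrative Python rendering (the periodic boundary and the choice of rule 30 are our own assumptions):

```python
def step(cells, rule):
    """One synchronous update of an elementary cellular automaton: for each
    cell, the bits (left, self, right) form an index 0..7 into `rule`."""
    n = len(cells)
    return [(rule >> ((cells[(i - 1) % n] << 2)     # left neighbor
                      | (cells[i] << 1)             # the cell itself
                      | cells[(i + 1) % n])) & 1    # right neighbor
            for i in range(n)]

# rule 30 grown from a single seed cell, printed as text
row = [0] * 31
row[15] = 1
for _ in range(15):
    print("".join(".#"[c] for c in row))
    row = step(row, 30)
```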

Coincidentally, at that time the influence of Mandelbrot's work on fractals was at its peak, as was the interest in simple formulae able to produce results so similar to those of natural processes [15]. That was certainly a rather inspiring time for those who are subject to the fascination of complexity arising from simplicity; indeed, this feeling has been pervading research in information systems for decades.

It was also the time of the advent of networking and of the Internet, though it would take a few more years for the Web to be brought to life. The Internet bears the mark of a "self-organizing" system well in its roots, and in some way even in its name. Then, in a short time lapse, wireless networks broadened the communication horizon once again, providing us with mobile systems.

The realm of computers has thus reached a point where interconnected systems at the macro scale coexist with the micro scale of the circuits they are built upon, while the nano-dimensionality is being intensively explored. Compared to the many scales adopted to observe biological systems at their molecular, cellular, tissue and macroscopic levels, this is still a very coarse and rigid stratification; nevertheless, it provides an interesting parallelism and some points to look at in the distance.

Either in the biological or in the computer realm, more levels bring more complexity. Systems are of different kinds, and some "handshaking", as in the networking jargon, is needed to allow communication. Systems need to share resources, and then some sort of arbitration is needed. A large part of the engineering of information systems has thus become the design of communication and arbitration protocols to make systems "self-organize".

Heylighen and Gershenson invoked self-organization in computers as a necessity to cope with an increasing complexity in information systems that creates a "bottleneck" limiting further progress [16]. They discussed inter-networking, and the rapid changes in hardware, software and protocols to deal with it, as only exacerbating the difficulties for human developers to keep everything under their own control. They then described a few qualitative principles to introduce self-organization in highly engineered and inter-networked information systems, with some references to current applications, such as the hyperlink-based Web, and, with some projections into the future, even software development paradigms.

Kohonen's networks, the medium access control and routing protocols of the many computer network types, and Kauffman's RBNs all express self-organization of some degree. The heterogeneity of the three examples is evident, though. Some effort has been made to formalize this hard-sought property of systems. Gershenson and Heylighen, moving from classical considerations based on thermodynamics, and then considering statistical entropy, provided insight into the conditions that should describe the emergence of self-organization in observed systems.

They concluded that the term "self-organization" may rather describe a way to look at sys‐ tems than a class of systems [17].


As anticipated, Random Boolean Networks (RBNs) trace their self-organizing abilities in their biological model of inspiration. RBNs consist of a network of N nodes with Boolean state, each having K Boolean input connections. Both parameters N and K are fixed. Because of these characterizing parameters, RBNs have also been called NK networks. Each node state is updated at discrete time steps according to a Boolean function of the inputs. A variable number of Boolean outputs, propagating the node state, may depart from each node towards other nodes' inputs, arbitrarily. Indeed, both the connections and the Boolean state update function are chosen randomly during initialization and never changed (Figure 2).

Kauffman discussed RBNs as finite discrete dynamical systems in terms of the sequences of states the networks run through. Given that an RBN can assume 2<sup>*N*</sup> states, and that for each state there is only one possible successor, the network will run through finite-length cyclic sequences called *state cycles*, which are the *dynamical attractors* of the system.

**Figure 2.** A Random Boolean Network with two inputs per node. After each discrete time step, each node state is updated according to the Boolean function of the inputs i0 and i1. The node state is fed to the output at the following step.

The behavior of an RBN can be represented by the state transition diagram, a directed graph having a connected component for each state cycle. Not all the states in each of these subgraphs are part of the respective cycle, as states having no antecedents, the so-called *garden-of-Eden* states, may be present; they instead compose the state cycle's *basin of attraction* (Figure 3).
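A small simulation makes these notions tangible. The sketch below (plain Python; the function names and the choice of N = 5, K = 2 mirror the figures but are otherwise illustrative) builds a random NK network, enumerates the successor of each of the 2^N states, and extracts the garden-of-Eden states and the state-cycle lengths:

```python
import itertools, random

def random_rbn(n, k, seed=0):
    """Kauffman-style RBN: each node gets K random input nodes and a random
    Boolean function (a K-input truth table), both fixed at initialization."""
    rng = random.Random(seed)
    inputs = [rng.sample(range(n), k) for _ in range(n)]
    tables = [[rng.randint(0, 1) for _ in range(2 ** k)] for _ in range(n)]
    def step(state):
        return tuple(tables[i][sum(state[j] << b
                                   for b, j in enumerate(inputs[i]))]
                     for i in range(n))
    return step

n = 5
step = random_rbn(n, k=2)
succ = {s: step(s) for s in itertools.product((0, 1), repeat=n)}

# states with no antecedent are the garden-of-Eden states
eden = [s for s in succ if s not in set(succ.values())]

def cycle_length(state):
    """Follow a trajectory until a state repeats; the tail is a state cycle."""
    seen = {}
    while state not in seen:
        seen[state] = len(seen)
        state = succ[state]
    return len(seen) - seen[state]

print(len(eden), "garden-of-Eden states;",
      "cycle lengths:", sorted({cycle_length(s) for s in succ}))
```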

Properties of the *state cycles*, such as cycle length, asymptotic patterns, and *basins of attraction*, were used to classify the interesting complex behaviors of these simple models for different values of *K* [12]. Some basic findings, still providing some insight into the self-organization abilities of RBNs, are reported in Table 1. When the network is completely interconnected (*K* = *N*), and the sensitivity to initial conditions is at its maximum, *state cycle* lengths become large as *N* increases, yet their number remains comparatively small.

When *K* is equal to or greater than 5, RBNs keep showing chaotic behavior. A few concepts need to be introduced to analyze these results. The *internal homogeneity P* of a Boolean function of *K* inputs is defined as the ratio *M* / 2<sup>*K*</sup>, with *M* being the maximum between the number of 1's and 0's in the output column of the function's truth table. The *bias B* is then defined as 1 / *P*.
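Both definitions translate into a few lines; the helper below is an illustrative sketch that follows them literally, computing P and B from the output column of a truth table:

```python
def homogeneity_and_bias(output_column):
    """Internal homogeneity P = M / 2^K, with M the larger of the counts of
    1's and 0's in the output column; bias B = 1 / P, as defined above."""
    ones = sum(output_column)
    m = max(ones, len(output_column) - ones)
    p = m / len(output_column)
    return p, 1 / p

# e.g. a 2-input AND gate (output column 0,0,0,1): P = 3/4, B = 4/3
print(homogeneity_and_bias([0, 0, 0, 1]))
```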

In contrast with the first two *chaotic* cases, when *K* = 2 RBNs show the emergence of "spontaneous order", as both the cycle length and the number of attractors scale with the square root of *N*. Moreover, these networks show other important properties that result in higher stability against perturbations of the activity of the nodes. Indeed, more recently, a linear dependence was found sampling larger networks [18]. For *K* = 1, RBNs show a similar growth of the cycle length and an exponential rise of the number of attractors.


**Figure 3.** The state transition diagram of the RBN shown in Figure 2. States are numbered from 0 to 31 according to the binary representation of the five network nodes' values. The graph is partitioned into four unconnected components, one for each *state cycle*. The network will finally be attracted into one of the four *state cycles*: (31), (30, 29), (28), (4, 5), (0). On the left side, the 17 states that are unreachable from any other state are shown.



| | **State cycle length** | **Number of state cycle attractors** |
|---|---|---|
| *K* = *N* | 2<sup>*N*/2</sup> | *N* / *e* |
| *K* ≥ 5 | 2<sup>*BN*/2</sup> (*B* > 1) | ~ *N* log(1 / (1/2 ± (*P*(*K*) − 1/2))) |
| *K* = 2 | ~ √*N* | ~ √*N* |
| *K* = 1 | ~ √(π*N*/2) | Exponential in *N* |

**Table 1.** Properties of RBNs for different values of K. The state cycle length is the median length of the state cycles. B is the network *bias* while *P*(*K*) is the mean *internal homogeneity* of all the Boolean functions of K inputs.

The boundaries among the ordered, critical and chaotic dynamical phases, though not quite analytically assessed, still inspire new studies. An updated introduction to RBNs, including references to Asynchronous RBNs (ARBNs), Deterministic Asynchronous RBNs (DARBNs) and other variants of RBNs, can be found in the form of a tutorial in [18]. Gershenson described several methods to guide the self-organization of RBNs [19]. The need for a "guiding" process seems somewhat contradictory with the premise in the title. Indeed, he investigated the mechanisms through which natural selection may intervene in the self-organization of biological structures, suggesting that engineers may use the same parameters characterized in computational frameworks, such as RBNs and Cellular Automata.

Getting back to the gene regulation mechanisms that RBNs were designed to model, "self-organization" succeeded in "reducing the complexity" provided by the tens of thousands of genes in the human genome to the mere hundreds of types of human cells. RBNs are finite-state-space, deterministic, dynamical, discrete-time systems whose self-organizing property derives from having attractors, i.e. states that can be revisited by the network. An RBN can be in either an ordered or a chaotic dynamical phase; transitions are possible, and the transition from one to the other is "characterized by its criticality".

A static, stable phase preserves information but is prevented from computing or adapting. A chaotic phase provides the variability required for computing and adapting, but is incapable of preserving information. As the critical "interface" between the two phases provides the advantages of both, guiding an RBN towards self-organization means finding the necessary conditions to make it evolve towards the critical regime. Gershenson then considered several factors that can induce such evolution and gave a few hints on how criticality could help improve the adaptability, evolvability and robustness of RBNs.

#### **4. Cellular automata**


From their definition, it is evident that cellular automata are a special case of RBNs in which each node receives inputs only from its neighbors. Their simpler topology, and consequently simpler implementation, has given these minimal models manifesting self-organization an appeal that goes beyond "recreational" applications such as Conway's "Game of Life".

Some theoretical extensions include Probabilistic Cellular Automata (PCA) and Fault-tolerant Cellular Automata (FCA), studied by Gács [20] in the context of the problem of reliable computation with unreliable components. In particular, the error probability of each component is not required to decrease as the number of components increases, and the faults affect the local state but not the transition function.

Even if they are purely theoretical, such models may be useful in designing massively parallel "self-organizing" architectures since, due to the distributed nature of information in cellular automata, "self-stabilization" is required besides traditional error-correction techniques.

Cellular automata have had many applications to real-world problems. Not surprisingly, several biological models have been simulated with cellular automata. Shumate et al. described a simulation of Ductal Carcinoma in Situ [21]. Chaudary et al. proposed a simulation model for Tumorigenesis [22]. Shimokawa and Muraki investigated a simple model of nerve excitement propagation [23]. Sakamoto et al. proposed a method for surgery simulation based on voxel automata [24].

Cellular automata have also been used to model complex dynamics such as those of urban traffic [25–27]. Recent applications of cellular automata to image processing include superpixel segmentation [28], image compression [29], and computer graphics [30].

Cellular automata also continue to be used in more theoretical studies on algorithms [31, 32].

**Figure 4.** Rule 184 (see Figure 1) is one of the most used cellular automata in traffic simulation. The distribution of vehicles and spaces in a road lane is modeled as black and white cells, respectively, in each image row. The topmost cell row depicts the initial distribution (with a black/white ratio of 3/8), let evolve over 300 iterations. After a few iterations still presenting random behavior, visible in the form of triangular structures, the regularization ability of the rule becomes manifest as vehicles move to the right at constant speed.
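The behavior in Figure 4 can be reproduced in a few lines. The sketch below (illustrative Python, with the periodic boundary as an assumption of ours) encodes rule 184 directly in its traffic reading, where a car advances exactly when the cell ahead is free:

```python
import random

def step184(cells):
    """Rule 184: a car (1) moves into the empty cell (0) on its right."""
    n = len(cells)
    return [1 if (cells[i] == 0 and cells[i - 1] == 1)        # a car arrives
            or (cells[i] == 1 and cells[(i + 1) % n] == 1)    # a car is blocked
            else 0
            for i in range(n)]

random.seed(0)
road = [1 if random.random() < 3 / 8 else 0 for _ in range(64)]  # density 3/8
for _ in range(20):
    print("".join(" #"[c] for c in road))
    road = step184(road)
```

After the initial transient, every car moves one cell to the right per step, reproducing the constant-speed stripes of the figure.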

#### **5. Hardware**

The balance between the hardware and software components in signal processing applications has always been a trade-off between the flexibility of microprocessor-based solutions and the performance of ASIC implementations.


Considering the aforementioned SOMs and ANNs, the literature abounds with hardware implementations motivated by the scarce performance of the analogous sequential methods. In the 1980s, aiming at parallel real-time signal processing with the then-available analog very-large-scale integration (VLSI) technology, Chua introduced his Cellular Neural Networks (CNNs) [33], describing some applications to image processing tasks [34]. Subsequently, Yang et al. showed a VLSI implementation of CNNs [35].

A couple of decades later, Ruckert et al. discussed massively parallel implementations of artificial neural networks at ultra-large-scale integration (ULSI) [36], later showing a massively parallel architecture for SOMs [37].

Hopfield had started his seminal work on recurrent neural networks posing himself the question of whether "the ability of large collection of neurons to perform computational tasks may in part be a spontaneous collective consequence of having a large number of interacting simple neurons". He concluded that this is actually the case, and that implementations of such models could lead to integrated circuits that are more fault-tolerant than normal circuits [38].

Weightless Neural Networks (WNNs), being based on random access memories, provide another ANN paradigm inherently tied to circuit-level implementation, whose origins trace back to Alexander's "Self-adaptive universal logic circuits" [39].

Though all of these are examples of systems having self-adapting qualities, self-organization at the hardware level, the "microscopic" layer in our biological analogy, had simply not been possible until the advent of reconfigurable circuits, when field programmable gate arrays (FPGAs) and coarse-grain reconfigurable arrays added a new degree of configurability, and related complexity, to computer systems. A survey on reconfigurable hardware with emphasis on real-time configuration is provided by Shoa and Shirani [40].

#### **5.1. Coarse-grained and fine-grained architectures**

Hartenstein, reviewing most of the "coarse-grained reconfigurable architectures" of a decade (circa 2000) [41], suggested that, with the explosion of design costs and the reduction of production life cycles, performance becomes relatively less important in the design of computing devices. Instead, extending product longevity and reducing "support turnaround, in-system debugging, profiling, verification, tuning, field-maintenance, and field-upgrade" time by employing reconfigurable arrays is much more important. Hartenstein, dismissing "von Neumann" architectures as obsolete in the light of the dominance of host/accelerator designs, proposed a new coarse-grained soft machine paradigm, in which a so-called "co-compilation" provides instructions for the host and data-path configuration information at the same time.

Other approaches to the mapping of coarse-grained architectures from high-level synthesis are present in the literature [42]. Recently, System-on-a-Chip implementations of highly reconfigurable arrays have been described, with applications to face detection [43], Internet protocol processing [44], FIR filters and ICA [45]. Several SOMs, CNNs and derivative neural network models have been designed for reconfigurable hardware [46, 47].


Fine-grained systems bring configurability close to the gate or transistor level, permitting analog, digital, and hybrid implementations. In contrast to coarse-grained systems, the data path width is reduced to the bare minimum, with the advantages of increased flexibility and lower costs, but the general-purpose routing is generally less energy efficient.

Nanotechnologies aim at even finer degrees of integration, and it is reasonable to assume that new computational paradigms may emerge at the hardware level because of that. However, as in the pioneering stage of any technology, the span from theory to implementation may not be short. Lin et al. proposed a hybrid FPGA architecture based on nano-RAM [48] with run-time configuration abilities and high logic density, later reverting to a CMOS SRAM implementation due to the immaturity of the nano-RAM fabrication processes [49].

**Figure 5.** A schematic depiction of Evolvable Hardware. A reconfigurable device (e.g. an FPGA), in gray, is coupled with a configuration storage (light blue). The configuration is updated by the control block (green) in real time according to some fitness function, as in genetic programming.

#### **5.2. Evolvable hardware**

Reconfigurable hardware is turned into a specific implementation by loading bitstreams compiled from "soft cores" coded in some hardware description language. Being a totally software matter, a host processor can perform real-time reconfiguration when needed. New paradigms blending traditional machine language and reconfigurable hardware bitstreams thus become possible.


The idea that hardware could autonomously change its own configuration, seeking the best one according to some fitness function, was called evolvable hardware (EHW), resorting, once again, to a biological metaphor [50, 51]. Continuing with the metaphor, the bitstream takes the role of the digital DNA (Figure 5).

Approaches from genetic and evolutionary programming have been attempted on the hardware configuration bitstream. Interesting applications of EHW to pattern recognition are those presented by Glette et al. [52, 53].
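As an illustration of the principle rather than of any specific EHW system, the toy Python loop below evolves a population of bitstrings under a placeholder fitness function; in real evolvable hardware, evaluating the fitness would mean configuring the device with the candidate bitstream and measuring the resulting circuit:

```python
import random

def evolve(fitness, n_bits=64, pop=30, gens=100, p_mut=0.02, seed=0):
    """Toy genetic loop over configuration bitstreams: truncation selection,
    one-point crossover and bit-flip mutation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]          # keep the better half
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)        # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_bits):               # rare bit-flip mutation
                if rng.random() < p_mut:
                    child[i] ^= 1
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# placeholder fitness standing in for a measurement of the configured device
best = evolve(lambda bits: sum(bits[::2]) - sum(bits[1::2]))
```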

Even though most work on EHW concerns digital implementations, some evolution-oriented analog implementations are reported in the literature, such as an evolvable hardware architecture based on field programmable transistor arrays [54], and a quantum-inspired paradigm to be implemented in evolutionary analog hardware [55].

The enthusiasm of the early 2000s notwithstanding, EHW has not yet delivered what it promised. Cancare et al. [56], investigating the reasons for this apparently missed success and citing scalability issues as the most prominent, propose to abandon generic genetic algorithms and to look at hierarchical evolution and linkage learning, encouraging support from the Evolutionary Computation community.

#### **6. Networks**

Computer networks provide many examples of global behaviors emerging from the interactions of elements without centralized control. At different levels of abstraction and implementation, from medium access control and routing to the application-level protocols, algorithms drive each independent network node so that some global goal, be it communication, coordination, or distributed processing, is achieved. Thus, unsurprisingly, "self-" prefixed and akin terms abound in the related literature.

While computer networks in general are a rather natural field in which to study self-organization, and many analogies with biological systems may be detected, to avoid broadening the discussion too much we restrict it to only one example of a network of very simple nodes in which distributed processing of locally collected data is the main goal: Wireless Sensor Networks (WSNs).

These systems are composed of a number of nodes, consisting of miniaturized, battery-operated computational elements fitted with sensors to monitor the surrounding environment, connected through short-distance radio links. Depending on the application, the number of nodes may vary considerably, from a few to thousands of units and more. In many scenarios sensor nodes are dispersed in the environment, so their expendability becomes another important requisite.


As a consequence of these constraints on physical size, energy supply and cost per unit, processing resources are limited, and in most designs they only consist of simple microcontrollers. Comprehensive surveys on WSNs, including sensor technologies, network protocols, and hardware and software architectures, are provided by Akyldiz et al. [57] and Yick et al. [58].

Even though WSNs were conceived as distributed sensing architectures, several examples are provided in the literature of nodes also performing in-network pre-processing of raw sensed data [59]. The need for a trade-off between the limited available energy source and the manifold application scenarios [60] typically calls for the application of self-organization techniques, breaking the boundaries between the traditional architectural layers in order to optimize the behavior of such nodes. Sohrabi et al. presented a number of algorithms and protocols for the self-organization of wireless sensor networks [61]. Self-organization techniques to reduce energy consumption in ad-hoc networks of wireless devices were described by Olascuaga-Cabrera et al. [62]. With even more technological constraints than WSNs, Wireless Sensor Body Networks (WSBNs) consist of wearable and implantable devices. Health-monitoring usage of WSBNs is discussed by Hao and Foster [63].

Indeed, due to their ultra-low energy consumption requirements, WSBNs represent a very challenging scenario for sensor devices based on current, and even near-future, general purpose processing elements, and implementing signal processing algorithms on nodes may prove unfeasible.

Alternative approaches based on application specific integrated circuits have been investigated [64]. Departing from the network-oriented vision, and calling for the establishment of self-managing systems engineering, Beal et al. proposed the "amorphous medium" abstraction, in which the deployed sensor network represents the physical space of the application to be engineered [65].

From an engineering perspective, the application goal is reached by programming the medium instead of the network. The former abstracts the computational model, turning sensor nodes into points of the physical space. A global behavior is described in a specifically crafted language, as if it were to be executed by the abstract medium. In practice, the abstract description is compiled into code that is executed identically on each node. Besides executing the same code, nodes interact only with neighboring devices.

Beal et al. called this programming paradigm amorphous computing, revealing their inspiration to come from properties of biological systems such as morphogenesis and regeneration. More interestingly, despite substantial topological differences, many similarities can be detected between the "amorphous medium" and cellular automata.
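To illustrate the "same code on every node, neighbor-only interaction" model in miniature, the sketch below, our own construction rather than Beal et al.'s amorphous-medium language, runs an identical gossip update on each node of a small made-up sensor graph; repeated local averaging drives all nodes to a common consensus value:

```python
# Each node runs the same update and sees only its neighbors: repeated
# local averaging converges to a degree-weighted average of the readings.
readings = {0: 21.0, 1: 19.5, 2: 23.0, 3: 20.5}      # made-up sensor values
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # line topology

for _ in range(200):  # synchronous gossip rounds
    readings = {
        n: (readings[n] + sum(readings[m] for m in neighbors[n]))
           / (1 + len(neighbors[n]))
        for n in readings
    }

print({n: round(v, 3) for n, v in readings.items()})  # all nodes near 21.05
```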

### **7. Conclusions**

The paradoxical fascination of simplicity producing complexity has traversed decades of research in information systems, even more so now that extremely high integration is packing millions of highly modular circuits into a few square millimeters, and inter-networking is the next, or rather the current, large-scale integration.


Some research directions seem to suggest that breaking some of the fixed ties in engineered systems, letting them self-organize in response to environmental changes, as biological systems have been doing for millions of years, is the way to go to "put some order in the chaos". In support of these indications, self-organizing systems have provided interesting results in modeling complex processes, blurring a little the line between artificial and natural systems.

Other research seeks to extend self-organization to the extreme of self-healing systems able to recognize their own faults and self-repair, while biological applications confirm that taking self-organization into account when studying natural processes, while not an easy task, can provide more comprehensive and effective models.

Whether all these efforts move along the path towards truly intelligent systems, or even Artificial Life, as some have been suggesting for years, is yet to be discovered; nevertheless, it is a very interesting path.

#### **Author details**

Daniele Peri and Salvatore Gaglio

DICGIM - University of Palermo, Italy, ICAR - CNR, Palermo, Italy

#### **References**


[1] Widrow B, Lehr M. 30 years of adaptive neural networks: perceptron, Madaline, and backpropagation. Proceedings of the IEEE 1990;78(9) 1415–1442. doi:10.1109/5.58323.

[2] Kohonen T. The self-organizing map. Proceedings of the IEEE 1990;78(9) 1464–1480. doi:10.1109/5.58325.

[3] Bednar JA, Kelkar A, Miikkulainen R. Scaling Self-Organizing Maps To Model Large Cortical Networks. Neuroinformatics 2001; 275–302.

[4] Fritzke B. Let it grow – self-organizing feature maps with problem dependent cell structure. In: Kohonen T, Mäkisara K, Simula O, Kangas J, editors, Artificial Neural Networks. North-Holland, Amsterdam, 1991;403–408.

[5] Choi DI, Park SH. Self-creating and organizing neural networks. Neural Networks, IEEE Transactions on 1994;5(4) 561–575. doi:10.1109/72.298226.

[6] Rauber A, Merkl D, Dittenbach M. The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data. Neural Networks, IEEE Transactions on 2002;13(6) 1331–1341. doi:10.1109/TNN.2002.804221.

[7] Macq D, Verleysen M, Jespers P, Legat JD. Analog implementation of a Kohonen map with on-chip learning. Neural Networks, IEEE Transactions on 1993;4(3) 456–461. doi:10.1109/72.217188.

[8] Ienne P, Thiran P, Vassilas N. Modified self-organizing feature map algorithms for efficient digital hardware implementation. Neural Networks, IEEE Transactions on 1997;8(2) 315–330. doi:10.1109/72.557669.

[9] Chacon-Murguia M, Gonzalez-Duarte S. An Adaptive Neural-Fuzzy Approach for Object Detection in Dynamic Backgrounds for Surveillance Systems. Industrial Electronics, IEEE Transactions on 2012;59(8) 3286–3298. doi:10.1109/TIE.2011.2106093.

[10] Dingle A, Andreae J, Jones R. The chaotic self-organizing map. In: Artificial Neural Networks and Expert Systems, 1993. Proceedings., First New Zealand International Two-Stream Conference on. 15–18. doi:10.1109/ANNES.1993.323092.

[11] da Silva L, Sandmann H, Del-Moral-Hernandez E. A self-organizing architecture of recursive elements for continuous learning. In: Neural Networks, 2008. IJCNN 2008. IEEE International Joint Conference on.

[12] Kauffman SA. The Origins of Order: Self-Organization and Selection in Evolution. 1 edition. Oxford University Press, USA, 1993.

[13] Wolfram S. Statistical mechanics of cellular automata. Rev. Mod. Phys. 1983;55 601–644. doi:10.1103/RevModPhys.55.601.

[14] Wolfram S. Universality and complexity in cellular automata. Physica D: Nonlinear Phenomena 1984;10(1-2) 1–35. doi:10.1016/0167-2789(84)90245-8.

[15] Mandelbrot BB. The Fractal Geometry of Nature. WH Freeman and Co., New York, 1982.

[16] Heylighen F, Gershenson C, Staab S, Flake G, Pennock D, Fain D, De Roure D, Aberer K, Shen WM, Dousse O, Thiran P. Neurons, viscose fluids, freshwater polyp hydra-and self-organizing information systems. Intelligent Systems, IEEE 2003;18(4) 72–86. doi:10.1109/MIS.2003.1217631.

[17] Gershenson C, Heylighen F. When Can We Call a System Self-Organizing? In: Banzhaf W, Ziegler J, Christaller T, Dittrich P, Kim J, editors, Advances in Artificial Life, volume 2801 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2003;606–614. doi:10.1007/978-3-540-39432-7_65.

[18] Gershenson C. Introduction to Random Boolean Networks. In: Bedau M, Husbands P, Hutton T, Kumar S, Suzuki H, editors, Workshop and Tutorial Proceedings, Ninth International Conference on the Simulation and Synthesis of Living Systems (ALife IX). Boston, MA, 160–173.

[19] Gershenson C. Guiding the Self-organization of Random Boolean Networks. ArXiv e-prints 2010;1005.5733.

[20] Gacs P. Reliable cellular automata with self-organization. In: Foundations of Computer Science, 1997. Proceedings., 38th Annual Symposium on. 90–99. doi:10.1109/SFCS.1997.646097.

[21] Shumate S, El-Shenawee M. Computational Model of Ductal Carcinoma In Situ: The Effects of Contact Inhibition on Pattern Formation. Biomedical Engineering, IEEE Transactions on 2009;56(5) 1341–1347. doi:10.1109/TBME.2008.2005638.

[22] Chaudhary S, Shin SY, Won JK, Cho KH. Multiscale Modeling of Tumorigenesis Induced by Mitochondrial Incapacitation in Cell Death. Biomedical Engineering, IEEE Transactions on 2011;58(10) 3028–3032. doi:10.1109/TBME.2011.2159713.

[23] Shimokawa K, Muraki S. A study on spatial and temporal visual simulation of nerve excitement propagation. In: Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, volume 1. 217–221 vol.1. doi:10.1109/IJCNN.2000.857839.

[24] Sakamoto Y, Tuchiya K, Kato M. Deformation method for surgery simulation using voxel space automata. In: Systems, Man, and Cybernetics, 1999. IEEE SMC '99 Conference Proceedings. 1999 IEEE International Conference on, volume 4. 1026–1031 vol.4. doi:10.1109/ICSMC.1999.812551.

[25] Wei J, Wang A, Du N. Study of self-organizing control of traffic signals in an urban network based on cellular automata. Vehicular Technology, IEEE Transactions on 2005;54(2) 744–748. doi:10.1109/TVT.2004.841536.

[26] Rosenblueth DA, Gershenson C. A model of city traffic based on elementary cellular automata. Complex Systems 2011;19(4) 305–322.

[27] Gershenson C, Rosenblueth DA. Self-organizing traffic lights at multiple-street intersections. Complexity 2012;17(4) 23–39. doi:10.1002/cplx.20392.

[28] Wang D, Kwok N, Jia X, Fang G. A Cellular Automata approach for superpixel segmentation. In: Image and Signal Processing (CISP), 2011 4th International Congress on, volume 2. 1108–1112. doi:10.1109/CISP.2011.6100339.

[29] Cappellari L, Milani S, Cruz-Reyes C, Calvagno G. Resolution Scalable Image Coding With Reversible Cellular Automata. Image Processing, IEEE Transactions on 2011;20(5) 1461–1468. doi:10.1109/TIP.2010.2090531.

[30] Debled-Rennesson I, Margenstern M. Cellular automata and discrete geometry. In: High Performance Computing and Simulation (HPCS), 2011 International Conference on. 780–786. doi:10.1109/HPCSim.2011.5999908.

[31] OrHai M, Teuscher C. Spatial Sorting Algorithms for Parallel Computing in Networks. In: Self-Adaptive and Self-Organizing Systems Workshops (SASOW), 2011 Fifth IEEE Conference on. 73–78. doi:10.1109/SASOW.2011.10.

[32] Maignan L, Gruau F. Convex Hulls on Cellular Spaces: Spatial Computing on Cellular Automata. In: Self-Adaptive and Self-Organizing Systems Workshops (SASOW), 2011 Fifth IEEE Conference on. 67–72. doi:10.1109/SASOW.2011.14.

[33] Chua L, Yang L. Cellular neural networks: theory. Circuits and Systems, IEEE Transactions on 1988;35(10) 1257–1272. doi:10.1109/31.7600.

[34] Chua L, Yang L. Cellular neural networks: applications. Circuits and Systems, IEEE Transactions on 1988;35(10) 1273–1290. doi:10.1109/31.7601.

[35] Yang L, Chua L, Krieg K. VLSI implementation of cellular neural networks. In: Circuits and Systems, 1990., IEEE International Symposium on. 2425–2427 vol.3. doi:10.1109/ISCAS.1990.112500.

[36] Ruckert U. ULSI architectures for artificial neural networks. Micro, IEEE 2002;22(3) 10–19. doi:10.1109/MM.2002.1013300.

[37] Porrmann M, Witkowski U, Ruckert U. A massively parallel architecture for self-organizing feature maps. Neural Networks, IEEE Transactions on 2003;14(5) 1110–1121. doi:10.1109/TNN.2003.816368.

[38] Hopfield JJ. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 1982;79(8) 2554–2558.

[39] Aleksander I. Self-adaptive universal logic circuits. Electronics Letters 1966;2(8) 321–322. doi:10.1049/el:19660270.

[40] Shoa A, Shirani S. Run-Time Reconfigurable Systems for Digital Signal Processing Applications: A Survey. The Journal of VLSI Signal Processing 2005;39 213–235. doi:10.1007/s11265-005-4841-x.

[41] Hartenstein R. A decade of reconfigurable computing: a visionary retrospective. In: Design, Automation and Test in Europe, 2001. Conference and Exhibition 2001. Proceedings. 642–649. doi:10.1109/DATE.2001.915091.

[42] Lee G, Lee S, Choi K. Automatic mapping of application to coarse-grained reconfigurable architecture based on high-level synthesis techniques. In: SoC Design Conference, 2008. ISOCC '08. International, volume 01. I-395–I-398. doi:10.1109/SOCDC.2008.4815655.

[43] He C, Papakonstantinou A, Chen D. A novel SoC architecture on FPGA for ultra fast face detection. In: Computer Design, 2009. ICCD 2009. IEEE International Conference on. 412–418. doi:10.1109/ICCD.2009.5413122.

[44] Badawi M, Hemani A. A coarse-grained reconfigurable protocol processor. In: System on Chip (SoC), 2011 International Symposium on. 102–107. doi:10.1109/ISSOC.2011.6089688.

[45] Jain V, Bhanja S, Chapman G, Doddannagari L. A highly reconfigurable computing array: DSP plane of a 3D heterogeneous SoC. In: SOC Conference, 2005. Proceedings. IEEE International. 243–246. doi:10.1109/SOCC.2005.1554503.

[46] Hendry D, Duncan A, Lightowler N. IP core implementation of a self-organizing neural network. Neural Networks, IEEE Transactions on 2003;14(5) 1085–1096. doi:10.1109/TNN.2003.816353.

[47] Starzyk J, Zhu Z, Liu TH. Self-organizing learning array. Neural Networks, IEEE Transactions on 2005;16(2) 355–363. doi:10.1109/TNN.2004.842362.

[48] Zhang W, Jha NK, Shang L. A hybrid nano/CMOS dynamically reconfigurable system–Part I: Architecture. J. Emerg. Technol. Comput. Syst. 2009;5(4) 16:1–16:30. doi:10.1145/1629091.1629092.

[49] Lin TJ, Zhang W, Jha NK. SRAM-Based NATURE: A Dynamically Reconfigurable FPGA Based on 10T Low-Power SRAMs. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 2011;PP(99) 1–5. doi:10.1109/TVLSI.2011.2169996.

[50] Yao X, Higuchi T. Promises and challenges of evolvable hardware. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 1999;29(1) 87–97. doi:10.1109/5326.740672.

[51] Forbes N. Evolution on a chip: evolvable hardware aims to optimize circuit design. Computing in Science Engineering 2001;3(3) 6–10. doi:10.1109/5992.919259.

[52] Glette K, Torresen J, Yasunaga M, Yamaguchi Y. On-Chip Evolution Using a Soft Processor Core Applied to Image Recognition. In: Adaptive Hardware and Systems, 2006. AHS 2006. First NASA/ESA Conference on. 373–380. doi:10.1109/AHS.2006.55.

[53] Glette K, Torresen J, Hovin M. Intermediate Level FPGA Reconfiguration for an Online EHW Pattern Recognition System. In: Adaptive Hardware and Systems, 2009. AHS 2009. NASA/ESA Conference on. 19–26. doi:10.1109/AHS.2009.46.

[54] Stoica A, Zebulum R, Keymeulen D, Tawel R, Daud T, Thakoor A. Reconfigurable VLSI architectures for evolvable hardware: from experimental field programmable transistor arrays to evolution-oriented chips. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 2001;9(1) 227–232. doi:10.1109/92.920839.

[55] Wang Y, Shi Y. The application of quantum-inspired evolutionary algorithm in analog evolvable hardware. In: Environmental Science and Information Application Technology (ESIAT), 2010 International Conference on, volume 2. 330–334. doi:10.1109/ESIAT.2010.5567359.

[56] Cancare F, Bhandari S, Bartolini D, Carminati M, Santambrogio M. A bird's eye view of FPGA-based Evolvable Hardware. In: Adaptive Hardware and Systems (AHS), 2011 NASA/ESA Conference on. 169–175. doi:10.1109/AHS.2011.5963932.

[57] Akyildiz I, Su W, Sankarasubramaniam Y, Cayirci E. A survey on sensor networks. Communications Magazine, IEEE 2002;40(8) 102–114. doi:10.1109/MCOM.2002.1024422.

[58] Yick J, Mukherjee B, Ghosal D. Wireless sensor network survey. Computer Networks 2008;52(12) 2292–2330. doi:10.1016/j.comnet.2008.04.002.

[59] Gatani L, Lo Re G, Ortolani M. Robust and Efficient Data Gathering for Wireless Sensor Networks. In: Proceedings of the 39th Annual Hawaii International Conference on System Sciences, HICSS'06. IEEE Computer Society, 235–242.

[60] Anastasi G, Lo Re G, Ortolani M. WSNs for Structural Health Monitoring of Historical Buildings. In: Proceedings of HSI'09. The 2nd Conference on Human System Interactions. IEEE, 574–579.


**Chapter 10**

#### **A Digital Signal Processing Architecture for Soft-Output MIMO Lattice Reduction Aided Detection**

Alan T. Murray and Steven R. Weller

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51649

© 2013 Murray and Weller; licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **1. Introduction**

Many wireless communication standards now include the use of multiple transmit and receive antennas as a means of achieving increased throughput or spectral efficiency, including LTE, WiMAX and WiFi (IEEE 802.11n). The task of a detector for a multi-input multi-output (MIMO) communications channel is to separate the spatially mixed and noise-corrupted data streams, and to produce reliable estimates of the transmitted bits. The brute-force maximum-likelihood (ML) detector provides optimal error-rate performance, but is computationally infeasible when either dense symbol constellations or large numbers of antennas are used. Hardware implementation of ML receivers is therefore very challenging, leading to linear detectors based on well-known approaches such as zero forcing (ZF) or minimum mean-square error (MMSE) detection, or nonlinear methods such as successive interference cancellation (SIC), which offer manageable receiver complexity at the expense of highly suboptimal error-rate performance.

One powerful class of receivers that has been developed over the past decade is based on the highly developed mathematical theory of point lattices, which are periodic arrangements of discrete points. The basic idea is to consider the distortion introduced by the noise-free part of a MIMO channel as a representation of a lattice, and then to perform suboptimal detection on an "improved" representation of the channel matrix derived from a "reduced" lattice. The suitably reduced lattice facilitates the search for the lattice point closest to the received vector, shifting most of the computational complexity to a pre-processing step before linear detection. Such lattice reduction aided detection (LRAD) approaches to MIMO receiver design have significantly closed the gap between feasible yet high-performance MIMO detection, and optimal (but impractical) ML detection.


To date, most LRAD-based MIMO detectors produce hard outputs, in which an estimate of the most likely vector of transmitted symbols is generated. For high-performance wireless communication systems, however, it is commonplace that the information transmitted over the air is coded, thereby containing not only raw data, but also the redundant information needed to perform forward error correction (FEC) at the receiver. State-of-the-art FEC codes such as turbo codes and low-density parity-check (LDPC) codes [1], which require estimates of the *probability* that a given transmitted bit was a 1 or a 0, therefore call for *soft output* detectors. The extension of hard-output LRAD detectors to the soft-output case is therefore of high practical relevance, but also recognized as a difficult problem [2, p16]. In this chapter, we present what is believed to be the first digital signal processing (DSP) implementation of a soft-output lattice reduction aided MIMO detector, based on an approach to MIMO detection known as subspace LRAD (SLRAD) proposed by Windpassinger [3, 4].

The chapter is organized as follows. In Section 2 we present the wireless MIMO system model, with an emphasis on how transmitted symbols are drawn from point sets consistent with the lattice theoretic approach to follow. In Section 3 we formally define lattices, and present the most celebrated algorithm for lattice reduction, known as the Lenstra-Lenstra-Lovász (LLL) algorithm. We then show how hard-output lattice-based detection can be used in conjunction with commonly used linear MIMO detectors in Section 4. In Section 5 we outline Windpassinger's subspace-based approach to LRAD in which a list of candidate symbols is produced, thereby facilitating soft-output LRAD. Finally in Section 6 we present a detailed description of our hardware implementation of a soft-output lattice reduction aided MIMO detector.

#### **2. System model**

We consider a MIMO wireless communication system with *n*<sup>T</sup> transmit and *n*<sup>R</sup> receive antennas. The complex baseband model for this MIMO system is

$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n},\tag{1}$$

**Figure 1.** MIMO Wireless Channel

where **y** ∈ **C***n*<sup>R</sup> is the received vector, **H** ∈ **C***n*R×*n*<sup>T</sup> is the channel matrix, **n** ∈ **C***n*<sup>R</sup> is the channel noise, and **x** ∈ **C***n*<sup>T</sup> is the vector of transmitted symbols, as shown in Fig. 1.

We assume that the noise **n** ≜ [*n*1, *n*2,..., *nn*<sup>R</sup>]<sup>*T*</sup> contains independent and identically distributed (i.i.d.) elements *nm* ∼ CN(0, *σ*<sup>2</sup>), *m* = 1, . . . , *n*R. The channel matrix **H** has i.i.d. entries *hm*,*n* ∼ CN(0, 1), for *m* = 1, . . . , *n*R and *n* = 1, . . . , *n*T, where it is assumed that there are at least as many receive antennas as transmit antennas: *n*R ≥ *n*T.

An uncorrelated Rayleigh fading propagation environment is therefore assumed in this chapter, though it should be noted that lattice reduction aided detection receivers similar to those presented later in this chapter have been proposed for environments in which there is either temporal [5] or frequency-selective [6] fading.

The task of the MIMO receiver is to recover **x** from **y**, based on knowledge of both the channel realization **H** and the channel noise variance *σ*<sup>2</sup>.
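As a concrete illustration of the model (1) and the statistical assumptions above, the following short simulation, our sketch rather than part of any reference implementation, draws one channel realization and one received vector; the sizes, seed and noise variance are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, n_r, sigma2 = 4, 4, 0.1          # arbitrary sizes and noise variance

def circ_normal(shape, var, rng):
    # CN(0, var): i.i.d. real and imaginary parts, each of variance var/2.
    s = np.sqrt(var / 2)
    return rng.normal(0, s, shape) + 1j * rng.normal(0, s, shape)

H = circ_normal((n_r, n_t), 1.0, rng)  # uncorrelated Rayleigh-fading channel
x = (rng.choice([-1.0, 1.0], n_t)
     + 1j * rng.choice([-1.0, 1.0], n_t)) / np.sqrt(2)  # unit-power QPSK symbols
n = circ_normal(n_r, sigma2, rng)      # channel noise
y = H @ x + n                          # received vector, per (1)

print(np.round(y, 3))
```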

The vector of transmitted symbols is denoted **x** ≜ [*x*1, *x*2,..., *xn*<sup>T</sup>]<sup>*T*</sup>. In this chapter we restrict attention to transmit symbols drawn from finite sets of points, known as *constellations*, taken from a square grid; in particular, the quadrature phase-shift keying (QPSK), 16-quadrature amplitude modulation (16-QAM) and 64-QAM constellations depicted in Fig. 2. We do not consider non-rectangular constellations, such as 8-PSK, due to an inherent incompatibility with the lattice-theoretic framework exploited by lattice reduction aided detection, and also the limited applicability of non-rectangular constellations in emerging wireless communication standards.

The symbol transmitted from the *n*th antenna, denoted *xn*, is drawn from a constellation A*n*:

$$x_n \in \sqrt{E_{sn}}\,\mathcal{A}_n, \tag{2}$$

where the scalar *Esn* is the average transmitted symbol power. We define the vector **Es** ≜ [*Es*1, *Es*2,..., *Esn*<sup>T</sup>] so that

$$\mathbb{E}\left[\mathbf{x}\mathbf{x}^{H}\right] = \text{diag}\left(\mathbf{E\_{s}}\right). \tag{3}$$

The selection of **Es** depends on the particular objective of transmit power scaling and indeed varies in practical implementations. In this chapter, to enable fair comparison between systems employing differing modulation formats, we constrain the average power per information bit to unity (*Eb* = 1).

The constellations considered in this chapter are formed from a subset of scaled and shifted Gaussian integers **Z**[*i*] ≜ {*a* + *ib* | *a*, *b* ∈ **Z**} [7, p. 230]:

$$\mathbb{X} \triangleq \left\{ a + ib + \frac{1+i}{2} \mid a, b \in \mathbb{Z} \right\}. \tag{4}$$


**Figure 2.** The three constellations used in this chapter

In this chapter, we restrict attention to the three subsets X*<sup>n</sup>* ⊂ **X** shown in Fig. 3, where the introduction of the offset term in (4) maintains symmetry of each constellation with respect to the axes. We refer to constellations formed in this manner as *Gaussian integer constellations*.

**Figure 3.** Three Gaussian integer constellations

The constellation A*<sup>n</sup>* with elements *α<sup>n</sup>* ∈ A*<sup>n</sup>* employed at the *n*th transmit antenna is:

$$\mathcal{A}_n = \frac{\mathcal{X}_n}{\sqrt{c_n}}, \tag{5}$$

where *cn* is the average energy of X*n*. Dividing each element of X*n* by √*cn* ensures that A*n* has unit average energy; constellations formed in this way are referred to as normalized constellations. For square QAM constellations such as those in Fig. 2, *cn* = (|X*n*| − 1)/6.

It is important to note that we deliberately allow each transmit antenna to be independently mapped to a constellation set. In summary, the transmitted symbols *xn* are formed by scaling of the elements *x*¯*<sup>n</sup>* ∈ X*n*:

$$x_n = \sqrt{\frac{E_{sn}}{c_n}}\,\bar{x}_n. \tag{6}$$
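The construction in (4)–(6) is easy to check numerically. The sketch below, ours and not taken from the chapter's implementation, builds the three normalized Gaussian integer constellations and verifies both *cn* = (|X*n*| − 1)/6 and the unit average energy of A*n*:

```python
import numpy as np

def gaussian_integer_constellation(M):
    # Square subset of X = {a + ib + (1+i)/2 | a, b integers} with M points, per (4).
    k = int(np.sqrt(M))                    # points per dimension
    axis = np.arange(k) - k / 2 + 0.5      # e.g. k=4 -> [-1.5, -0.5, 0.5, 1.5]
    pts = np.array([a + 1j * b for a in axis for b in axis])
    c_n = np.mean(np.abs(pts) ** 2)        # average energy of X_n
    assert np.isclose(c_n, (M - 1) / 6)    # c_n = (|X_n| - 1)/6 for square QAM
    return pts / np.sqrt(c_n)              # normalized constellation A_n, per (5)

for M in (4, 16, 64):                      # QPSK, 16-QAM, 64-QAM
    A = gaussian_integer_constellation(M)
    print(M, np.round(np.mean(np.abs(A) ** 2), 6))  # unit average energy
```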

The effect of a given channel realization **H** is to rotate and stretch (or contract) the axes of the otherwise square decision regions of the optimal, maximum-likelihood (ML) receiver. The error probability of a detector is determined by the distance of constellation points (mapped by **H**) from the associated decision boundaries. The essential idea of LR-aided detectors is to obtain a "more orthogonal" representation for the channel realization **H**, before detection using a low-complexity (sub-optimal) receiver. In the following section we make these ideas precise, drawing on the well-established mathematical literature on point lattices to formalize what is meant by the notion of a "more orthogonal" representation, and how it can be achieved and quantified.

#### **3. Lattice reduction**

#### **3.1. Lattices**

A *complex lattice* consists of all linear combinations of the set of linearly independent basis column vectors **<sup>b</sup>***k*, 1 ≤ *<sup>k</sup>* ≤ *<sup>M</sup>* of the basis matrix **<sup>B</sup>** ∈ **<sup>C</sup>***N*×*M*, *<sup>M</sup>* ≤ *<sup>N</sup>*. A complex lattice formed from basis matrix **B** is therefore the set of points

$$\mathcal{L}(\mathbf{B}) \triangleq \left\{ \sum_{k=1}^{M} s_k \mathbf{b}_k \;\middle|\; s_k \in \mathbb{Z}[i] \right\},$$

where **Z**[*i*] ≜ {*a* + *ib* | *a*, *b* ∈ **Z**} is the ring of Gaussian integers [7].

The number of possible bases for a given lattice L is infinite, since any basis **B˜** = **BT** forms the same lattice L(**B˜**) = L(**B**) when the transformation matrix **T** is unimodular, i.e. det(**T**) = ±1 and **T** ∈ **Z**[*i*]<sup>*M*×*M*</sup>. Finding a basis in which the basis vectors are (roughly speaking) reasonably short and almost orthogonal is known as lattice *basis reduction*, which we now describe formally.

#### **3.2. Lenstra-Lenstra-Lovász (LLL) algorithm**

The Lenstra-Lenstra-Lovász (LLL) algorithm was originally published as a lattice reduction algorithm operating on real-valued matrices [8]. Many works use the real decomposition of the complex-valued MIMO transmission model [3, 9]. Lattice reduction methods can operate on both real and complex integer lattices and in particular the LLL algorithm has been extended for complex lattice reduction [10]. The complex LLL (CLLL) algorithm can be summarized as follows. We make the following definitions:

• H*i* is the squared Euclidean norm of the orthogonal vectors produced by the Gram-Schmidt orthogonalization (GSO) of **H**

• *µi*,*j* is the ratio of the length of the orthogonal projection of the *i*th basis vector onto the *j*th orthogonal vector and the length of the *j*th orthogonal vector

• *k* is the index of the current column of **H** being processed, such that 2 ≤ *k* ≤ *n*T

• **H**<sup>*i*</sup>L and **T**<sup>*i*</sup> represent the values of the reduced basis and transform after the *i*th step of the LLL algorithm

• Initially, **H**<sup>0</sup>L = **H** and **T**<sup>0</sup> = **I***n*T

The LLL algorithm consists of three basic steps:

1. H and *µ* are computed using a modified GSO procedure [11]
2. Size reduction aims to make basis vectors shorter and more orthogonal by asserting the condition that |ℜ(*µk*,*j*)| ≤ 0.5 and |ℑ(*µk*,*j*)| ≤ 0.5 for all *j* < *k*
3. Basis vectors **h***k*−1 and **h***k* are swapped if a so-called *swapping condition* is satisfied, such that size reduction can be repeated to make basis vectors shorter

Size reduction and basis vector swapping iterate until the swapping condition is no longer satisfied by any pair of **h***k*−1 and **h***k*. The resultant basis is then said to be *reduced*. The swapping condition for LLL reduction, also called the *Lovász condition*, is:

$$\mathcal{H}_k < (\delta - |\mu_{k,k-1}|^2)\,\mathcal{H}_{k-1}, \tag{7}$$

where *δ*, satisfying 1/4 < *δ* < 1, is a factor selected to achieve an acceptable quality-complexity trade-off [8].

After each swapping step, H*k*−1, H*k* and some of the *µi*,*j* values need to be updated. Techniques can be employed to minimize the number and frequency of recalculations of H and *µ* elements [11]. The LLL algorithm is detailed in Algorithm 1, and the size reduction routine it calls is detailed in Algorithm 2.

**Algorithm 1** [**H**, **T**] ⇐ LLL(**H**, *δ*)

**Input:** **H** ∈ **C**<sup>*n*×*m*</sup> and *δ* ∈ **R**
**T** ⇐ **I***n*, *k* ⇐ 2
**for** *j* = 1 to *n* **do**
  H*j* ⇐ ⟨**h***j*, **h***j*⟩
**end for**
**for** *j* = 1 to *n* **do**
  **for** *i* = *j* + 1 to *n* **do**
    *µi*,*j* ⇐ (1/H*j*) (⟨**h***i*, **h***j*⟩ − ∑ *l*=1..*j*−1 *µ*<sup>H</sup>*j*,*l* *µi*,*l* H*l*)
    H*i* ⇐ H*i* − |*µi*,*j*|<sup>2</sup> H*j*
  **end for**
**end for**
**while** *k* ≤ *n* **do**
  [**H**, **T**, *µ*] ⇐ Reduce(**H**, **T**, *µ*, *k*, *k*−1) // size reduction
  **if** H*k* < (*δ* − |*µk*,*k*−1|<sup>2</sup>) H*k*−1 **then** // Lovász condition check
    Swap columns *k* and *k*−1 of **H** and **T**
    Update H and *µ*, where H˙ and *µ*˙ denote the new values:
      H˙ *k*−1 = H*k* + |*µk*,*k*−1|<sup>2</sup> H*k*−1
      *µ*˙ *k*,*k*−1 = *µ*<sup>H</sup>*k*,*k*−1 H*k*−1 / H˙ *k*−1
      H˙ *k* = H*k*−1 H*k* / H˙ *k*−1
      *µ*˙ *i*,*k*−1 = *µi*,*k*−1 *µ*˙ *k*,*k*−1 + *µi*,*k* H*k* / H˙ *k*−1 // *i* = *k*+1 to *n*
      *µ*˙ *i*,*k* = *µi*,*k*−1 − *µi*,*k* *µk*,*k*−1 // *i* = *k*+1 to *n*
      *µ*˙ *k*−1,*j* = *µk*,*j* // *j* = 1 to *k*−2
      *µ*˙ *k*,*j* = *µk*−1,*j* // *j* = 1 to *k*−2
    *k* = max(2, *k*−1)
  **else**
    **for** *j* = *k*−2 **downto** 1 **do**
      [**H**, **T**, *µ*] ⇐ Reduce(**H**, **T**, *µ*, *k*, *j*) // size reduction
    **end for**
    *k* = *k* + 1
  **end if**
**end while**

**Algorithm 2** [**H**, **T**, *µ*] ⇐ Reduce(**H**, **T**, *µ*, *k*, *j*)

**if** |ℜ(*µk*,*j*)| > 1/2 **or** |ℑ(*µk*,*j*)| > 1/2 **then**
  *c* ⇐ ⌊*µk*,*j*⌉
  **H***k* ⇐ **H***k* − *c* **H***j*
  **T***k* ⇐ **T***k* − *c* **T***j*
  **for** *l* = 1 **to** *j* **do**
    *µk*,*l* ⇐ *µk*,*l* − *c µj*,*l*
  **end for**
**end if**

**Example 3.1.** *Suppose δ* = 3/4 *and*

$$\mathbf{H} = \begin{bmatrix} 0.75 & -0.5 \\ 0.5 & -0.5 \end{bmatrix}.$$

*Then*

$$\mathbf{H}^{0}_{L} = \begin{bmatrix} 0.75 & -0.5 \\ 0.5 & -0.5 \end{bmatrix} \text{ and } \mathbf{T}^{0} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},$$

*and from the modified GSO*

$$\mu = \begin{bmatrix} 1.0000 & 0.0000 \\ -0.7692 & 1.0000 \end{bmatrix} \text{ and } \mathcal{H} = \begin{bmatrix} 0.8125 \\ 0.0192 \end{bmatrix}.$$

*Starting with columns 1 and 2, as* |*µ*2,1| > 0.5*, size reduction is performed on these columns, adding the first column to the second and yielding the following partially reduced matrix and corresponding transform:*


$$\mathbf{H}^{1}_{L} = \begin{bmatrix} 0.75 & 0.25 \\ 0.5 & 0 \end{bmatrix} \text{ and } \mathbf{T}^{1} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix},$$

$$\mu = \begin{bmatrix} 1.0000 & 0.0000 \\ 0.2308 & 1.0000 \end{bmatrix} \text{ and } \mathcal{H} = \begin{bmatrix} 0.8125 \\ 0.0192 \end{bmatrix}.$$

*Next the Lovász condition is checked and, since* H<sup>2</sup> < (*δ* − |*µ*2,1| <sup>2</sup>)H1*, the two columns are swapped, yielding:*

$$\mathbf{H}^{2}_{L} = \begin{bmatrix} 0.25 & 0.75 \\ 0 & 0.5 \end{bmatrix} \text{ and } \mathbf{T}^{2} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix},$$

$$\mu = \begin{bmatrix} 1.0000 & 0.0000 \\ 3.0000 & 1.0000 \end{bmatrix} \text{ and } \mathcal{H} = \begin{bmatrix} 0.0625 \\ 0.2500 \end{bmatrix}.$$

*Size reduction is then performed on the columns once more; this time by subtracting three times the first column from the second we have:*

$$\mathbf{H}_{L} = \begin{bmatrix} 0.25 & 0 \\ 0 & 0.5 \end{bmatrix} \text{ and } \mathbf{T} = \begin{bmatrix} 1 & -2 \\ 1 & -3 \end{bmatrix},$$

$$\mu = \begin{bmatrix} 1.0000 & 0.0000 \\ 0.0000 & 1.0000 \end{bmatrix} \text{ and } \mathcal{H} = \begin{bmatrix} 0.0625 \\ 0.2500 \end{bmatrix}.$$

*The Lovász condition* (7) *is now satisfied, and the algorithm terminates.*
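Example 3.1 can be reproduced with the short implementation below, a sketch of our own for the real-valued case: it follows the structure of Algorithm 1 (size-reduce against column *k*−1, check the Lovász condition, swap or fully size-reduce), but simply recomputes the GSO after each change instead of applying the incremental updates. For the matrix of Example 3.1 it returns the reduced basis **H**L and unimodular transform **T** found above.

```python
import numpy as np

def gso(H):
    # Gram-Schmidt orthogonalization: mu[i, j] = <h_i, q_j> / <q_j, q_j>;
    # Q holds the orthogonal (not normalized) vectors.
    n = H.shape[1]
    Q = H.astype(float).copy()
    mu = np.eye(n)
    for i in range(n):
        for j in range(i):
            mu[i, j] = H[:, i] @ Q[:, j] / (Q[:, j] @ Q[:, j])
            Q[:, i] -= mu[i, j] * Q[:, j]
    return Q, mu

def lll(H, delta=0.75):
    # Real-valued LLL mirroring Algorithm 1's structure, but recomputing the
    # GSO after every change instead of using the incremental mu/H updates.
    H = H.astype(float).copy()
    n = H.shape[1]
    T = np.eye(n)
    k = 1                                   # 0-based; column 2 in the chapter
    while k < n:
        Q, mu = gso(H)
        c = int(round(mu[k, k - 1]))        # size-reduce column k against k-1
        if c:
            H[:, k] -= c * H[:, k - 1]
            T[:, k] -= c * T[:, k - 1]
            Q, mu = gso(H)
        norms = (Q ** 2).sum(axis=0)        # the script-H values in (7)
        if norms[k] < (delta - mu[k, k - 1] ** 2) * norms[k - 1]:  # Lovász
            H[:, [k - 1, k]] = H[:, [k, k - 1]]
            T[:, [k - 1, k]] = T[:, [k, k - 1]]
            k = max(1, k - 1)
        else:
            for j in range(k - 2, -1, -1):  # reduce against earlier columns
                c = int(round(gso(H)[1][k, j]))
                if c:
                    H[:, k] -= c * H[:, j]
                    T[:, k] -= c * T[:, j]
            k += 1
    return H, T

H = np.array([[0.75, -0.5], [0.5, -0.5]])
HL, T = lll(H)
print(HL)                                   # [[0.25, 0.], [0., 0.5]]
print(T)                                    # [[1., -2.], [1., -3.]]
print(abs(round(np.linalg.det(T))))         # 1: T is unimodular
```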

#### **3.3. Orthogonality defect**

The orthogonality of a matrix **H** can be quantified using the *orthogonality defect*, defined as [4, §4.6.2]:

$$\delta(\mathbf{H}) = \frac{\prod_{k=1}^{n_{\mathrm{T}}} \|\mathbf{h}_{k}\|}{\sqrt{\det(\mathbf{H}^{H}\mathbf{H})}}, \tag{8}$$


where **h***k* is the *k*th column of **H**, *δ*(**H**) ≥ 1 for all **H**, and *δ*(**H**) = 1 if and only if the columns of **H** are orthogonal. When the number of columns and rows of **H** are equal, the denominator can be simplified to |det(**H**)|. From (8), matrices with correlated columns or larger column norms will result in higher orthogonality defects. This also causes their inverse or generalized inverse to have larger row norms, leading to noise enhancement. As will be shown in Section 4, matrices with a lower orthogonality defect therefore induce less noise enhancement in ZF- or MMSE-based detectors, so the probability of error, for example as calculated in (15), can be reduced.
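Equation (8) translates directly into a few lines of code. The sketch below, ours, evaluates the defect for the basis of Example 3.1 before and after reduction:

```python
import numpy as np

def orthogonality_defect(H):
    # delta(H) = prod_k ||h_k|| / sqrt(det(H^H H)), per (8); it equals 1
    # if and only if the columns of H are orthogonal.
    gram = H.conj().T @ H
    return np.prod(np.linalg.norm(H, axis=0)) / np.sqrt(np.linalg.det(gram))

H  = np.array([[0.75, -0.5], [0.5, -0.5]])   # original basis of Example 3.1
HL = np.array([[0.25,  0.0], [0.0,  0.5]])   # its LLL-reduced equivalent
print(round(orthogonality_defect(H), 4))     # 5.099: strongly correlated columns
print(round(orthogonality_defect(HL), 4))    # 1.0: orthogonal columns
```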


To illustrate the impact of lattice reduction on orthogonality defect, we generated 10<sup>6</sup> randomly chosen **H** ∈ ℂ<sup>4×4</sup> and computed the lattice-reduced equivalent **H**L. The orthogonality defect was calculated using (8) both before and after lattice reduction. The results are presented in the form of cumulative distributions in Fig. 4, where the effect of lattice reduction on orthogonality defect is clearly apparent. Lattice basis reduction has also been shown to improve matrix conditioning [12]. It is this improvement that reduces noise enhancement in linear detection methods and reduces the error rate of LRAD-based systems.

Numerous researchers have investigated and compared the application of various lattice reduction algorithms for MIMO detection. In addition to the LLL algorithm, these include Korkine–Zolotarev (KZ) [13], and Seysen's [14] lattice reduction algorithms; see [2] and the references therein for applications to MIMO detection. In this chapter we restrict attention to the LLL algorithm, since numerous simulation studies suggest that lattice-reduction-aided detection is well suited to low-complexity MIMO receivers when large constellations are used [15, 16].
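The experiment of Fig. 4 is straightforward to reproduce in software. The sketch below (ours; assuming numpy, with all function names ours) is a textbook complex LLL with the size reduction and Lovász-condition steps written out. It recomputes the Gram-Schmidt coefficients each iteration for clarity rather than efficiency, so it illustrates the algorithm's structure, not the hardware-oriented variants discussed in Section 6:

```python
import numpy as np

def gram_schmidt(B):
    """Classical Gram-Schmidt over the columns of B; returns Q and coefficients mu."""
    n = B.shape[1]
    Q = B.astype(complex).copy()
    mu = np.eye(n, dtype=complex)
    for k in range(n):
        for j in range(k):
            mu[k, j] = (Q[:, j].conj() @ B[:, k]) / (np.linalg.norm(Q[:, j]) ** 2)
            Q[:, k] -= mu[k, j] * Q[:, j]
    return Q, mu

def clll(H, delta=0.75):
    """Textbook complex LLL: returns (HL, T) with HL = H @ T and T unimodular."""
    B = H.astype(complex).copy()
    n = B.shape[1]
    T = np.eye(n, dtype=complex)
    k = 1
    while k < n:
        Q, mu = gram_schmidt(B)
        for j in range(k - 1, -1, -1):              # size reduction of column k
            c = np.round(mu[k, j].real) + 1j * np.round(mu[k, j].imag)
            if c != 0:
                B[:, k] -= c * B[:, j]
                T[:, k] -= c * T[:, j]
                mu[k, :j + 1] -= c * mu[j, :j + 1]
        Q, mu = gram_schmidt(B)                     # refresh after size reduction
        lovasz = (delta - abs(mu[k, k - 1]) ** 2) * np.linalg.norm(Q[:, k - 1]) ** 2
        if np.linalg.norm(Q[:, k]) ** 2 >= lovasz:  # Lovász condition (7) holds
            k += 1
        else:                                       # swap the basis vectors, step back
            B[:, [k - 1, k]] = B[:, [k, k - 1]]
            T[:, [k - 1, k]] = T[:, [k, k - 1]]
            k = max(k - 1, 1)
    return B, T

def defect(H):
    g = H.conj().T @ H
    return np.prod(np.linalg.norm(H, axis=0)) / np.sqrt(np.linalg.det(g).real)

rng = np.random.default_rng(0)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
HL, T = clll(H)
print(defect(H), defect(HL))   # the reduced basis typically has a much lower defect
print(np.allclose(HL, H @ T))  # True: HL = H T by construction
```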

**Figure 4.** Cumulative distributions of the orthogonality defect for non-reduced and reduced basis channel matrices

#### **4. Hard detection using lattice reduction**

Detectors which output an estimate of the most likely vector of transmitted symbols are said to be *hard output* detectors. Hard estimates are denoted **b̂** for bit vector estimates and **x̂** for symbol vector estimates. Detectors which generate not just a vector of bit estimates but also an estimate of the *probability* that a given transmitted bit was a 1 or a 0 are said to be *soft output* detectors. Soft output detectors provide a significant benefit when combined with channel coding schemes which make use of soft information, such as turbo codes or low-density parity-check (LDPC) codes, but typically increase receiver complexity by a significant degree.

#### **4.1. Maximum-Likelihood detection**

The maximum-likelihood (ML) detector selects from the set of possible transmitted symbol vectors **x** ∈ A<sup>*n*T</sup> the vector **x̂**ML which minimizes the Euclidean distance to the receive vector:

$$
\hat{\mathbf{x}}_{\mathrm{ML}} = \underset{\mathbf{x} \in \mathcal{A}^{n_{\mathrm{T}}}}{\operatorname{arg\,min}} \left\| \mathbf{y} - \mathbf{H}\mathbf{x} \right\|^2. \tag{9}
$$


This is achieved by exhaustively examining all possible transmit vectors; see Algorithm 3. Whilst the ML detection algorithm is conceptually simple, its complexity is exponential in the size of the constellation and the number of transmit antennas, and it is therefore practical for real-time hardware implementation only in the simplest of settings. As the optimal detector, the performance of the ML detector serves as a benchmark for the detection schemes of the following sections.

**Algorithm 3** ML Algorithm - [**x̂**, **b̂**] ⇐ MLdetect (**H**, *σ*², **y**)

```
emin ⇐ inf
for b = 0 to 2^nBPT − 1 do        // 2^nBPT iterations
  x ⇐ mod (b)
  e ⇐ ‖y − Hx‖²                   // 4nRnT + 2nR M, 4nRnT + 4nR A
  if e < emin then
    emin = e
    x̂ml = x
    b̂ml = b
  end if
end for
```
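As a reference point, a brute-force ML detector per (9) and Algorithm 3 can be sketched as follows (ours; assuming numpy and, for illustration, a 4-QAM alphabet; `ml_detect` is a hypothetical helper name). The candidate loop makes the exponential complexity plain:

```python
import numpy as np
from itertools import product

def ml_detect(H, y, alphabet):
    """Brute-force ML detection of (9): exhaustive search over all |A|^nT vectors."""
    nT = H.shape[1]
    best, best_e = None, np.inf
    for cand in product(alphabet, repeat=nT):
        x = np.array(cand)
        e = np.linalg.norm(y - H @ x) ** 2
        if e < best_e:
            best, best_e = x, e
    return best

qam4 = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j])   # 4-QAM alphabet
rng = np.random.default_rng(1)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
x = rng.choice(qam4, 4)
y = H @ x + 0.05 * (rng.standard_normal(4) + 1j * rng.standard_normal(4))
print(np.array_equal(ml_detect(H, y, qam4), x))  # 4^4 = 256 candidates examined
```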

#### **4.2. Zero Forcing estimation**

The most straightforward linear detection scheme is *zero forcing* (ZF), also known as least squares estimation, which works to reverse the effect of the MIMO channel matrix on the transmitted symbols. It is referred to as zero forcing because the interference caused by **H** is forced to zero by multiplying the received vector **y** by **W**ZF, the inverse (or generalized inverse) of the channel matrix, which yields the least squares solution to (1):

$$
\tilde{\mathbf{x}}_{\mathrm{ZF}} = \mathbf{W}_{\mathrm{ZF}}\,\mathbf{y}. \tag{10}
$$

We use the notation **x̃** to represent an unconstrained estimate of the vector of transmitted symbols. The likelihood that **x̃** falls exactly on a constellation point is negligibly small, and so the nearest valid constellation point must be found. ZF finds the estimate of the vector of transmitted symbols **x̂**ZF as follows:

$$\hat{\mathbf{x}}_{\mathrm{ZF}} = \underset{\mathbf{x} \in \mathcal{A}^{n_{\mathrm{T}}}}{\operatorname{arg\,min}} \left\| \mathbf{W}_{\mathrm{ZF}}\mathbf{y} - \mathbf{x} \right\|^2 \tag{11}$$

where **x̂**ZF is found by independently rounding each element of **x̃** to the nearest constellation point. The vector **x̂**ZF can then be demodulated to find **b̂**ZF, an estimate of the vector of transmitted bits, as shown in Algorithm 4.

There are numerous methods to find the least squares solution to (1), including those that directly calculate the matrix **W**ZF. In this chapter, we utilize the well known Moore-Penrose pseudoinverse:

$$\mathbf{W}\_{\rm ZF} = \left(\mathbf{H}^{H}\mathbf{H}\right)^{-1}\mathbf{H}^{H}.\tag{12}$$

**Algorithm 4** ZF Algorithm - [**x̂**, **b̂**] ⇐ ZFdetect (**H**, *σ*², **y**)

```
WZF ⇐ (H^H H)^−1 H^H
x̃ ⇐ WZF y                        // 4nRnT M, 4nRnT A
δ ⇐ [1 + i]/2
x̂ZF ⇐ round (x̃, δ)
b̂ZF ⇐ demod (x̂ZF)
```
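A direct software rendering of the ZF steps (ours, in the spirit of Algorithm 4; assuming numpy and, for illustration, a 16-QAM grid with levels ±1, ±3; the bit demodulation step is omitted):

```python
import numpy as np

def zf_detect(H, y, levels=np.array([-3.0, -1.0, 1.0, 3.0])):
    """ZF detection in the spirit of Algorithm 4: pseudoinverse, then per-element
    rounding of the unconstrained estimate (10) to a 16-QAM grid."""
    W = np.linalg.pinv(H)        # equals (H^H H)^{-1} H^H for full column rank
    x_soft = W @ y               # unconstrained estimate x~
    def q(v):                    # nearest constellation level per real dimension
        return levels[np.argmin(np.abs(levels - v))]
    return np.array([q(v.real) + 1j * q(v.imag) for v in x_soft])
```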


#### **4.3. Noise enhancement**

Whilst ZF completely reverses the effects of the MIMO channel matrix, if the columns of **H** are correlated, ZF will amplify or enhance the noise. By identifying that **W**ZF**H** = **I** and then multiplying (1) by **W**ZF we can calculate the effective additive noise component of the estimated vector of transmitted symbols:

$$
\tilde{\mathbf{x}}_{\mathrm{ZF}} = \mathbf{x} + \mathbf{W}_{\mathrm{ZF}}\,\mathbf{n}. \tag{13}
$$

It is intuitive that the noise existing in the unconstrained transmit symbol estimate **x̃**ZF is **W**ZF**n**. When the rows of **W**ZF have large Euclidean norms, multiplication of the received vector leads to the additive noise component in **y** being amplified. We can now show how a poorly conditioned or correlated channel matrix will result in significant noise enhancement in ZF by examining the probability of error:

$$\mathbf{e} = \tilde{\mathbf{x}} - \mathbf{x} = \mathbf{W}_{\mathrm{ZF}}\,\mathbf{n} \tag{14}$$

$$p_{e} = \operatorname{diag}\left(\mathbb{E}_{\mathbf{n}}\left[\mathbf{e}\mathbf{e}^{H}\right]\right) = \sigma^{2}\operatorname{diag}\left(\left(\mathbf{H}^{H}\mathbf{H}\right)^{-1}\right) \tag{15}$$

Existing work [17] has looked at the statistical properties of the channel matrix, and in particular the effect of this noise enhancement, leading to a tight analytical bound on the performance of ZF detectors in Rayleigh fading channels.
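The effect is easy to reproduce numerically. The sketch below (ours; assuming numpy) evaluates the per-stream error variances of (15) for a random channel and for a copy whose second column is made nearly parallel to the first:

```python
import numpy as np

rng = np.random.default_rng(2)
H = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
H_bad = H.copy()
H_bad[:, 1] = H_bad[:, 0] + 0.05 * H_bad[:, 1]   # make two columns nearly parallel

sigma2 = 0.1
for M in (H, H_bad):
    # per-stream error variances of (15)
    pe = sigma2 * np.diag(np.linalg.inv(M.conj().T @ M)).real
    print(pe.round(3))
# the correlated matrix yields error variances that are orders of magnitude larger
```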

#### **4.4. Minimum Mean-Square Error (MMSE) estimation**

MMSE estimation acts to balance the reduction of the interference caused by **H** against the noise enhancement due to correlation of the columns of **H**. Rather than completely removing the effect of the MIMO channel, MMSE estimation finds the filter **W** which minimizes the criterion:

$$\mathbf{W}_{\mathrm{MMSE}} = \underset{\mathbf{W}}{\operatorname{arg\,min}}\; \mathbb{E}\left[\left\| \mathbf{W}\mathbf{y} - \mathbf{x} \right\|^2\right]. \tag{16}$$


The solution to (16) is the well-known MMSE estimator, also known as the Wiener filter:

$$\mathbf{W}_{\mathrm{MMSE}} = \left(\mathbf{H}^{H}\mathbf{H} + \sigma^{2}\mathbf{I}_{n_{\mathrm{T}}}\right)^{-1}\mathbf{H}^{H} \tag{17}$$

$$= \begin{bmatrix} \mathbf{H} \\ \sigma\mathbf{I} \end{bmatrix}^{\dagger} \tag{18}$$

The shorthand notation of (18) was first proposed in [18] and is referred to as the *extended channel matrix*, which in this chapter is denoted

$$
\overline{\mathbf{H}} = \begin{bmatrix} \mathbf{H} \\ \sigma \mathbf{I} \end{bmatrix}.\tag{19}
$$

Similarly to ZF detection, MMSE detection finds the estimate of the vector of transmitted symbols **x̂**MMSE as follows:

$$\hat{\mathbf{x}}_{\mathrm{MMSE}} = \underset{\mathbf{x} \in \mathcal{A}^{n_{\mathrm{T}}}}{\operatorname{arg\,min}} \left\| \mathbf{W}_{\mathrm{MMSE}}\mathbf{y} - \mathbf{x} \right\|^2 \tag{20}$$

where **x̂**MMSE is found by independently rounding each element of **x̃** to the nearest constellation point. It is well known that as the noise term approaches zero (at high signal-to-noise ratios), the MMSE estimator becomes equivalent to a ZF estimator.

Compared to ZF detection, MMSE results on average in less noise enhancement, as **H̄** is better conditioned. This can be seen intuitively as a result of adding a diagonal matrix relating to the noise variance as in (17), or alternatively as a consequence of the stacked structure of (18) resulting in a decrease in correlation. Unlike ZF, however, MMSE does not perfectly reverse or remove the interference of **H**, leading to interference between the otherwise independent transmit antennas. As with ZF, analytical performance bounds for MMSE detectors have been developed [17, 19] for various channel models.

Utilizing the shorthand notation of the extended channel matrix of (18), ZF detection can be readily extended to perform MMSE detection, as shown in Algorithm 5. Note that due to the extra rows of **H̄** as compared to **H**, the computational complexity of calculating **W**MMSE is roughly double that of **W**ZF.

**Algorithm 5** MMSE Algorithm - [**x̂**, **b̂**] ⇐ MMSEdetect (**H**, *σ*², **y**)

```
H̄ ⇐ [H ; σI]
[x̂MMSE, b̂MMSE] ⇐ ZFdetect (H̄, σ², y)
```
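The equivalence between the Wiener filter (17) and the pseudoinverse of the extended channel matrix (18)-(19), which is what allows Algorithm 5 to simply call ZFdetect, can be checked numerically (a sketch of ours, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
nR = nT = 4
sigma = 0.3
H = (rng.standard_normal((nR, nT)) + 1j * rng.standard_normal((nR, nT))) / np.sqrt(2)

W17 = np.linalg.inv(H.conj().T @ H + sigma**2 * np.eye(nT)) @ H.conj().T  # Wiener filter (17)

# Extended channel matrix (19): applying the pseudoinverse (18) to y padded with
# nT zeros gives exactly W17 @ y, so ZF machinery can be reused unchanged.
H_ext = np.vstack([H, sigma * np.eye(nT)])
W18 = np.linalg.pinv(H_ext)[:, :nR]

print(np.allclose(W17, W18))  # True
```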


#### **4.5. Detection using Lattice Reduction**

Lattice basis reduction [20, §2.6.1] reduces the orthogonality defect, thereby reducing noise enhancement. This is achieved by finding a closer-to-orthogonal set of basis vectors. The reduced lattice basis is found by optimizing the generating matrix, which in the present application is a MIMO channel matrix realization, using elementary operations on the basis vectors. Complex integer linear combinations of the column vectors of **H** are taken to form the reduced matrix **H**L, which spans the same set of points, **HX**<sup>*n*T</sup> ≡ **H**L**X**<sup>*n*T</sup>, and so

$$\mathbf{H}_{\mathrm{L}} = \mathbf{H}\mathbf{T} \quad \text{or} \quad \mathbf{H} = \mathbf{H}_{\mathrm{L}}\mathbf{T}^{-1}, \tag{21}$$

where **T** is a unimodular matrix with complex integer entries and det(**T**) = ±1; therefore **T**<sup>−1</sup> also contains only complex integer entries.

As in [3], by finding an equivalent and closer-to-orthogonal set of basis vectors, **H**L, noise enhancement is reduced when quantization is performed. Importantly, as **T**<sup>−1</sup> and **x̄** both contain only integer spaced entries, so does **T**<sup>−1</sup>**x̄**, and so symbol detection or quantization is merely rounding to the grid **X**.

Once the lattice reduced channel matrix is found, we then calculate the pseudoinverse as would be done in ZF or MMSE detection. LRAD therefore operates using the following steps, which are adapted from [3] and detailed in [21]:

1. Find the reduced lattice basis.
2. Use the pseudoinverse of the reduced basis to form estimates.
3. Quantize estimates to **X**.
4. Transform and bound points to constellation points.
As shown in Algorithm 6, received vectors **y** are multiplied with the pseudoinverse of the reduced basis **H**L to find a soft estimate of the vector of transmitted symbols in the reduced domain. These symbols are then quantized to an integer grid. (Depending on the transform generated, this integer grid may be offset by a half in both real and imaginary dimensions.) These hard estimates are then transformed, using the transform matrix **T** generated by the LR algorithm, to find an estimate of the vector of transmitted symbols. However, as these symbols may fall outside the range of constellation points, any invalid points are clipped back to the nearest valid constellation point.

**Algorithm 6** LRAD Algorithm - [**x̂**, **b̂**] ⇐ LRADdetect (**H**, *σ*², **y**)

```
[HL, T] ⇐ LR(H)                   // LR is a lattice reduction algorithm such as Algorithm 1
δ ⇐ T[1 + i]/2
x̃ ⇐ H̄L† y
x̂LRZF ⇐ T (round (x̃, δ))          // 4nT² M, 4nT² A
b̂ZF ⇐ demod (x̂LRZF)
```
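A software sketch of the quantize-and-transform steps (ours; assuming numpy; `HL` and `T` would come from a lattice reduction such as the CLLL sketch in Section 3.3, but are supplied directly here so the example is self-contained). For 16-QAM, symbols have the form 2**c** + (1 + *i*)**1** for complex integer vectors **c**, which is where the shift and scale below come from:

```python
import numpy as np

def cround(z):
    """Round real and imaginary parts to the nearest integer."""
    return np.round(z.real) + 1j * np.round(z.imag)

def lrad_detect(HL, T, y, levels=np.array([-3.0, -1.0, 1.0, 3.0])):
    """LRAD in the spirit of Algorithm 6: estimate in the reduced domain,
    quantize to the (offset) integer grid, transform back with T, then clip."""
    nT = T.shape[0]
    shift = np.linalg.solve(T, (1 + 1j) * np.ones(nT))  # T^{-1} (1+i) 1: grid offset
    z = np.linalg.pinv(HL) @ y                          # soft estimate, reduced domain
    z_hat = 2 * cround((z - shift) / 2) + shift         # quantize to the offset grid
    x_hat = T @ z_hat                                   # back to the symbol domain
    def clip(v):                                        # clip to a valid level
        return levels[np.argmin(np.abs(levels - v))]
    return np.array([clip(v.real) + 1j * clip(v.imag) for v in x_hat])

# toy example: a reduced basis HL and a unimodular T define H = HL @ inv(T)
HL = np.array([[1.0 + 0j, 0.1], [0.0, 1.0]])
T = np.array([[1.0 + 0j, 2 + 1j], [0.0, 1.0]])
H = HL @ np.linalg.inv(T)
x = np.array([3 + 1j, -1 - 3j])          # 16-QAM symbols
print(lrad_detect(HL, T, H @ x))         # recovers x (noise-free here)
```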


**Figure 5.** Bit error rate (BER) Performance of ML, LRAD and SLRAD for 4 × 4 MIMO with 16-QAM

#### **5. Subspace-based LRAD**

#### **5.1. Hard-output SLRAD**

For hard estimation, quantization of the ZF or MMSE estimate in the transmit constellation domain is replaced by the same quantization in the lattice reduced domain. The equivalent for soft estimation calls for the calculation of the error induced by quantization in the lattice reduced domain. Unfortunately, just as it is hard to ensure quantization to valid symbols in the lattice reduced domain, it is equally hard to iterate over all possible valid symbols in the lattice reduced domain in order to estimate each bit probability.

Whilst Zhang et al. [22] present a detailed comparison of various soft output based detectors and propose several powerful methods for generating soft output information, there are some key shortcomings, and the performance of the detectors in [22] is only evaluated using QPSK constellations. This is problematic in that a range of wireless communication standards are moving to denser constellations, such as 16-QAM and 64-QAM. This motivates the investigation of lattice reduction based detectors capable of producing *candidate lists*.

The subspace lattice reduction aided detection (SLRAD) approach of Windpassinger [3] forms a subspace of the channel matrix **H** by removing a single column from the channel matrix. This column removal allows the corresponding transmit antenna's symbol estimate to be constrained in order to calculate an estimate for what the other transmit antennae sent. For each transmit antenna a number of symbols is systematically proposed and for each proposal the set of most likely symbols transmitted on the other antennae is calculated, as shown in Algorithm 7.

The SLRAD algorithm therefore creates a list of candidate symbols, the Euclidean distance of each of these candidates from the origin being used to determine the most likely vector of transmitted symbols for a hard-output detector.

Whilst performance of SLRAD is close to that of ML (see Fig. 5), the complexity is proportional only to the sum of the size of the constellations employed on each transmit antenna. Therefore only a modest number of candidate symbols needs to be investigated, even for dense constellations. For example, a system with 4 transmit antennas each utilizing 64-QAM results in only 4 × 64 = 256 candidates.

#### **5.2. Soft-output SLRAD**

As a candidate-based detector, the hard-output SLRAD detector can be extended to generate soft output information. The probability of all the candidates where a bit is one is divided by the probability of all candidates where the bit is zero. An attractive property of subspace detectors is that every bit is guaranteed to have at least one candidate where the bit is a one and likewise a candidate where it is zero. Without this property, it is not possible to accurately form an estimate for the ratio of the bit's value probabilities.

**Algorithm 7** SLRAD Algorithm - [**x̂**, **b̂**] ⇐ SLRADdetect (**H**, *σ*², **y**)

```
emin ⇐ inf
for k = 1 to nT do
  Hs ⇐ H[1...(k−1)(k+1)...nT]
  for all s in Ak do
    ys ⇐ y − hk s                 // 4nR M, 4nR A
    x̂s ⇐ LRADdetect (Hs, ys, σ²)
    x̂[1...(k−1)(k+1)...nT] = x̂s
    x̂k = s
    e ⇐ ‖y − Hx̂‖²
    if e < emin then
      emin = e
      x̂SLR = x̂
    end if
  end for
end for
b̂SLR ⇐ demod (x̂SLR)
```
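The ratio of candidate probabilities described above can be sketched directly (ours; assuming numpy; `llrs_from_candidates` is a hypothetical helper, with the candidate list supplied by a hard SLRAD pass as in Algorithm 7):

```python
import numpy as np

def llrs_from_candidates(cands, bits, H, y, sigma2):
    """Per-bit LLRs from a candidate list: the likelihood mass of candidates
    where each bit is 1, over the mass where it is 0 (in the log domain)."""
    bits = np.asarray(bits)
    w = np.array([np.exp(-np.linalg.norm(y - H @ x) ** 2 / sigma2) for x in cands])
    num = (w[:, None] * (bits == 1)).sum(axis=0)
    den = (w[:, None] * (bits == 0)).sum(axis=0)
    return np.log(num) - np.log(den)   # subspace detection guarantees num, den > 0

# toy usage: two candidates over a scalar channel, BPSK-like bit mapping
H = np.array([[1.0 + 0j]])
y = np.array([0.9 + 0j])
cands = [np.array([1 + 0j]), np.array([-1 + 0j])]
bits = [[1], [0]]
print(llrs_from_candidates(cands, bits, H, y, sigma2=0.5))  # strongly favours bit = 1
```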



**Algorithm 8** Soft Output SLRAD Algorithm - [**L***e*] ⇐ SLRADdetect-soft (**H**, *σ*², **y**)

```
nbit ⇐ 0
dbit ⇐ 0
for k = 1 to nT do
  Hs ⇐ H[1...(k−1)(k+1)...nT]
  for all s in Ak do
    ys ⇐ y − hk s
    x̂s ⇐ LRADdetect (Hs, ys, σ²)
    x̂[1...(k−1)(k+1)...nT] = x̂s
    x̂k = s
    b̂ ⇐ demod (x̂)
    e ⇐ exp (−‖y − Hx̂‖² / σ²)
    for all bits in current bit vector do
      if the current bit is a '1' then
        nbit = nbit + e
      else
        dbit = dbit + e
      end if
    end for
  end for
end for
Le ⇐ log [nbit] − log [dbit]
```
The soft-output SLRAD algorithm is shown in Algorithm 8. This algorithm leads in a natural fashion to the top-level data flow diagram in Fig. 6. The candidate chain block in Fig. 7 performs the following key steps (once for each submatrix of **H** formed by deleting one column from **H**):

1. subspace candidate estimate generation;
2. lattice reduced domain quantization;
3. reversal of the lattice basis transform;
4. bounding to ensure valid constellation symbols; and
5. demodulation and Euclidean distance calculation.


**Figure 6.** Top Level Data Flow Diagram

**Figure 7.** Candidate Chain Data Flow Diagram


#### **6. Hardware implementation**

#### **6.1. Existing work**

The first published VLSI implementation of a lattice reduction aided detector [23] is based on Brun's algorithm for finding integer relations [24]. Brun's algorithm offers lower complexity at a performance cost when compared to the commonly utilized complex LLL algorithm.



Brun's algorithm is criticised in [25] as it achieves inferior performance and no analytical result has been reported to prove the level of diversity that can be achieved. This work applies a uniform scaling factor to the elements of the same matrix or vector to ensure that the magnitudes of the largest real and imaginary parts are as close as possible to, but smaller than one. This pragmatic approach offers a good compromise between true floating-point arithmetic, with its computational overhead, and a simple fixed point arithmetic with significantly reduced dynamic range. However, it appears that no active scaling is performed in the algorithm to prevent numeric overflow. Instead, it is claimed without substantiation that a bound exists which is used to calculate the required number of integer bits.


The work in [26] implements a sorted QR decomposition using Householder CORDIC units to reduce the number of LLL iterations needed. The complex LLL algorithm is used but, as with most LLL implementations, requires the use of divisions, using the Newton–Raphson algorithm, throughout the LLL iterations.

The work of [27] builds on [26] and discusses novel search based extensions to LRAD introduced in [28] which generate a candidate list and therefore soft outputs. However, the hardware implementation does not discuss this and therefore it is presumed that the hardware implementation is hard output. Due to the time-multiplexed complex multiplier pipeline, this approach is forced to rely on the use of priority inversion to prevent deadlocks due to data dependencies. Analysis is not performed on the precision required and in particular magnitude bounding is not performed which results in a large number of integer bits being required.

In [29], the authors build on their prior work [26, 27] by offering several improvements. This revision implements Sorted QRD to reduce the number of LLL swapping steps. Once again, the hardware implementation is presumed to only offer hard outputs as no mention is made of the candidate generation required to form soft outputs nor the hardware required to calculate LLRs. Unlike the prior works, an upper bound of 4 integer bits is identified for the elements of the *R* matrix which offers a significant reduction in the precision required.

Several works [30, 31] make use of systolic arrays in their implementation. This requires careful scheduling to maximize component utilization. The former work makes use of the complex LLL algorithm, whereas the latter extends the LLL algorithm through the use of the Siegel condition to avoid the requirement for division operations.

The field-programmable gate array (FPGA) implementation of [32] implements Clarkson's algorithm variant of LLL [33]. However, this implementation only considers slower off-the-shelf FPGA components, including square root and division operations that have not been optimized. The FPGA and application-specific integrated circuit (ASIC) implementation of [34] claims to achieve a "fivefold improvement in terms of throughput at the cost of only slightly more FPGA resources" over [26] and [32]. This work uses CORDIC units along with a modification of the LLL algorithm, replacing the size-reduction criterion with the reverse Siegel condition. The hard output performance of this implementation is also enhanced by the use of soft interference cancellation (SIC), which requires the use of the sorted QR decomposition.

#### **6.2. Architecture for Subspace Lattice Reduction Aided Detection**

Our proposed architecture implements a soft-output lattice reduction-aided detector based on the subspace LRAD (SLRAD) approach of Windpassinger [3, 4]. The top-level schematic layout is shown in Fig. 8. A key feature of the detector is the separation of channel and data processing sections, shown above and below the dashed line in Fig. 8, respectively. Channel processing is computationally expensive, and includes the decomposition and lattice reduction of the MIMO channel matrix **H**. The separation of channel and data processing therefore enables the receiver to exploit the typically slow variation in channel gains relative to the symbol rate, whereby the output of the computationally expensive channel processing step is used in processing the data spanning multiple data frames.

The channel processing section in Fig. 8 is fed with elements *h*in of the estimated MIMO channel matrix **H** generated by an external MIMO channel estimator (not shown), while channel multiply and accumulator (CMAC) units perform rotations under control of the Givens control unit. Data processing involves the subspace-based detection of incoming received values, in addition to the calculation of soft outputs in the form of log-likelihood ratio (LLR) values. The data processing section is fed elements of the received vector **y**, scaled by automatic gain control (AGC) to ensure that analog-to-digital converters (ADCs) are not saturated, and therefore that fixed-point inputs are within a defined range. The data multiply and accumulate (DMAC) and detection (DET) blocks in Fig. 8 are described in Section 6.4. The outputs of the data processing section are LLR values for the bits corresponding to each vector of transmitted symbols **x**.

Unlike [26, 27], this work implements the Scaled and Decoupled QR (SDQR) Decomposition [35]. The use of the SDQR provides a definitive bound on the required integer precision and allows the number of fractional bits to be varied with a constant and small number of integer bits.

#### **6.3. Channel processing**

#### *6.3.1. Givens Control Unit*


The calculation of the SDQR rotation values is performed by a Givens Control unit. This unit is a single cycle processor which generates the Givens rotation **G** that zeros the element **P***j*,*i* by rotating the *j*th row with the *i*th row of **P** and **Φ**. The Givens Control unit is capable of a throughput of one rotation variable per cycle by calculating a Givens rotation every four cycles. Two rotation variables are emitted in the third cycle (values **G**1,1 and **G**1,2) and the fourth cycle (values **G**2,1 and **G**2,2) of each Givens rotation calculation. The Givens Control unit also maintains the decoupled *k* values and dynamically scales **G** to maintain the scaling not only of *k* but, indirectly, of **P** and the rotated **y**. This processor implements the reciprocal function required for division through the use of Newton–Raphson iterations.
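For illustration, a plain complex Givens rotation that zeros one element, in the spirit of what the Givens Control unit computes and the CMAC units apply (a sketch of ours, assuming numpy; the actual SDQR variant additionally carries the decoupled *k* scaling values, which this sketch omits):

```python
import numpy as np

def givens_zero(P, i, j):
    """Construct the 2x2 complex Givens rotation G that zeros P[j, i] against
    row i, and apply it to rows i and j of P."""
    a, b = P[i, i], P[j, i]
    r = np.hypot(abs(a), abs(b))
    G = np.array([[a.conj(), b.conj()],
                  [-b,       a      ]]) / r
    P[[i, j], :] = G @ P[[i, j], :]
    return G   # in hardware, G11/G12 and G21/G22 are emitted over two cycles

rng = np.random.default_rng(4)
P = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
givens_zero(P, 0, 1)
print(abs(P[1, 0]))  # ~0: the element has been zeroed
```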

#### *6.3.2. Channel MAC (CMAC) Unit*

The application of rotation operations is performed by processor units referred to as Channel Multiply and Accumulator (CMAC) units. Each CMAC unit includes sufficient register space to store a full column of the MIMO channel **H** as well as the necessary intermediate values. All input, output and stored register values are complex numbers specified using custom extensions to the VHDL fixed point math package. The arithmetic implemented comprises a complex multiplier and a complex addition unit, with the output of the multiplier being one of the operands of the adder, as shown in Fig. 9.


**Figure 8.** Top-level schematic layout. Channel and data processing sections are shown above and below the dashed line, respectively

**Figure 9.** Custom Multiply and Accumulate Schematic Layout


Whilst this architecture greatly simplifies the challenge of processor unit scheduling, the units are still unavoidably under-utilized. The CMAC units become unused once their corresponding column of **H** is fully zeroed. As a result, the CMAC unit corresponding to the *i*th column of **H** is in use for *i*/*n*T of the SDQR execution period.

The CMAC units provide outputs which feed a multiplexer, as shown in Fig. 8. This allows register values to be transferred between CMAC units by feeding the output of one unit to the input of another, as is required in order to perform back substitution.

#### **6.4. Data processing**

#### *6.4.1. Data MAC (DMAC) Unit*

Processor units referred to as Data Multiply and Accumulator (DMAC) units are implemented to apply Givens rotation operations to the received vector **y**. Each DMAC unit includes sufficient register space to store a full vector of received values **y** as well as the necessary intermediate values.

Multiple DMAC units are implemented so that the rotations required to apply a Givens rotation to a full row of **H** can be performed in parallel. This avoids stalling not only the Givens Control unit but also the DMAC units whilst each row element of **H** is rotated.



Multiple DMAC units are implemented to achieve the necessary data throughput rate, such that a single rotation operation can be applied to multiple received vectors in parallel. This builds on the presumption that the MIMO channel is approximately constant for multiple symbol periods. Given a sufficiently static MIMO channel, any number of DMAC units can be implemented. This allows a linear scaling of data throughput by simply adding more DMAC units, a key design feature of the proposed architecture.

#### *6.4.2. H&T Register File*


As well as being loaded into CMAC units, when a new **H** is loaded into the processor, it is cached in the H&T register file. This is done to provide a copy of **H** for use when calculating the Euclidean distance of candidate estimates. The H&T register file is also used to store **T**, the lattice basis required to translate candidate estimates from the reduced basis prior to demodulation.

#### *6.4.3. Candidate Detection (DET)*

Each DMAC unit feeds a symbol detection chain which performs candidate generation and finally bitwise log-likelihood accumulation. This implements the data flow detailed in Fig. 7.

#### *6.4.4. Log-likelihood ratio (LLR) Accumulator*

Once a list of vectors of transmit symbol candidates has been generated, the probability of each of these vectors needs to be computed. Many approaches exist that avoid the need to implement the log operations inherent in the calculation of log-likelihood ratio (LLR) values. We implement the shifting method Log-MAP algorithm presented in [36], which utilizes the following piecewise linear approximation:

$$f(x) = \begin{cases} 0.70 - x/2 & 0.00 \le x < 0.51 \\ 0.57 - x/4 & 0.51 \le x < 1.44 \\ 0.39 - x/8 & 1.44 \le x < 2.88 \\ 0.03 & 2.88 \le x < 4.00 \\ 0.00 & 4.00 \le x \end{cases} \tag{22}$$
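For illustration, the approximation in (22) is easy to evaluate in software; the sketch below (ours, not the cited hardware) uses it inside the usual log-domain max* operation, since (22) approximates the Jacobian-logarithm correction term ln(1 + e<sup>−x</sup>):

```python
import math

def f_approx(x: float) -> float:
    """Piecewise linear approximation of ln(1 + exp(-x)) from Eq. (22)."""
    if x < 0.51:
        return 0.70 - x / 2
    if x < 1.44:
        return 0.57 - x / 4
    if x < 2.88:
        return 0.39 - x / 8
    if x < 4.00:
        return 0.03
    return 0.0

def max_star(a: float, b: float) -> float:
    """Log-MAP 'max*' operation: ln(e^a + e^b) = max(a, b) + f(|a - b|)."""
    return max(a, b) + f_approx(abs(a - b))

# Compare against the exact Jacobian logarithm on a few points.
for a, b in [(0.0, 0.0), (1.0, -0.5), (3.0, 2.2)]:
    exact = math.log(math.exp(a) + math.exp(b))
    print(a, b, max_star(a, b), exact)
```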


The schematic for the LLR block is shown in Fig. 10.

**Figure 10.** Log-likelihood ratio (LLR) Marginalization Schematic Layout

#### **6.5. Processor instruction set**

The overall architecture is a microcode-based system with detailed low-level micro-operations that combine to implement higher-level complex machine instructions. Each component, including the Givens Control Unit, the CMACs, the DMACs, and the Detection Chains, has its own micro-operations. The benefit is a flexible architecture capable of implementing the SLRAD algorithm, but which is also able to switch to simpler LRAD or even ZF algorithms based on the prevailing channel conditions.

#### *6.5.1. Control Unit Micro-operations*

The bulk of the channel processing involves the execution of the four operations that generate the Givens rotation **G**. The first two, C1 and C2, calculate the new values for **P***j*,*<sup>i</sup>* and *kj*; the third, C3, updates *ki* and calculates **G**2,1 and **G**2,2; and the fourth, C4, updates **P***i*,*<sup>i</sup>* and calculates **G**1,1 and **G**1,2. The control unit also includes an operation CR which computes the reciprocal of the value *kj* as needed. This supports back substitution as well as part of the LLL algorithm and is implemented using the Newton–Raphson algorithm. Finally, the control unit also performs other operations to marshal data between channel processing and data processing.
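As a hedged illustration of the CR operation, the classic Newton–Raphson reciprocal iteration looks as follows in software (a sketch assuming a normalized input in [0.5, 1] and a fixed iteration count; the chapter does not specify these details):

```python
def nr_reciprocal(a: float, iters: int = 3) -> float:
    """Newton-Raphson iteration for 1/a: x <- x * (2 - a * x).

    Assumes a has been normalized into [0.5, 1], as is typical for a
    hardware reciprocal unit; the seed is the classical linear initial
    guess 48/17 - 32/17 * a for that interval. Each iteration roughly
    doubles the number of correct bits."""
    x = 48.0 / 17.0 - 32.0 / 17.0 * a
    for _ in range(iters):
        x = x * (2.0 - a * x)
    return x

print(nr_reciprocal(0.75), 1.0 / 0.75)  # converges to 1.3333...
```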

#### *6.5.2. CMAC and DMAC Micro-operations*


For the CMAC and DMAC units, the micro-operations and their corresponding complex machine instructions are detailed in Table 1. In this table, the first column lists the micro-operation code, the second column describes the operation, the third column lists the value provided on the input *I* of the CMAC and DMAC units, and the final two columns detail the implemented function.


| | Description | *I* | CMAC | DMAC |
|---|---|---|---|---|
| **Data Load** | | | | |
| LC | Load elements of **H** | **H**<sub>m,n</sub> | **P**<sub>m,n</sub> = *I* | - |
| LD | Load elements of **y** | **y**<sub>x</sub> | - | **Φ**<sub>x</sub> = *I* |
| **Givens Rotation for SDQR and Basis Vector Swap Update** | | | | |
| G1 | Multiply by **G**<sub>2,1</sub> | - | *A* = *G* × **P**<sub>i</sub> + 0 | *A* = *G* × **Φ**<sub>i</sub> + 0 |
| G2 | Multiply by **G**<sub>2,2</sub> and add | - | *A* = *G* × **P**<sub>j</sub> + *A* | *A* = *G* × **Φ**<sub>j</sub> + *A* |
| G3 | Multiply by **G**<sub>1,2</sub> | - | *A* = *G* × **P**<sub>j</sub> + 0; **P**<sub>j</sub> = *A* | *A* = *G* × **Φ**<sub>j</sub> + 0; **Φ**<sub>j</sub> = *A* |
| G4 | Multiply by **G**<sub>2,2</sub> and add | - | **P**<sub>i</sub> = *G* × **P**<sub>i</sub> + *A* | **Φ**<sub>i</sub> = *G* × **Φ**<sub>i</sub> + *A* |
| **Back Substitution** | | | | |
| B1 | Multiply **Φ** row by −**P**<sub>j,i</sub> | −**P**<sub>j,i</sub> | - | *A* = *G* × **Φ**<sub>i</sub> + 0 |
| B2 | Accumulate with *j*th row | - | - | **Φ**<sub>j</sub> = 1 × **Φ**<sub>j</sub> + *A* |
| **Lattice Size Reduction** | | | | |
| R1 | *G* = round(**P**<sub>l,j</sub>) | **P**<sub>l,j</sub> | *O* = **P**<sub>l,j</sub> | *A* = *G* × **Φ**<sub>j</sub> + 0 |
| R2 | Get reduced row of **Φ** | - | - | **Φ**<sub>l</sub> = 1 × **Φ**<sub>l</sub> + *A* |
| R3 | *G* = −round(**P**<sub>l,j</sub>) | **P**<sub>x,l</sub> | **P**<sub>x,j</sub> = *G* × *I* + **P**<sub>x,j</sub> | - |

**Table 1.** CMAC and DMAC Instruction Set

The CMAC and DMAC units implement the micro-operations LC and LD, which perform the data load operations that load the channel matrix and received vector; G1 to G4, which implement the Givens rotations for not only the SDQR but also the zeroing step of the LLL algorithm; B1 and B2, which perform back substitution operations; and R1 to R3, which perform the LLL column swap step.

The DMAC units require more or less the same operations as their CMAC counterparts; however, for back substitution and lattice size reduction the implementation differs. For back substitution, the CMAC units must pass the off-diagonal elements **P***j*,*<sup>i</sup>* to the DMAC units. For lattice size reduction, the CMAC units add an integer multiple of one column of **P** to another by iteratively executing R3, which passes elements between CMAC units, where the target unit performs the multiplication and addition. On the other hand, the DMAC units are able to perform the equivalent reduction operation in the two micro-operations R1 and R2, as lattice size reduction is performed in a row-wise fashion.
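For intuition, the arithmetic that B1, B2, and CR cooperate to realize is an ordinary triangular back substitution; a plain NumPy sketch (ours, not cycle-accurate, with an assumed real-valued test matrix) follows:

```python
import numpy as np

def back_substitute(P, phi):
    """Solve P x = phi for upper-triangular P.

    Conceptually this is the B1/B2 pattern: once x[i] is known, each
    remaining row j < i accumulates -P[j, i] * x[i] into its right-hand
    side (B1 scales, B2 accumulates), and the division by the diagonal
    uses the reciprocal computed by the control unit's CR operation."""
    n = len(phi)
    x = np.zeros(n)
    rhs = phi.astype(float).copy()
    for i in range(n - 1, -1, -1):
        x[i] = rhs[i] / P[i, i]
        for j in range(i):
            rhs[j] -= P[j, i] * x[i]
    return x

# Toy check against NumPy's general solver.
P = np.triu(np.arange(1.0, 17.0).reshape(4, 4)) + 4.0 * np.eye(4)
phi = np.array([1.0, 2.0, 3.0, 4.0])
print(np.allclose(back_substitute(P, phi), np.linalg.solve(P, phi)))  # True
```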



#### **6.6. Comparisons with previously published work**

The results in this section represent the first known digital signal processing architecture for a soft-output lattice reduction aided MIMO detector. For this reason we are unable to provide a direct comparison of our architecture with previously published work. Nevertheless, it is still possible to compare our implementation with three state-of-the-art VLSI implementations of hard-output LRAD-based MIMO detectors [32], [26], [34].

For *n*<sup>T</sup> = *n*<sup>R</sup> = 4, the combination of the CMAC micro-operations leads to the system latency outlined in Table 2. This table assumes a MIMO system represented by an extended channel matrix, requiring the zeroing of 16 elements of **H**. The majority of these elements require 4 cycles, the exception being the final element of each column, which requires a 5th cycle due to the extra cycle needed to compute the Newton–Raphson based reciprocal. An overhead of 12 cycles exists to load data into the processor.

For the LLL algorithm, column swap operations require 5 cycles to perform the single Givens rotation. Size reduction requires at most 3 cycles per pass over the full matrix. As with prior works, a simple strategy is used to fix the number of iterations of the LLL algorithm which caps the number of swaps and size reduction passes to 3. This yields 24 cycles per subspace or 96 cycles for the four subspaces.


| Component | Latency |
|---|---|
| QR decomposition | 80 cycles |
| Subspace Generation | 30 cycles |
| Subspace Back-substitution | 16 cycles |
| Subspace Lattice Reduction | 96 cycles |
| **Total for SLRAD** | **222 cycles** |

**Table 2.** Latency of Channel Processor
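The Table 2 totals can be reproduced from the per-operation cycle counts stated above; the following sanity-check arithmetic (our script, using only numbers given in the text) confirms them:

```python
# Reproduce the Table 2 latency figures from the per-operation counts
# stated in the text (n_T = n_R = 4, extended channel matrix).
n_cols = 4
elements_to_zero = 16          # extended channel matrix
load_overhead = 12             # cycles to load data into the processor

# QR: 4 cycles per zeroed element, +1 cycle for the final element of
# each column (Newton-Raphson reciprocal), +12 cycles of load overhead.
qr = elements_to_zero * 4 + n_cols * 1 + load_overhead           # 80

# LLL per subspace: 3 iterations of (5-cycle column swap + 3-cycle
# size-reduction pass) = 24 cycles; four subspaces in total.
lll = 3 * (5 + 3) * 4                                            # 96

subspace_generation = 30
back_substitution = 16
total = qr + subspace_generation + back_substitution + lll
print(qr, lll, total)          # 80 96 222
```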

To provide context for the results in Table 2, we compare in Table 3 the latency of the proposed architecture with the latencies of three hard-output LRAD-based MIMO detectors for a 4-input, 4-output MIMO system employing QPSK modulation.



| | [32] | [26] | [34] | this work |
|---|---|---|---|---|
| average cycles per matrix | 420 | 130 | 14 | 222 |
| soft outputs? | No | No | No | Yes |

**Table 3.** Latency comparison between the proposed architecture and three state-of-the-art implementations

While the latency of the proposed architecture compares favourably with Barbero et al.'s solution [32], the significant performance penalty for generating soft outputs is apparent in comparison with the results of Gestner et al. [26] and especially Bruderer et al. [34]. We caution that the results in Table 3 need to be interpreted carefully, however, since it is well known that hard-output MIMO detectors such as [32], [26] and [34] do not facilitate high-performance iterative receivers involving joint detection and decoding when error-control codes such as turbo codes and LDPC codes are employed [37], [22]. The proposed approach therefore trades off increased latency for improved BER performance and the ability to readily deal with dense constellations, e.g. 64-QAM.

#### **7. Conclusion**

In this chapter we have presented the first known digital signal processing implementation of a soft-output MIMO wireless communications receiver based on lattice reduction aided detection (LRAD). Further research is needed to provide the ASIC and FPGA synthesis results needed to facilitate a comprehensive comparison with prior works providing only hard outputs.

#### **Author details**

Alan T. Murray and Steven R. Weller

School of Electrical Engineering and Computer Science, University of Newcastle, Callaghan, NSW 2308, Australia

#### **References**


[1] S.J. Johnson. *Iterative Error Correction: Turbo, Low-Density Parity-Check and Repeat-Accumulate Codes*. Cambridge University Press, 2009.

[2] D. Wübben, D. Seethaler, J. Jaldén, and G. Matz. Lattice reduction. *IEEE Signal Process. Mag.*, 28(3):70–91, May 2011. DOI: 10.1109/MSP.2010.938758.

[3] C. Windpassinger, L.H.J. Lampe, and R. Fischer. From lattice-reduction-aided detection towards maximum-likelihood detection in MIMO systems. In *Int. Conf. on Wireless and Optical Commun. (WOC'03)*, July 2003.

[4] C. Windpassinger. *Detection and Precoding for Multiple Input Multiple Output Channels*. PhD thesis, Universität Erlangen-Nürnberg, 2004.

[5] A.T. Murray and S.R. Weller. Performance and complexity of adaptive lattice reduction in fading channels. In *Proc. Australian Comms. Workshop (AusCTW'09)*, pages 17–22, Sydney, Australia, February 2009. DOI: 10.1109/AUSCTW.2009.4805593.

[6] W. Liu, K. Choi, and H. Liu. Computationally efficient lattice reduction for MIMO-OFDM systems. In *Proc. 6th IEEE Int. Conf. on Wireless and Mobile Computing, Networking and Communications (WiMob'10)*, pages 264–267, October 2010. DOI: 10.1109/WIMOB.2010.5645056.

[7] G.H. Hardy, E.M. Wright, and J.H. Silverman. *An Introduction to the Theory of Numbers*. Oxford University Press, 6th edition, 2008.

[8] A.K. Lenstra, H.W. Lenstra, and L. Lovász. Factoring polynomials with rational coefficients. *Math. Ann.*, 261(4):515–534, 1982.

[9] P. Silvola, K. Hooli, and M. Juntti. Suboptimal soft-output MAP detector with lattice reduction. *IEEE Sig. Proc. Letters*, 13(6):321–324, June 2006. DOI: 10.1109/LSP.2006.871726.

[10] Y.H. Gan and W.H. Mow. Complex lattice reduction algorithms for low-complexity MIMO detection. In *Proc. IEEE Global Telecommunications Conf. (GLOBECOM '05)*, pages 2953–2957, St. Louis, MO, 28 November–2 December 2005. DOI: 10.1109/GLOCOM.2005.1578299.

[11] W.H. Mow. Universal lattice decoding: Principles and recent advances. *Wirel. Commun. Mob. Com.*, 3(5):553–569, August 2003. DOI: 10.1002/wcm.140.

[12] F.T. Luk and S. Qiao. Conditioning properties of the LLL algorithm. In M.S. Schmalz, G.X. Ritter, J. Barrera, J.T. Astola, and F.T. Luk, editors, *Mathematics for Signal and Information Processing*, August 2009.

[13] A. Korkine and G. Zolotarev. Sur les formes quadratiques. *Math. Ann.*, 6:366–389, 1873.

[14] M. Seysen. Simultaneous reduction of a lattice basis and its reciprocal basis. *Combinatorica*, 13(3):363–376, September 1993. DOI: 10.1007/BF01202355.

[15] C. Windpassinger, L.H.J. Lampe, R. Fischer, and T. Hehn. A performance study of MIMO detectors. *IEEE Trans. Wireless Commun.*, 5(8):2004–2008, August 2006. DOI: 10.1109/TWC.2006.1687712.

[16] X. Ma and W. Zhang. Performance analysis for MIMO systems with lattice-reduction aided linear equalization. *IEEE Trans. Commun.*, 56(2):309–318, February 2008. DOI: 10.1109/TCOMM.2008.060372.

[17] X. Li and Z. Nie. Performance losses in V-BLAST due to correlation. *IEEE Antennas Wireless Propag. Lett.*, 3(1):291–294, January 2004. DOI: 10.1109/LAWP.2004.838813.

[18] B. Hassibi. An efficient square-root algorithm for BLAST. In *Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '00)*, volume 2, pages 737–740, 2000. DOI: 10.1109/ICASSP.2000.859065.

[19] M.R. McKay, I.B. Collings, and A.M. Tulino. Achievable sum rate of MIMO MMSE receivers: A general analytic framework. *IEEE Trans. Inform. Theory*, 56(1):396–410, January 2010. DOI: 10.1109/TIT.2009.2034893.

[20] H. Cohen. *A Course in Computational Algebraic Number Theory*. Springer, December 1993.

[21] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger. Closest point search in lattices. *IEEE Trans. Inform. Theory*, 48(8):2201–2214, August 2002. DOI: 10.1109/TIT.2002.800499.

[22] W. Zhang and X. Ma. Low-complexity soft-output decoding with lattice-reduction-aided detectors. *IEEE Trans. Commun.*, 58(9):2621–2629, September 2010. DOI: 10.1109/TCOMM.2010.080310.070641.

[23] A. Burg, D. Seethaler, and G. Matz. VLSI implementation of a lattice-reduction algorithm for multi-antenna broadcast precoding. In *Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS'07)*, pages 673–676, New Orleans, LA, 27–30 May 2007. DOI: 10.1109/ISCAS.2007.377898.

[24] D. Seethaler and G. Matz. Efficient vector perturbation in multi-antenna multi-user systems based on approximate integer relations. In *Proc. European Signal Proc. Conf. (EUSIPCO'06)*, pages 4–8, Florence, Italy, 4–8 September 2006.

[25] W. Zhang, X. Ma, B. Gestner, and D.V. Anderson. Designing low-complexity equalizers for wireless systems. *IEEE Comms. Mag.*, 47(1):56–62, January 2009. DOI: 10.1109/MCOM.2009.4752677.

[26] B. Gestner, W. Zhang, X. Ma, and D.V. Anderson. VLSI implementation of a lattice reduction algorithm for low-complexity equalization. In *IEEE Int. Conf. on Circuits and Systems for Communications (ICCSC'08)*, pages 643–647, Shanghai, China, 26–28 May 2008. DOI: 10.1109/ICCSC.2008.142.

[27] W. Zhang. *Wireless Receiver Designs: From Information Theory to VLSI Implementation*. PhD thesis, Georgia Institute of Technology, December 2009.

[28] W. Zhang and X. Ma. Approaching optimal performance by lattice-reduction aided soft detectors. In *Proc. 41st Annual Conf. on Information Sciences and Systems (CISS '07)*, pages 818–822, 2007. DOI: 10.1109/CISS.2007.4298422.

[29] B. Gestner, W. Zhang, X. Ma, and D.V. Anderson. VLSI implementation of an effective lattice reduction algorithm with fixed-point considerations. In *Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2009)*, pages 577–580, 2009. DOI: 10.1109/ICASSP.2009.4959649.

[30] J. Soler-Garrido, H. Vetter, M. Sandell, D. Milford, and A. Lillie. Implementation of a reduced-lattice MIMO detector for OFDM systems. In *Design, Automation & Test in Europe Conference & Exhibition (DATE '09)*, pages 1626–1631, 2009.

[31] N.-C. Wang, E. Biglieri, and K. Yao. Systolic arrays for lattice-reduction-aided MIMO detection. *J. Commun. Netw.*, 13(5):481–493, October 2011. DOI: 10.1109/JCN.2011.6112305.

[32] L.G. Barbero, D.L. Milliner, T. Ratnarajah, J.R. Barry, and C. Cowan. Rapid prototyping of Clarkson's lattice reduction for MIMO detection. In *Proc. IEEE Int. Conf. on Communications (ICC'09)*, pages 1–5, Dresden, Germany, 14–18 June 2009. DOI: 10.1109/ICC.2009.5199388.

[33] I.V.L. Clarkson. *Approximation of Linear Forms by Lattice Points with Applications to Signal Processing*. PhD thesis, Australian National University, January 1997.

[34] L. Bruderer, C. Studer, M. Wenk, D. Seethaler, and A. Burg. VLSI implementation of a low-complexity LLL lattice reduction algorithm for MIMO detection. In *Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS'10)*, pages 3745–3748, Paris, France, 30 May–2 June 2010. DOI: 10.1109/ISCAS.2010.5537742.

[35] L.M. Davis. Scaled and decoupled Cholesky and QR decompositions with application to spherical MIMO detection. In *Proc. IEEE Wireless Communications and Networking (WCNC'2003)*, volume 1, pages 326–331, New Orleans, LA, 16–20 March 2003. DOI: 10.1109/WCNC.2003.1200369.

[36] L. Zhong, M. Gang, T. Yi-Zheng, and C. Yan-Min. A simplification of the log-MAP algorithm for turbo decoding. In *Proc. IEEE Asia-Pacific Conference on Circuits and Systems*, volume 2, pages 1057–1060, 2004. DOI: 10.1109/APCCAS.2004.1413065.

[37] D.L. Milliner and J.R. Barry. A lattice-reduction-aided soft detector for multiple-input multiple-output channels. In *Proc. IEEE Global Telecommunications Conference (GLOBECOM '06)*, 2006. DOI: 10.1109/GLOCOM.2006.84.

## **Progress of Doppler Ultrasound System Design and Architecture**

Baba Tatsuro

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/51508

© 2013 Tatsuro; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **1. Introduction**

The recent evolution of electronic and semiconductor technology makes high-speed, high-quality signal-processing possible at low cost, small size, and low power consumption. Various signal-processing devices have been born, and their performance continues to develop. This article introduces the technical innovations and the effects of digital signal-processing in accordance with the generations of the Doppler ultrasound system architecture. The diagnostic image of the carotid artery produced by a Doppler ultrasound system is shown in Fig. 1. The upper image is a tomogram called color flow mapping (CFM). A Doppler range gate is set up in the center of the blood vessel in the CFM. Bloodflow information at this position is displayed as the spectrum Doppler image below it. The horizontal axis is time, and the vertical axis is the flow velocity corresponding to the Doppler shift frequency; it expresses the time-change of the velocity distribution of the bloodflow. The embedded technology of CFM and spectrum Doppler began with a composition of analog signal-processing and primitive logical operation elements, moved to accumulator devices, PAL, and various memories, and then changed to FPGA, CPLD, ASIC, DSP, and CPU/GPU [1].

**Figure 1.** Diagnostic image of Doppler ultrasound system

#### **2. Progress of Doppler signal-processing architecture**

#### **2.1. The 1st generation architecture (Fixed-point processing)**

Doppler signal-processing has developed by selecting the most suitable realization method in every generation. The architecture of the 1980s is shown in Fig. 2. Analog signal-processing (the dark-orange block in Fig. 2) occupied most of this architecture. Henceforth, this is called the 1st generation architecture. Since the fast Fourier transform (FFT) was the only digital signal-processing, an analog-digital converter (ADC) was arranged before the FFT. In those days, the conversion speed of ADCs was hundreds of kHz at 12-16 bits. Since a complex butterfly-operation was required, FFT processing was realized by accumulators (TRW: 1010J) in the first stage. After a while, a fixed point DSP (Toshiba: DSP-T9508) was used from the second half of the 1980s.

**Figure 2.** The 1st generation architecture
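For reference, the complex butterfly-operation that those early accumulator chips evaluated is just one complex multiplication and two additions; a minimal sketch (ours) of the radix-2 decimation-in-time butterfly:

```python
import cmath

def butterfly(a: complex, b: complex, w: complex):
    """Radix-2 decimation-in-time FFT butterfly: the core operation
    the 1st generation hardware accumulators computed."""
    t = w * b
    return a + t, a - t

# One 8-point butterfly with twiddle factor W_8^1 = exp(-2j*pi/8).
w = cmath.exp(-2j * cmath.pi / 8)
print(butterfly(1 + 0j, 0 + 1j, w))
```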


#### **2.2. The 2nd generation architecture (Floating point DSP and ASIC)**

The development of the full digital system started early in the 1990s. The early digital architecture is shown in Fig. 3. Henceforth, this is called the 2nd generation architecture. In the conventional analog signal-processing, an analog low-pass filter (LPF) was arranged after an analog high-pass filter (HPF). Since it was difficult to realize a high-speed digital HPF (the wall filter in Section 5) with a low cutoff, the HPF and LPF were swapped. Furthermore, the LPF with high sampling frequency was divided into two subcomponents. The first stage LPF was realized by FPGAs (Altera), and the next stage LPF and the HPF were realized by floating point DSPs (NEC: μPD77240A), respectively. FFT processing was realized by an ASIC (Toshiba). However, for Doppler audio processing (direction separation of the complex signal, etc.), the conventional analog-circuit was retained in consideration of cost-performance. Therefore, the digital filter output was converted into an analog signal again by a digital-analog converter (DAC) and fed into the analog-circuit.

**Figure 3.** The 2nd generation architecture


#### **2.3. The 3rd generation architecture (Dynamic-range expansion by ASIC)**

In the second half of the 1990s, development started on reducing the size and cost of these systems in order to merge CFM and spectrum Doppler. Henceforth, this is called the 3rd generation architecture. In order to compare generations, only the spectrum Doppler portion is shown in Fig. 4. In the 3rd generation, since the system clock went up sharply, floating point devices were hard to use. CFM and spectrum Doppler were unified and realized by five kinds of fixed point ASIC (Toshiba). In this architecture, I adopted a newly developed digital complex IIR filter for the direction separation processing, avoiding analog phase-shifters with their heavy manual adjustments [2, 3]. Furthermore, an oversampling filter and a high-speed DAC (Analog Devices) were used for the Doppler audio processing. This reduced the analog-circuit, such as high-order switched capacitor filters (SCF). Each of these large-scale ASICs reached more than twice the scale of a typical CPU (Intel Pentium processor), but the total cost of spectrum Doppler declined to 1/3. Furthermore, we were able to obtain the wide dynamic-range signal-processing which was difficult in analog processing. As a result, the sensitivity of bloodflow detection improved, and the diagnostic targets spread to the abdomen, surface blood vessels, limbs, etc.

**Figure 4.** The 3rd generation architecture


#### **2.4. The 4th generation architecture (Reduction of circuits by large scale DSP)**

In the first stage of the 2000s, I realized the whole Doppler signal-processing using only one floating point DSP (TI: TMS320C6701). The signal-processing block inside the DSP is shown in Fig. 5. Henceforth, this is called the 4th generation architecture. Since the clock frequency went up tens of times compared with the 2nd generation floating point DSP, throughput improved sharply. Moreover, changes in the ultrasound system architecture contributed to downsizing. The interrupt cycle to the DSP was changed from the ultrasonic pulse repetition frequency (PRF: 1-50 kHz) to the display frequency (Vsync: 50-75 Hz). Although real-time performance was degraded a little by forming packet processing, a drastic reduction of circuit scale was realized.

**Figure 5.** The 4th generation architecture

#### **2.5. The future architecture (Real-time analysis)**

The evolution of signal-processing has been influenced by the realization methods, such as from analog to digital, or from hardware to software. The size and cost of Doppler signal-processing were reduced day by day, and its performance (sensitivity, response, etc.) also improved. It will be possible to gain even greater calculation power for signal-processing from now on, and it will be an important theme to investigate what this power should be used for. I am now trying to apply this ability to automatic measurement and automatic diagnosis as intellectual signal-processing [4-6]. Moreover, as another possibility, mounting the Doppler signal-processing in a Windows program has also been tried, but problems such as stability and response time remain. In order to realize a real-time system which completes the processing within time, I think that the architecture based on DSP will remain.

#### **3. Considerations of digital technology**

As introduced in Section 2, the spectrum Doppler signal-processing architecture changed from analog to digital in the first half of the 1990s. In this section, a comparison of analog technology and digital technology and the innovations which digital technology brought about are introduced.

#### **3.1. Comparison of analog processing and digital processing**

The merits of digital processing compared with analog processing are shown in Table 1. Digital processing realized quality improvement (reduction of variations), performance improvement (wide dynamic-range), and size reduction. Moreover, development efficiency was improved by the separation of the analog power supply and the digital power supply, and by the reduction of noise in the analog systems. Also, in the 1990s the cost of digital-circuits improved more readily than that of analog-circuits.


| **Items** | **Analog Processing** | **Digital Processing** |
|---|---|---|
| **Variation caused by electrical parts** | Big. Limitation of dynamic-range. (Expensive adjustment for analog-circuit) | Small. Only ADC needs adjustment. Wide dynamic-range is realizable. |
| **Tolerance to noise** | Noise countermeasure must be done to every sub-block. | The isolation between digital power supply and analog power supply is needed. |
| **Kinds of power** | At least 3 kinds. | Only the digital power except for ADC and DAC. |
| **Throughput** | Real time performance is good. | Digital processing realizes the high-speed and complex processing. |
| **Cost performance** | Before the 1990s, analog processing had high C/P, but recently, it is low C/P compared with the digital processing. | Cost, size and power consumption are improving substantially every year. |

**Table 1.** Comparison of analog processing and digital processing

#### **3.2. Time-spatial resolutions and S/N ratio**

With the development of digital technology, the sampling frequency and pixel count in the digital camera and digital audio fields are increasing every year. When sampling frequency and pixel count increase, finer sampling becomes possible in time and space. Spatial resolution and time resolution have improved recently, and high-definition images and high-fidelity audio can now be enjoyed. Moreover, product performance that exceeds human vision and hearing has also been achieved. From the viewpoint of manufacturing cost, if the demanded target performance is met, the present performance level might seem sufficient; yet products which exceed this performance keep appearing in the market one after another. I consider the reason as follows. The present product level does not fill the dynamic-range of human vision and hearing. Since human senses perceive physical quantities logarithmically, a wide dynamic-range is required according to the surrounding environment. That is, I think that commercial digital camera and digital audio products have not yet reached the demanded dynamic-range of luminosity or sound pressure [7-9].


Aside from improvements in spatial or time resolution, another merit of digitization is a high S/N ratio. A digital signal is sampled on a spatial axis and/or a time axis. Ensemble mean processing can remove noise which adjoins in space or time, so it can extract a low-frequency component with sufficient accuracy. An ensemble mean model, the signal level, the noise level, and the expansion of the S/N ratio are shown in Fig. 6. When the ensemble number is set to *N*, the signal increases *N* times while the noise increases only √*N* times, so the S/N ratio is expanded √*N* times [10, 11]. Bandwidth restriction filters (HPF, LPF, BPF, etc.) have the same effect as ensemble mean processing. In digital processing designs, we should be careful about the internal dynamic-range and the output noise level. In the analog systems, since the dynamic-range was narrow, a wide dynamic-range signal was not processed faithfully. Artifacts caused by saturation or quantization occurred in the intermediate processes, so sufficient sensitivity of bloodflow detection was not obtained. After the 3rd generation architecture, as wide dynamic-range signal-processing was realized, a big improvement in bloodflow diagnosis was brought about.

**Figure 6.** Principle of ensemble mean processing and S/N ratio expansion
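A quick numerical illustration of this √*N* law (our sketch with synthetic white noise, not the system's processing chain):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64                                  # ensemble size
t = np.linspace(0.0, 1.0, 1000)
signal = np.sin(2 * np.pi * 5 * t)      # low-frequency component to recover

# N noisy realizations of the same signal (white noise, sigma = 1).
ensemble = signal + rng.standard_normal((N, t.size))
averaged = ensemble.mean(axis=0)

noise_in = np.std(ensemble[0] - signal)
noise_out = np.std(averaged - signal)
print(noise_in / noise_out, np.sqrt(N))  # the ratio approaches sqrt(N)
```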

#### **3.3. Hardware reduction by over-sampling**

The design concept of the compact disc (CD), which started in the 1980s, was that the stereo digital signal (44.1 kHz sampling) was converted into stereo analog signals (maintaining a 20 kHz bandwidth). For this purpose, a steep analog filter (at least 7th-order) which rejects the harmonics above the DA conversion output (22 kHz) was required. Several years later, a 4-times over-sampling system (about 170 kHz) appeared. It was realized by a digital moving average filter (sampling frequency: about 170 kHz, cutoff frequency: 20 kHz, loose cutoff property), a 4-times DA conversion, and a simple analog filter (cutoff frequency: 20 kHz, about 2nd-order cutoff property). Since this system had many merits (reduction of cost and size, improvement of the S/N ratio, etc.), it became mainstream [12].

The third generation architecture of the Doppler ultrasound system was designed based on this over-sampling concept. In Doppler audio processing, unlike CD, the sampling frequency changes widely (1 kHz to 50 kHz). Therefore, the cutoff frequency of the digital filter before DA conversion had to be variable. The same range of sampling frequency change applies not only to the Doppler audio processing but also to the digital filters of the 2nd generation. The effect of the over-sampling processing is shown in Fig. 7. Fig. 7(a) shows the sampling characteristic at the sampling frequency *fs*; harmonics (a side lobe of -14 dB) are mixed in because of simple over-sampling (a hold characteristic). In order to remove these harmonics and keep the required dynamic-range in the required bandwidth, a filter with a suitable bandwidth property (broken line) is required. In the case of Fig. 7(b), since *fs/BW* is small, a high-order filter is required. But in the case of Fig. 7(c), since *fs/BW* is large, a low-order filter is fully sufficient. The sampling frequency of high-speed digital devices continues to rise. Since *fs/BW* is expanded, both downsizing and high performance are realized simultaneously [13].

**Figure 7.** Over-sampling processing (a) Over-sampling model (b) Small *fs/BW* case (c) Large *fs/BW* case
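The benefit of a large *fs/BW* can be demonstrated numerically; the sketch below (ours, with assumed rates) holds each sample 4 times (the hold characteristic of Fig. 7(a)) and then applies a short moving-average filter, measuring the spectral image of a test tone before and after:

```python
import numpy as np

fs, L = 8000, 4                        # base rate and over-sampling factor
t = np.arange(0, 0.02, 1 / fs)
x = np.sin(2 * np.pi * 440 * t)        # baseband tone

# Zero-order hold (the 'hold characteristic' of Fig. 7a): repeat samples.
held = np.repeat(x, L)                 # rate L*fs; spectral images remain

# Short moving-average (the loose digital filter): knocks down the images
# so only a low-order analog filter is needed afterwards.
kernel = np.ones(L) / L
smoothed = np.convolve(held, kernel, mode="same")

def image_level(sig, rate):
    """Spectral magnitude near the first image of the 440 Hz tone."""
    spec = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(sig.size, 1 / rate)
    return spec[np.argmin(np.abs(freqs - (fs - 440)))]

print(image_level(held, L * fs), image_level(smoothed, L * fs))
```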


#### **4. Cascade digital filter design**

Design tools for various digital filters, such as LPFs, HPFs, and BPFs, are popular and widespread. However, in order to realize large-scale signal-processing combining various filters which have different sampling frequencies, there are many points that should be taken into consideration, such as aliasing and artifacts. In this section, two design examples of cascade digital filters are introduced. One is a cascade BPF for continuous wave Doppler processing that has down-sampling processing with a rate ratio of 10<sup>−4</sup> or less. The other is a cascade LPF for Doppler audio processing that has over-sampling processing with a rate ratio of 10<sup>3</sup> or more.


#### **4.1. Continuous wave Doppler signal-processing**

In Doppler signal-processing, an HPF is effective for extracting weak bloodflow. A clutter signal mixes in unnecessary reflective components from the blood vessel wall, etc. By removing it at the entrance of the frequency-analysis, the dynamic-range of the FFT processing and the audio processing can be held down, and the signal-processing load can also be reduced. A high-order HPF array (bandwidth: several kHz, cutoff frequency: several hundred to several thousand Hz) had been used in the analog system. To digitize this, the 2nd generation architecture was developed. The filter array for continuous-wave Doppler signal-processing is shown in Fig. 8.

**Figure 8.** Digital filter design of continuous-wave Doppler signal-processing

In the conventional system, after an analog HPF, a bandwidth restriction was applied by the anti-alias LPF before the ADC. Realizing a digital HPF with high order and high sampling frequency was difficult in those days. For example, for a high-order IIR filter (several MHz, 2-channel processing), hundreds of Mflops of performance were required. Therefore, in the digitization the LPF was arranged before the HPF, and a high-speed, low-cutoff LPF was required: for example, a relative cutoff of 1/1000, meaning several kHz of bandwidth restriction at tens of MHz sampling. In order to prevent expansion of the tap-length (number of delay registers) and the bit-length of internal registers (inner dynamic-range), I chose a system which divided the LPF into two steps and applied down-sampling. Since the output after quadrature detection was high-speed at tens of MHz (*f1* Hz in Fig. 8), a delta-sigma LPF with cutoff frequency *N1\*fr* was adopted in the front part of the LPF and realized by FPGA.

The latter part of the LPF ran at the re-sample frequency *M1\*fr* and was realized by a single-precision floating point DSP (low-order LPF, cutoff frequency: *fr*). An HPF was arranged after these two steps of LPFs. The HPF carried out scaling-processing on the LPF output at the re-sampling frequency *M2\*fr*. High-order, wide-range cutoff HPF processing (cutoff frequency: *fr*/2 to *fr*/200) was realized by a double-precision floating point DSP. By this architecture, the required bandwidth restrictions and dynamic-range could be realized even in continuous-wave Doppler processing, which suffers heavy mixing of clutter artifacts. The frequency characteristic of the cascade digital filter (Fig. 8) is shown in Fig. 9. A chirp waveform (0 to 40 kHz) was actually inputted into the cascade digital filter, and its performance was checked on the spectrum Doppler image of the trial product. We can confirm the loose bandwidth restriction near ±6 kHz (*fr*) in this figure.

**Figure 9.** Frequency characteristics of continuous-wave Doppler filter of the 2nd generation architecture
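As a rough illustration of this cascade, the sketch below chains two FIR decimators (standing in for the delta-sigma LPF and the DSP-side LPF) and a steep Butterworth wall HPF, then drives the chain with a chirp as in the Fig. 9 check. All rates, decimation factors, and filter orders are assumptions chosen so the example runs on a PC; they are not the product's actual parameters.

```python
import numpy as np
from scipy import signal

# Assumed rates: f1 ~ tens of MHz after quadrature detection, fr ~ 6 kHz.
f1, fr = 20e6, 6e3
t = np.arange(int(0.01 * f1)) / f1
x = signal.chirp(t, f0=0.0, t1=0.01, f1=40e3)      # 0-40 kHz test chirp

def fir_decimate(x, fs, cutoff_hz, factor, taps=101):
    """Low-pass FIR, then keep every `factor`-th sample."""
    h = signal.firwin(taps, cutoff_hz, fs=fs)
    return signal.lfilter(h, 1.0, x)[::factor], fs / factor

# Step 1: coarse band limit and large down-sampling (the FPGA stage of Fig. 8).
y1, fs1 = fir_decimate(x, f1, cutoff_hz=8 * fr, factor=250)    # -> 80 kHz
# Step 2: tighter band limit down to a DSP-friendly rate.
y2, fs2 = fir_decimate(y1, fs1, cutoff_hz=1.5 * fr, factor=4)  # -> 20 kHz

# Wall HPF after the two LPF steps (cutoff fr/2 ... fr/200 in the text).
sos = signal.butter(8, fr / 200, btype="highpass", fs=fs2, output="sos")
y3 = signal.sosfilt(sos, y2)
print(fs1, fs2, len(y3))   # overall decimation 1000:1, echoing the 1/1000 cutoff
```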

#### **4.2. Doppler audio signal-processing**

After quadrature detection, the Doppler audio system divides the IQ signal into a forward component and a reverse component with a direction separation filter and outputs them to the left and right stereo speakers. In pulse wave Doppler, the sampling frequency of about 4 kHz is locked to *fr* (the pulse repetition frequency), so the bloodflow signal is hard to hear as it is, because of mixed-in harmonics. Removing them requires a steep filter with a cutoff of *fr*/2, designed with the audio bandwidth (20 Hz to 20 kHz) in mind, to reject the unnecessary harmonics. The conventional analog filter architecture is shown in Fig. 10. In Fig. 10, LPF1 provided the steep cutoff characteristic with an SCF (switched-capacitor filter) after direction separation, and LPF2 removed the harmonics generated by LPF1 (SCF noise). The S/N ratio of the SCF was 50 dB or less in the audio range, and the sound quality was quite poor compared with the present system.

**Figure 10.** Analog Doppler audio system of the 2nd generation architecture


The Doppler audio filter of the 3rd generation is shown in Fig. 11. The direction-separated signals (complex BPF outputs) were oversampled by LPF1, whose sampling frequency (*M3\*fs*) was hundreds of times larger than *fr*; LPF1 applied a loose bandwidth restriction with a moving average. In the following stage, LPF2, the signal was rescaled to *f2* (the same as the ADC clock frequency), and LPF2 applied a bandwidth restriction with an IIR filter (cutoff frequency: *fr*/2). After D/A conversion, a loose bandwidth restriction was applied once more by LPF3 to remove remaining harmonics. This cascade filter processing realized high-quality Doppler audio (S/N ratio: more than 90 dB). Moreover, because the oversampling frequency of LPF1 and the conversion frequency were set high, simple (lower-order) filters sufficed, and a drastic hardware reduction of LPF2 and LPF3 was realized [14].

**Figure 11.** Digital Doppler audio system of the 3rd generation architecture
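The digital half of this chain can be sketched as zero-stuffing interpolation, a moving-average LPF1, and an IIR LPF2 with cutoff *fr*/2. The oversampling factor and the test tone below are illustrative assumptions; the product used factors in the hundreds.

```python
import numpy as np
from scipy import signal

fr = 4e3                     # pulse repetition frequency, ~4 kHz in the text
M = 16                       # assumed oversampling factor (product: hundreds)
fs = M * fr

n = np.arange(2048)
x = np.cos(2 * np.pi * 1e3 * n / fr)       # 1 kHz audio tone sampled at fr

up = np.zeros(len(x) * M)
up[::M] = x * M                            # zero-stuff; the scale keeps amplitude
y1 = signal.lfilter(np.ones(M) / M, 1.0, up)         # LPF1: moving average
sos = signal.butter(4, fr / 2, fs=fs, output="sos")  # LPF2: IIR, cutoff fr/2
y2 = signal.sosfilt(sos, y1)               # smooth audio at M*fr, ready for the DAC
```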


#### **5. High-precision digital filter design**

The Doppler ultrasound diagnostic method spread to many diagnostic fields, such as cardiac and abdominal imaging. At the same time, improved bloodflow detection (sensitivity and velocity-range) had long been desired. Since this requires separating the weak bloodflow signal from high-power artifacts such as the blood-vessel wall, a steep HPF (called a wall filter) had been arranged before the frequency analysis. With digitization, I investigated new wall filter designs.

#### **5.1. Purpose of wall filter**


The locations of the wall filters in a Doppler ultrasound system are shown in Fig. 12. The quadrature detection outputs (IQ signals) are divided between CFM processing and spectrum Doppler processing. In both paths, wall filters are arranged before the respective frequency analyses to save dynamic-range [15].

**Figure 12.** Wall filter arrangements of Doppler ultrasound system

Beam scanning methods and their corresponding display modes are shown in Fig. 13. Fig. 13(a) shows the scanning method of a B-mode echo image (tomogram): a beam is swept from right to left. Fig. 13(b) shows the scanning method of spectrum Doppler, in which the same beam within the tomogram is scanned continuously. Fig. 13(c) shows the scanning method of CFM: a beam is swept from right to left as in Fig. 13(a), but each beam is scanned two or more times. The sampling methods of Fig. 13(b) and Fig. 13(c) are shown in Fig. 14(a) and Fig. 14(b). Beam data are sampled at *fr* and include information along the depth direction. As shown in Fig. 14(a), spectrum Doppler yields a long time-series signal, so a detailed frequency analysis can be realized. On the other hand, as shown in Fig. 14(b), CFM has multiple data series (hundreds of points) along the depth direction. Because CFM processing consists of a finite wall filter and complex autocorrelation processing, the analysis data length of CFM equals the number of sampled beams (5 in the case of Fig. 14(b)) and is very small compared with that of spectrum Doppler.

**Figure 13.** Ultrasound beam scan mode (a) B-mode (b) Spectrum Doppler (c) CFM

**Figure 14.** Sampling methods (a) Spectrum Doppler (b) CFM
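For the "complex autocorrelation processing" mentioned above, the standard lag-one (Kasai-type) estimator recovers the mean Doppler frequency from a short CFM packet. The packet length of 5 follows Fig. 14(b); the Doppler shift and noise level are made-up test values.

```python
import numpy as np

fr = 4e3                                  # pulse repetition frequency
fd = 500.0                                # true Doppler shift to recover
n = np.arange(5)                          # packet of 5 beams, as in Fig. 14(b)
x = np.exp(2j * np.pi * fd * n / fr)      # IQ samples at one depth
x = x + 0.05 * (np.random.randn(5) + 1j * np.random.randn(5))

# Phase of the lag-1 complex autocorrelation gives the mean Doppler frequency.
r1 = np.sum(x[1:] * np.conj(x[:-1]))
fd_hat = fr * np.angle(r1) / (2 * np.pi)
print(fd_hat)                             # close to 500 Hz
```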

Clutter is the high-power, low-frequency component of the quadrature detection output. Compared with the bloodflow signal, it is stronger by about 20 dB in the abdomen and by more than 40-60 dB in the heart. The quadrature detection outputs and wall filter outputs collected in the heart (left ventricular outflow) are shown in Fig. 15. The horizontal axes of Fig. 15(a) and Fig. 15(b) are time, and the vertical axes are amplitude; the horizontal axis of Fig. 15(c) is frequency and its vertical axis is power. The wall filter has a 4th-order Butterworth characteristic with a 200 Hz cutoff frequency. The power spectra in Fig. 15(c) show that a large clutter component (about 20 dB stronger than the bloodflow signal) is removed. The wall filter is therefore required to have a high-order (steep) and low-cutoff characteristic [16].

**Figure 15.** Effect of wall filter (a) Input signals of wall filter (b) Output signals of wall filter (c) Spectra of (a) and (b)


#### **5.2. Wall filter of CFM**

A high-order analog filter had long been used for the wall filter of spectrum Doppler, and a finite digital filter for the wall filter of CFM. The step responses of an infinite impulse response (IIR) Butterworth filter for varying cutoff and order are shown in Fig. 16. Fig. 16(a) shows the responses when the order is changed at a relative cutoff frequency of 1/16 (normalized by the sampling frequency), and Fig. 16(b) shows the responses at a relative cutoff frequency of 1/128. The transient response becomes long at low cutoff and high order (steepness). The relation between cutoff and order when the transient-response threshold is set to -20 dB (10% of the step input amplitude) is shown in Fig. 16(c). Since the performance of a wall filter operating on a finite input is insufficient, a technique for reducing the transient response was required. The wall filter systems of CFM are shown in Fig. 17. In the finite impulse response (FIR) system of Fig. 17(a), sufficient performance cannot be obtained if the number of delay registers N is small, so the IIR system of Fig. 17(b) became mainstream. However, unlike the wall filter of spectrum Doppler, the transient response of an IIR filter seriously degrades the frequency analysis. To solve this problem, an adaptive filter constituted by the time-variant FIR filter shown in Fig. 17(c) has appeared recently.


**Figure 16.** Step responses of HPF (a) Responses @ relative cutoff: 1/16 (b) Responses @ relative cutoff: 1/128 (c) Cutoff and transient response

**Figure 17.** Wall filter systems of CFM (a) FIR filter (b) Biquad filter (c) Time-variant FIR filter
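The Fig. 16 experiment can be reproduced in outline: step-excite Butterworth HPFs of several orders and relative cutoffs and count how long the transient stays above -20 dB (10%) of the step amplitude. The filter family and threshold follow the text; lengths are arbitrary.

```python
import numpy as np
from scipy import signal

def transient_samples(order, rel_cutoff, n=8192, thresh=0.1):
    """Samples until the HPF step response stays below 10% of the step."""
    # scipy normalizes Wn to Nyquist; rel_cutoff here is relative to fs.
    sos = signal.butter(order, 2 * rel_cutoff, btype="highpass", output="sos")
    y = signal.sosfilt(sos, np.ones(n))            # step input
    above = np.nonzero(np.abs(y) > thresh)[0]
    return int(above[-1]) + 1 if above.size else 0

for rel in (1 / 16, 1 / 128):
    print(rel, [transient_samples(k, rel) for k in (2, 4, 8)])
# The transient grows with order and with lower cutoff, as in Fig. 16(c).
```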

Since the FIR system of Fig. 17(a) has no feedback, saturation does not occur easily and the internal dynamic-range can be kept small. However, a low cutoff requires many taps, which is disadvantageous for both response and size. The biquad filter system of Fig. 17(b) has a good response, but a large internal dynamic-range is required at low cutoff, causing problems with quantizing noise and transient response. Compared with these, the time-variant FIR filter system of Fig. 17(c) has many advantages, although its time delay is one packet. The filter response is calculated from packet data using matrices, and with the progress of signal-processing devices in recent years, developing an adaptive filter based on the time-variant FIR system has also become easy. A time-variant FIR filter with input *x(n)*, output *y(n)*, and state variable *v(n)*, as shown in Fig. 17(c), consists of the state equation and output equation in equation (1).

$$\begin{aligned} \mathbf{v}(n+1) &= F\,\mathbf{v}(n) + \mathbf{q}\,x(n) \\ y(n) &= \mathbf{g}^{T}\,\mathbf{v}(n) + d\,x(n) \end{aligned} \tag{1}$$


The coefficients of an IIR filter (Fig. 17(b)) are transformed into the matrices and vectors *F*, *q*, *g*, and *d* of equation (2). Signal-processing equivalent to the IIR system can thus be realized by a time-variant FIR system.

$$F = \begin{bmatrix} 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ & & \dots & & \\ 0 & 0 & 0 & \dots & 1 \\ -a_K & -a_{K-1} & -a_{K-2} & \dots & -a_1 \end{bmatrix} \qquad \mathbf{q} = \begin{bmatrix} 0 \\ 0 \\ \dots \\ 0 \\ 1 \end{bmatrix} \tag{2}$$

$$\mathbf{g} = \begin{bmatrix} b_K - b_0 a_K \\ b_{K-1} - b_0 a_{K-1} \\ \dots \\ b_1 - b_0 a_1 \end{bmatrix} \qquad d = b_0$$

Thus, since the time-variant FIR system of Fig. 17(c) solves both the internal dynamic-range problem and the transient response problem, it is expected to develop into the new wall filter of CFM.
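A direct reading of equations (1) and (2) in code: build *F*, *q*, *g*, *d* from IIR coefficients (normalized so a₀ = 1) and check that the state-space recursion matches an ordinary IIR run. This is only a sketch of the mathematics, not the product's adaptive wall filter.

```python
import numpy as np
from scipy import signal

b, a = signal.butter(2, 0.1, btype="highpass")     # any IIR section will do
K = len(a) - 1

F = np.zeros((K, K))
F[:-1, 1:] = np.eye(K - 1)          # shift rows of the state vector
F[-1, :] = -a[:0:-1]                # last row: [-a_K ... -a_1], as in eq. (2)
q = np.zeros(K); q[-1] = 1.0
g = b[:0:-1] - b[0] * a[:0:-1]      # g = [b_K - b_0*a_K, ..., b_1 - b_0*a_1]
d = b[0]

x = np.random.randn(64)
v = np.zeros(K)
y = np.empty_like(x)
for i, xn in enumerate(x):
    y[i] = g @ v + d * xn           # output equation of eq. (1)
    v = F @ v + q * xn              # state equation of eq. (1)

print(np.allclose(y, signal.lfilter(b, a, x)))     # True: IIR equivalence
```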

#### **5.3. Wall filter of spectrum Doppler**

The step responses of an 8th-order Butterworth filter (four cascaded biquads, relative cutoff 1/128) are shown in Fig. 18. Fig. 18(a) shows the step input *x(n)* and the output *y(n)*; being an HPF, the output approaches DC with a damped oscillation. The responses of the inner registers of each stage (*Z1(n)*, *Z3(n)*, *Z5(n)* and *Z7(n)* in Fig. 20(b)) are shown in Fig. 18(b). While *Z3(n)*, *Z5(n)* and *Z7(n)* converge on DC at roughly tens of times the input amplitude, *Z1(n)* holds about 400 times the input amplitude. Thus, for the HPF to avoid saturation and keep internal accuracy, a wide dynamic-range is required of the internal registers. The relation between cutoff and dynamic-range (the bit-length of the inner registers) is shown in Fig. 19. To realize a high-precision digital filter, the accuracy of the operation registers and of the filter coefficients is important. I checked the minimum bit-length not influenced by quantizing noise by simulating the responses of fixed-point 8th-order Butterworth filters: when quantizing noise mixes in, unstable oscillations such as limit cycles occur. Changing the cutoff frequency, I measured the limit of stability. It turned out that realizing a low cutoff requires sufficient mantissa-length in the operation registers and sufficient coefficient length in the multipliers. In fact, since the relative cutoff frequency is about 1/200 in spectrum Doppler processing, a huge internal dynamic-range (about 200 dB) was required.


**Figure 18.** Step response of HPF and transient response (a) Input *x(n)* and output *y(n)* (b) Responses of internal registers

**Figure 19.** Dynamic-range (bit-length) and cutoff frequency of HPF

A signal-processing architecture that reduces the internal dynamic-range of the digital filter was developed for the 2nd generation architecture, and at the same time an algorithm that reduces the computation in a real-time system was investigated. The candidate systems for realizing the 8th-order digital filter are shown in Fig. 20. Fig. 20(a) shows a loop system that keeps the internal dynamic-range small with four delay-registers. Fig. 20(b) shows a loop biquad filter system with two delay-registers; the upper *Z⁻¹* corresponds to *Z1(n)*, *Z3(n)*, *Z5(n)* and *Z7(n)* of Fig. 18(b). Fig. 20(c) shows a system with eight delay-registers in series: the calculation cycle becomes short, but the dynamic-range of the internal registers becomes large.


**Figure 20.** Wall filter systems of spectrum Doppler (a) System 1: IIR+FIR loop system (b) System 2: Biquad loop system (c) System 3: Direct IIR system

**Figure 21.** Benchmark based on floating point DSP (μPD77240)

Fig. 21 shows the evaluation of the above systems. As the benchmark platform, an NEC μPD77240 floating-point DSP was used, and an 8th-order Butterworth HPF with a relative cutoff of 1/256 was chosen as the benchmark processing. In system 3 (equivalent to Fig. 20(c)), although both the number of double-precision registers and the operation cycles were small, the internal bit-length became large: since the inner bit-length exceeded 50 bits, even double-precision floating-point arithmetic (mantissa-length: 48 bits) ran short of accuracy, making the system difficult to realize. In system 1 (equivalent to Fig. 20(a)), the internal bit-length was small but many operation registers were required; the necessary low-speed external memory accesses increased its operation cycle. In system 2 (equivalent to Fig. 20(b)), the internal bit-length did not exceed the range of double-precision floating-point arithmetic, and the operation cycle was comparatively small. Accordingly, system 2 was judged the best system for implementation [17].
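The same conclusion can be seen qualitatively in floating point with scipy: an 8th-order Butterworth HPF at relative cutoff 1/256 realized as a single direct-form section (system 3) is numerically fragile, while the same filter as cascaded biquads (system 2) stays well behaved. This is an analogy to the fixed-point register-length argument, not the original benchmark.

```python
import numpy as np
from scipy import signal

wn = 2 / 256        # cutoff normalized to Nyquist (relative to fs: 1/256)
b, a = signal.butter(8, wn, btype="highpass")               # one direct section
sos = signal.butter(8, wn, btype="highpass", output="sos")  # four biquads

# Pole radius of the single section after coefficient rounding; values
# at or above 1.0 mean the rounded direct form is unstable.
print(np.max(np.abs(np.roots(a))))

x = np.random.randn(20000)
print(np.max(np.abs(signal.sosfilt(sos, x))))   # biquad cascade: bounded output
```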

#### **6. Wide dynamic-range system design**


The dynamic-range of the conventional system was insufficient for some clinical applications. In recent years, diagnostic ultrasound systems have been improved through high-frequency electronics and integrated circuits, and new diagnostic methods became practical with higher dynamic-range systems. However, a higher dynamic-range system also means more complicated gain control. The gain can be optimized automatically using the ultrasound system parameters; this technique substantially reduces both the hardware size and the gain-control range.

#### **6.1. Signal-processing of ultrasound system**

Conventional ultrasound signal-processing is shown in Fig. 22. The transceiver processor (Tx/Rx Proc.) receives signals from the probe elements, and the received signals are amplified by the preamplifier. Gain compensation is applied to correct for range-distance attenuation (STC: sensitivity time control), along with an analog gain correction for probe characteristics (frequency, sensitivity, etc.); the signal is then sent to an ADC. After AD conversion, the digital beam former (DBF) applies a delay pattern to the data to focus it and produce beam data. These data are processed by the B-Mode Image Proc. and Doppler Image Proc. blocks and displayed as a tomogram image and/or a spectrum Doppler image by the display processor (Display). The S/N ratio increases in the DBF as the number of channels (corresponding to transducer elements) over which delay calculations are summed increases [18, 19]. In Doppler signal-processing, quadrature detection (Mixer) is applied to the DBF output, and a BPF provides band-limitation and clutter rejection, yielding a base-band Doppler signal. At this point the S/N ratio rises sharply because of the band-limitation of the BPF. In the Doppler signal-processing I also apply range-gate integration (RG) across the range direction of the ROI, which further increases the S/N ratio. In the case of continuous wave Doppler the dynamic-range is even larger, and the HPF applied to these data must be more sophisticated; the dynamic-range of the signal leaving the HPF is also much larger than in the pulse wave case, on the order of 100 dB. After the FFT, the S/N ratio is greatly increased by the butterfly integration. The dynamic-range of the signal is now very large, and considerable gain adjustment and display compression must be performed to display the data.

**Figure 22.** Ultrasound signal-processing system



#### **6.2. Purpose of gain adjustment**

Gain adjustment corrects for the diagnostic target, the bloodflow sensitivity, and differences in user skill. In addition, it compensates for variation in other equipment parameters, such as the number of summing channels in the DBF, the apodization function, the bandwidth of the BPF, the integration length of the range-gate (RG), the FFT size, the window function, and the number of shift-additions of the power spectrum according to sweep speed. The maximum signal level and the noise level change as these equipment parameters change. To realize highly sensitive Doppler bloodflow diagnosis without saturation, a system with wide dynamic-range must perform gain compensation according to all these parameters. Table 2 gives a rough estimate of the dynamic-range and S/N ratio based on a virtual system [20].

It is also necessary to consider how the dynamic-range increases through the Doppler signal-processing chain, and that mirror-effect and/or quantization artifacts can be introduced when performing automatic gain compensation. Although the beam data leaving the DBF has a frequency on the order of 10-100 MHz, it is re-sampled at about 1-100 kHz, so the input dynamic-range of the FFT is increased by the band-limitation effect. Moreover, the S/N ratio of the FFT output is increased in a manner similar to ensemble-mean processing.

The model of signal-processing accompanied by expansion of the S/N ratio is shown in Fig. 23(a). The noise level and the maximum signal level of the incoming signal are both expanded by signal-processing, but by different amounts, so the overall S/N ratio increases. Under optimal gain adjustment (the range shown in light green in Fig. 23(a)), the maximum signal level does not saturate and neither quantizing noise nor signal loss appears in the output. When the gain adjustment is unsuitable (the range shown in light pink in Fig. 23(a)), mirror-effect or quantization artifacts occur on the spectrum image due to saturation or signal omission, and when the quantization accuracy is inadequate, quantizing noise mixes into the signal. A gain adjustment is therefore required that keeps a sufficient quantizing margin while still detecting weak Doppler signals near the system noise. The influence of quantization is shown in Fig. 23(b): a spectrum for a sinusoidal input (0.02\**fs*) with added white noise. The horizontal axis is time and the vertical axis is frequency normalized by the sampling frequency *fs*. The number of quantizing levels across the input range was changed every 2 seconds through 3, 5, 9, and 17. Harmonic components caused by quantization (-20 to -30 dB) appear near the frequencies -0.3\**fs*, -0.2\**fs*, and +0.25\**fs*. The mirror effect is an imaginary image generated symmetrically to a real image on both sides of the baseline. In an analog system it is mainly caused by the phase error of quadrature detection or by a small gain difference between the I and Q signals. A digital system does not suffer from these influences, but a mirror effect is generated by saturation. As shown in Fig. 23(c), a symmetrical mirror image appears on both sides of 0 Hz in the spectrum image. Fig. 23(c) is the spectrum image obtained by raising the gain 6 dB at a time every 2 seconds for a sinusoidal input with white noise; the FFT input dynamic-range is 16 bits, the horizontal axis is time, and the vertical axis is frequency normalized by the sampling frequency *fs*. The mirror component (-0.2\**fs*) is produced by saturation of the original signal component (+0.2\**fs*). In a conventional ultrasound design, both the mirror artifact and the quantization artifact are thus caused by insufficient system dynamic-range in the Doppler signal-processing.
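The Fig. 23(b) experiment can be imitated with a few lines: quantize a noisy sinusoid at 0.02\**fs* to 3, 5, 9, and 17 levels and compare the strongest spurious spectral line with the signal line. Only the level counts and tone frequency come from the text; the rest is an arbitrary test setup.

```python
import numpy as np

N = 4096
n = np.arange(N)
x = np.cos(2 * np.pi * 0.02 * n) + 0.01 * np.random.randn(N)

def quantize(x, levels):
    """Uniform quantizer with `levels` steps across roughly [-1.5, 1.5]."""
    step = 3.0 / (levels - 1)
    return np.round(x / step) * step

k0 = round(0.02 * N)                       # fundamental bin
for levels in (3, 5, 9, 17):
    spec = np.abs(np.fft.rfft(quantize(x, levels) * np.hanning(N)))
    mask = np.ones(spec.size, bool)
    mask[k0 - 4:k0 + 5] = False            # exclude the signal line itself
    print(levels, 20 * np.log10(spec[mask].max() / spec[k0]))
# Coarse quantization leaves artifact lines only ~20-30 dB below the signal.
```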


**Figure 23.** Artifacts caused by inadequate gain control (a) Problems of inadequate gain control (b) Artifacts caused by quantization (c) Artifacts caused by mirror effect.

#### **6.3. Wide dynamic-range design and its optimization**

Table 2 shows the dynamic-range increment and the gain-control range increment of the conventional system and the new system, based on the signal-processing block diagram of Fig. 22. The DBF has the beam-summing effect of *N1* channels. The Mixer/BPF has the band-limitation effect of *f1*/*f2*. The RG has the integration effect of *N2* taps. The FFT has the integration effect (weighted by the window and operator) of *N3* points. The PSD/Pre-Compression block merely changes dimension (amplitude into power) using the square-root. The MA has the moving-average effect of *N4* columns according to the sweep speed of the spectrum display. Fig. 24 shows the gain charts of the conventional system and the new system based on Table 2; the gain chart of the conventional system, which does not consider the hardware realization scale, is shown in Fig. 24(a). I developed a system that reduces both the gain-control range and the internal dynamic-range: automatic gain compensation matched to the change-range of the system parameters is performed in every sub-block of Doppler signal-processing whose S/N ratio improves. Since the ranges of the system parameters are known, the improvement of the S/N ratio, the maximum signal level, and the noise level are all calculable, so the internal dynamic-range and the gain-adjustment range can be designed optimally for every sub-block. By connecting these partially optimal sub-blocks in series, the internal dynamic-range of the system is reduced, and both the system size and the total gain-control range can be sharply reduced. The internal S/N ratio increases by √*N*. Supposing an input signal dynamic-range of DRin [dB], a range expansion of 20\*log(√*N*) [dB] occurs. The internal dynamic-range DRproc [dB], which adds at least the margin 20\*log(√12) against quantizing noise, is then roughly calculable using equation (3).

| Module | Cause | Effect of D.R. increment (*1) | Conventional system, Fig. 24(a) (dB) | New system, Fig. 24(b) (dB) |
|---|---|---|---|---|
| ADC output (Analog Gain) | | 50 dB | DR1 @ *f1* | DR1 @ *f1* |
| DBF | BeamSum effect (*N1* channels) | +50 dB | DR2 = DR1 + 20log(*N1*) | DR2_opt = DR1 + 20log(√*N1*) |
| Mixer/BPF | Band-limitation effect (*f1*/*f2*) | +30 dB | DR3 = DR1 + DR2 + 20log(*f1*/*f2*) | DR3_opt = DR1 + DR2_opt + 20log(√(*f1*/*f2*)) |
| RG | RG integration effect (*N2* taps) | +10 dB | DR4 = DR1 + DR2 + DR3 + 20log(*N2*) | DR4_opt = DR1 + DR2_opt + DR3_opt + 20log(√*N2*) |
| FFT | FFT number and window (*N3* samples) | +40 dB | DR5 = DR1 + DR2 + DR3 + DR4 + 20log(*N3*) | DR5_opt = DR1 + DR2_opt + DR3_opt + DR4_opt + 20log(√*N3*) |
| PSD/Pre-Compres. | Power to amplitude | - | DR5 | DR5_opt |
| MA | Moving-average effect (*N4* averages) | +50 dB | DR6 = DR1 + DR2 + DR3 + DR4 + DR5 + 20log(*N4*) | DR6_opt = DR1 + DR2_opt + DR3_opt + DR4_opt + DR5_opt + 20log(√*N4*) |
| Digital Gain | D.R. before input | | 230 dB (DR6) | 140 dB (DR6_opt) |
| | D.R. after output | | > 70 dB (DR7) | > 70 dB (DR7) |
| | Gain-control range | | 160 dB (DR6 - DR7) | 70 dB (DR6_opt - DR7) |

(*1) This estimation is based on a virtual model of the Doppler ultrasound system.

**Table 2.** Comparison of inner dynamic-range and gain control range

$$DRproc \ge DRin + 20 \cdot \log\left(\sqrt{N}\right) + 20 \cdot \log\left(\sqrt{12}\right) \tag{3}$$
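To make the Table 2 bookkeeping and equation (3) concrete, the snippet below accumulates the per-module dynamic-range increments for the conventional (20·log N) and optimized (20·log √N) readings. The integration factors are back-solved from Table 2's dB values and are illustrative only.

```python
import numpy as np

def increment_db(n, optimized):
    """D.R. a sub-block adds for integration factor n, per Table 2 / eq. (3)."""
    return 20 * np.log10(np.sqrt(n) if optimized else n)

# (module, assumed integration factor): chosen to match Table 2's dB column.
modules = [("DBF", 316), ("Mixer/BPF", 32), ("RG", 3.2),
           ("FFT", 100), ("MA", 316)]
margin = 20 * np.log10(np.sqrt(12))        # quantizing margin of eq. (3)

for optimized in (False, True):
    dr = 50.0                              # DR1: ADC output, 50 dB
    for _, n in modules:
        dr += increment_db(n, optimized)
    label = "new (sqrt)" if optimized else "conventional"
    print(f"{label}: DR6 ~ {dr:.0f} dB, +{margin:.1f} dB rounding margin")
# -> about 230 dB vs 140 dB, matching DR6 and DR6_opt in Table 2.
```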

When digitizing, the least significant bit (LSB) must be rounded, not truncated; otherwise an error of 1/2\*LSB remains. The RMS value of the quantizing noise is equivalent to (1/2)\*LSB/√3, so an additional dynamic-range margin of √12 must be maintained for rounding to be performed accurately. The internal dynamic-range DRproc is sufficient only for the automatic gain compensation based on system parameters; a further margin must be considered for the original gain adjustment, which compensates for the diagnostic target and Doppler sensitivity variation. The gain chart of the sub-block signal-processing with range expansion is shown in Fig. 24(b). The gain-control range of the conventional system (Fig. 24(a)) is DR6-DR7, and that of the new system (Fig. 24(b)) is DR6_opt-DR7. In general Doppler signal-processing, DR6 is above 200 dB while DR7 (the digital gain output) is the display luminance range (about 70 dB), so the gain-control range of the conventional system must exceed 130 dB, which is very large. The DR6_opt of the new system, on the other hand, is about 100 dB smaller than DR6: not only the gain-control range but also the inner dynamic-ranges of the sub-modules can be reduced at the same time [21].

**Figure 24.** Comparison of gain control systems (a) Gain chart of conventional system (b) Gain chart of new system.

The effect of the automatic gain optimization in the new system was checked in a simulation of RG-integration processing. The spectrum images of the conventional system and of the new system while changing the RG-width are shown in Fig. 25(a) and Fig. 25(b). The horizontal axis is time and the vertical axis is frequency normalized by the sampling frequency *fs*. The range-gate was switched from 1 mm to 4 mm to 16 mm at 1.8 s intervals, with a sinusoidal signal plus white noise as input. In the conventional system of Fig. 25(a), both the signal level and the noise level increase as the RG-width expands, so the user must reduce the gain manually. In the new system of Fig. 25(b), the signal level still rises with RG-width, but the noise level does not change. As described above, Doppler automatic gain compensation allows the input bit-length of each signal-processing block to be made smaller and the gain-adjustment range to be reduced to the necessary minimum.

**Figure 25.** Effect of automatic gain compensation: example of RG integration process. (a) Conventional system (b) New system
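The reason the noise floor stays put in Fig. 25(b) is the usual coherent-vs-incoherent scaling: over an RG window of *N* samples a coherent signal integrates as *N* while white noise grows only as √*N*, so normalizing the integral by √*N* holds the noise level fixed while the signal still rises. A toy check (sample counts here stand in only loosely for the 1-16 mm gate widths):

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (4, 16, 64):                       # RG width in samples
    s = np.ones(N)                          # coherent in-gate signal
    noise = rng.standard_normal((10000, N)) # white-noise realizations
    sig = abs(s.sum()) / np.sqrt(N)         # sqrt(N)-normalized integration
    nse = np.abs(noise.sum(axis=1)).mean() / np.sqrt(N)
    print(N, round(sig, 2), round(nse, 2))
# Signal column grows as sqrt(N); noise column stays ~0.8 for every N.
```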


#### **7. Conclusion**


The technical innovations brought about by digital signal-processing, and their results, were introduced through examples of Doppler ultrasound system architectures. Digital technology has realized not only extensive improvements in cost, size, power consumption, and adjustment, but also improvements in sensitivity and accuracy. Although the DSP is the most suitable device for real-time systems at present, the system architecture will be implemented as software once the computing power of CPUs/GPUs improves further. In the future, enormous computing power will be easily available, and it will become possible to apply it to real-time automatic diagnostic technology and other tasks beyond conventional signal-processing.

#### **Author details**

Baba Tatsuro

Toshiba Medical Systems Corporation, Japan

Address all correspondence to: EZD03014@nifty.ne.jp

#### **References**


[1] Baba, T. (2009). Progress of Doppler ultrasound system architecture and considerations: The problems caused by digital system and their solutions. *Society of Signal Processing Applications and Technology of Japan*, 12(1), 2-8.

[2] Baba, T. (2005). Investigation on direction split technique of Doppler ultrasound: Comparison of six kinds of Doppler audio processing. *Society of Signal Processing Applications and Technology of Japan*, 8(2), 14-20.

[3] Baba, T. (2007). Direction separation in Doppler audio of ultrasound diagnosis equipment: Signal processing for Doppler audio dealiasing. *Acoustical Science and Technology*, 28(3), 202-210, DOI: ast.28.202.

[4] Baba, T. (2005). Research on Doppler ultrasound automatic heartbeat cycle detection: The investigation of heartbeat cycle detection from Doppler waveform using adaptive BPF. *The Journal of the Acoustical Society of Japan*, 61(11), 629-635.

[5] Baba, T. (2006). Velocity range tracking in Doppler diagnostic ultrasound systems: Range optimization using Doppler trace wave form histograms. *The Journal of the Acoustical Society of Japan*, 62(4), 327-331.

[6] Baba, T., Ohmae, N., & Osuka, K. (2008). The Optimization of Ultrasound System Doppler Velocity Range using Hybrid Control. *Transactions of the Society of Instrument and Control Engineers*, 44(9), 760-765.

[7] Ohashi, T. (2001). Recording of World Heritage on the High Definition Audio-visual Media: Documentation of History and Tradition. *ITE Transactions on Media Technology and Applications*, 55(1), 37-46.

[8] Yoshikawa, S. (2002). Present status of high definition audio. *The Journal of the Acoustical Society of Japan*.

[9] Misawa, T. (2004). The image sensor for digital cameras. *Japanese Journal of Optics*.

[10] The Physical Society of Japan. (1978). Physics experiment data processing by a computer.

[11] Miyagawa, H. (1981). Digital Signal Processing, 9th edition. Tokyo: CORONA.

[12] Nakajima, T. (1996). Compact disk reader, 3rd edition. Tokyo: Ohmsha Ltd.

[13] Baba, T., & Toshiba Corp. (2008). Ultrasonic diagnostic equipment. Japanese Patent.

[14] Baba, T., & Toshiba Corp. (2006). Ultrasonic diagnostic equipment. Japanese Patent.

[15] Jensen, J. A. (1996). Estimation of Blood Velocities Using Ultrasound: A Signal Processing Approach. Cambridge University Press.

[16] Baba, T. (2008). Evaluation of Post Wall Filter for Doppler Ultrasound Systems.

[17] Baba, T. (2006). Investigation of wall filters in Doppler ultrasound system. *Society of Signal Processing Applications and Technology of Japan*.

[18] Kozak, M., & Karaman, M. (2001). Digital Phased Array Beamforming Using Single-Bit Delta-Sigma Conversion with Non-Uniform Oversampling. *IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control*.

[19] Engelberg, S. (2006). Implementing a ΔΣ DAC in Fixed Point Arithmetic. *IEEE Signal Processing Magazine*.

[20] Baba, T., & Toshiba Corp. (2005). Ultrasonic diagnostic equipment. Japanese Patent.

[21] Baba, T. (2009). Investigation of gain optimization technique in Doppler ultrasound system. *Acoustical Science and Technology*, 30(2), 61-71, DOI: ast.30.67.

**Chapter 12**

**FPGA Based Serial and Single-Clock Cycle Pipelined Fast Fourier Transforms in a Radio Detection of Cosmic Rays**

Zbigniew Szadkowski

http://dx.doi.org/10.5772/52946

Additional information is available at the end of the chapter

© 2013 Szadkowski; licensee InTech. This is an open access chapter distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### **1. Introduction**

Results from various ground-level cosmic-ray experiments point to the need for very-large-aperture detection systems for ultra-high-energy cosmic rays. With its nearly 100% duty cycle, its high angular resolution, and its sensitivity to the longitudinal air-shower evolution, the radio technique is particularly well suited for the detection of Ultra-High-Energy Cosmic Rays (UHECRs) in large-scale arrays. The present challenges are to understand the emission mechanisms and the features of the radio signal, and to develop an adequate measuring instrument. Electron-positron pairs generated in the shower development are separated and deflected by the Earth's magnetic field [1], [2], and hence produce an electromagnetic emission. During shower development, the charged particles are concentrated in a shower disk a few meters thick, which results in coherent radio emission up to about 100 MHz. Short but coherent radio pulses of 10 ns up to a few 100 ns duration are generated, with an electric field strength increasing approximately linearly with the energy of the primary cosmic particle inducing the extended air shower (EAS), i.e. a quadratic dependence of the radio pulse energy on the primary particle energy. In contrast to the fluorescence technique (used e.g. in the Pierre Auger Observatory [3]), whose duty cycle is about 12% because fluorescence detectors can work only during moonless nights, the radio technique allows nearly full-time measurements and long-range observations thanks to the high transparency of the air to radio signals in the investigated frequency range.

The radio detection technique will be complementary to the water Cherenkov detectors and allows a more precise study of the electromagnetic part of air showers in the atmosphere. In addition to a strong physics motivation, many technical aspects relating to the efficiency, saturation effects and dynamic range, the precision for timing, the stability of the hardware

© 2013 Szadkowski; licensee InTech. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.





#### **1. Introduction**

Results from various cosmic-ray experiments located at ground level point to the need for very large aperture detection systems for ultra-high energy cosmic rays. With its nearly 100% duty cycle, its high angular resolution, and its sensitivity to the longitudinal air-shower evolution, the radio technique is particularly well-suited for detection of Ultra High-Energy Cosmic Rays (UHECRs) in large-scale arrays. The present challenges are to understand the emission mechanisms and the features of the radio signal, and to develop an adequate measuring instrument. Electron-positron pairs generated in the shower development are separated and deflected by the Earth's magnetic field [1], [2]; hence they introduce an electromagnetic emission. During shower development, charged particles are concentrated in a shower disk of a few meters thickness. This results in a coherent radio emission up to about 100 MHz. Short but coherent radio pulses of 10 ns up to a few hundred ns duration are generated, with an electric field strength increasing approximately linearly with the energy of the primary cosmic particle inducing the extended air showers (EAS), i.e. a quadratic dependence of the radio pulse energy vs. primary particle energy. In contrast to the fluorescence technique (e.g. used in the Pierre Auger Observatory [3]) with a duty cycle of about 12% (fluorescence detectors can work only during moonless nights), the radio technique allows nearly full-time measurements and long-range observations due to the high transparency of the air to radio signals in the investigated frequency range.

The radio detection technique will be complementary to the water Cherenkov detectors and allows a more precise study of the electromagnetic part of air showers in the atmosphere. In addition to a strong physics motivation, many technical aspects relating to the efficiency, saturation effects and dynamic range, the precision for timing, the stability of the hardware developed, deployed and used, as well as the data collecting and system-health monitoring processes will be studied and optimized.



EAS are investigated in several experiments utilizing different detection techniques (scintillators, water Cherenkov and fluorescence detectors). Signals in the detectors depend on several parameters such as the energy, the type of the primary particle, the distance from the core, the angle of the registered shower, etc. Usually the triggering conditions are chosen so as to detect as wide a class of events as possible. However, sometimes the standard trigger conditions are not optimized for a specific class of events, which are then either not registered at all or registered with poor efficiency. In experiments utilizing water Cherenkov detectors, signals from photo-multipliers (PMTs) are usually digitized in ADCs and then processed by often-sophisticated electronics. In order to increase the signal-to-noise ratio, coincidence techniques are widely used. Typically signals from PMTs are analyzed on-line in both the amplitude and time domains. Strong signals in all PMT channels, corresponding to energetic showers detected near the core, are registered by a many-fold coincidence single-bin trigger with fixed thresholds. Showers detected far from the core give much lower signals, usually spread in time. Such events are detected by another type of trigger investigating the structure of the signal over some period (in a sliding time window).

The structure of the signals detected in water Cherenkov tanks and generated by horizontal showers depends strongly on the point of the EAS initialization. "Old" showers generated by hadrons early in the atmosphere give a flat muonic front; showers generated by deeply interacting neutrinos are characterized by a curved front (radius of curvature of a few km), a large electromagnetic component, and particles spread over an interval of a few microseconds [4]. In both cases the muonic front produces a bump, which can be a starting signature of horizontal showers. The bump for the "old" showers is shorter and sharper than for the "young" ones and results in a larger contribution to the higher Fourier coefficients. For "young" showers, with a relatively smooth signal profile, the lower Fourier components should dominate. On-line analysis of the Fourier components may therefore trigger on these specific events.

Existing software procedures, available as commercial IP routines, can calculate Fourier coefficients effectively utilizing an FFT algorithm. However, a software implementation is too slow to trigger events in real time. On-line triggering requires a hardware implementation calculating a multi-point DFT with sufficient speed. Modern powerful FPGAs can do this job; however, the resource requirement increases dramatically with the number of points. The analysis time interval should be a reasonable compromise between the time resolution and the resource occupancy in the FPGA.

#### **2. DFT**

The discrete Fourier transform (DFT) of length N calculates the sampled Fourier transform of a discrete-time sequence at N evenly distributed points $\omega_k = \frac{2\pi k}{N}$ on the unit circle. The following equation shows the length-N forward DFT of a sequence x(n):

$$X_k = \sum_{n=0}^{N-1} x_n e^{\frac{-2i\pi kn}{N}} \qquad k = 0, 1, \ldots, N-1 \tag{1}$$

The following equation shows the length-N inverse DFT:


$$\tilde{x}_k = \frac{1}{N} \sum_{n=0}^{N-1} X_n e^{\frac{2i\pi kn}{N}} \qquad k = 0, 1, \ldots, N-1 \tag{2}$$
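As an illustration only (not part of the original chapter), the following Python sketch implements the direct DFT of Eq. (1) and the inverse DFT of Eq. (2) and checks them against numpy's FFT; the O(N²) cost of this direct form is what motivates the fast algorithms below.

```python
# Direct O(N^2) DFT and inverse DFT per Eqs. (1) and (2); numpy assumed.
import numpy as np

def dft(x):
    """Length-N forward DFT, Eq. (1): X_k = sum_n x_n e^{-2i pi kn/N}."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

def idft(X):
    """Length-N inverse DFT, Eq. (2), including the 1/N normalization."""
    N = len(X)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return (np.exp(2j * np.pi * k * n / N) @ X) / N

x = np.random.randn(16)
assert np.allclose(dft(x), np.fft.fft(x))   # agrees with the FFT result
assert np.allclose(idft(dft(x)), x)         # the round trip recovers x
```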

The complexity of the direct DFT computation can be significantly reduced by using fast algorithms that apply a nested decomposition of the summation in Equations (1) and (2), in addition to exploiting various symmetries inherent in the complex multiplications. Such algorithms are the Radix-r Decimation-in-Time (DiT) or Radix-r Decimation-in-Frequency (DiF) Fast Fourier Transforms (FFT), which recursively divide the input/output sequence into N/r sequences of length r and require $\log_r N$ stages of computation.

The commercially offered FFT processors for FPGA applications require several clock cycles to accomplish the calculation of all complex DFT coefficients. Each stage of the decomposition typically shares the same hardware, with the data being read from memory, passed through the FFT processor and written back to memory. Each pass through the FFT processor has to be performed $\log_r N$ times. Popular choices of the Radix are r = 2, 4, and 16. Increasing the Radix of the decomposition reduces the number of passes required through the FFT processor at the expense of device resources. Such an approach is widely useful for many applications where timing is not crucial. However, there are areas where the FFT coefficients (based on a new set of samples) have to be known in each clock cycle; there, commercial FFT processors unfortunately cannot be used. This approach requires special algorithms optimized for a particular solution.

#### **2.1. Radix-2: Decimation-in-Time and Decimation-in-Frequency**

The Radix-2 algorithm is the simplest FFT variant. The decimation-in-time (DiT) Radix-2 FFT recursively partitions a DFT into two half-length DFTs of the even-indexed and odd-indexed time samples. For the Radix-2 DiT, we get:

$$X_k = \sum_{n=0}^{N-1} x_n e^{-2i\pi kn/N} = \sum_{n=0}^{\frac{N}{2}-1} x_{2n} e^{-i\frac{2\pi kn}{N/2}} + e^{-i\frac{2\pi k}{N}} \sum_{n=0}^{\frac{N}{2}-1} x_{2n+1} e^{-i\frac{2\pi kn}{N/2}} = \tag{3}$$

$$= DFT_{\frac{N}{2}} \left[ x_0, x_2, \ldots, x_{N-2} \right] + W_N^k \times DFT_{\frac{N}{2}} \left[ x_1, x_3, \ldots, x_{N-1} \right]$$
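The recursion of Eq. (3) can be sketched in a few lines of Python (an illustrative model assuming numpy, not the FPGA implementation discussed later):

```python
# Recursive Radix-2 Decimation-in-Time FFT per Eq. (3); N must be a
# power of two. Even/odd half-DFTs are merged with twiddle factors W_N^k.
import numpy as np

def fft_dit_radix2(x):
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    E = fft_dit_radix2(x[0::2])   # DFT_{N/2}[x_0, x_2, ..., x_{N-2}]
    O = fft_dit_radix2(x[1::2])   # DFT_{N/2}[x_1, x_3, ..., x_{N-1}]
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors
    return np.concatenate([E + W * O, E - W * O])

x = np.random.randn(1024)
assert np.allclose(fft_dit_radix2(x), np.fft.fft(x))
```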

For the Radix-2 DiF, we get :

$$\bar{X}_{2k} = \sum_{n=0}^{N-1} x_n W_N^{2kn} = DFT_{\frac{N}{2}} \left[ x_n + x_{n + \frac{N}{2}} \right] \tag{4}$$

$$\bar{X}_{2k+1} = \sum_{n=0}^{N-1} x_n W_N^{(2k+1)n} = DFT_{\frac{N}{2}} \left[ \left( x_n - x_{n + \frac{N}{2}} \right) W_N^n \right],$$
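The DiF split of Eq. (4) can be checked numerically with the following sketch (illustrative, numpy assumed): the even coefficients come from the sum of the two half-blocks, the odd coefficients from their twiddled difference.

```python
# Numerical check of the Radix-2 Decimation-in-Frequency split, Eq. (4).
import numpy as np

x = np.random.randn(8) + 1j * np.random.randn(8)
N = len(x)
W = np.exp(-2j * np.pi * np.arange(N // 2) / N)      # W_N^n
X = np.fft.fft(x)
even = np.fft.fft(x[:N // 2] + x[N // 2:])           # X_{2k}
odd = np.fft.fft((x[:N // 2] - x[N // 2:]) * W)      # X_{2k+1}
assert np.allclose(X[0::2], even)
assert np.allclose(X[1::2], odd)
```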

**Figure 1.** Splitting of the N-point DFT into two N/2-point parallel procedures for Decimation-in-Time (left) and Decimation-in-Frequency (right), on the basis of the 8-point Radix-2 algorithms. (a) The 8-point Radix-2 Decimation-in-Time algorithm: for real samples $x_k$ the Fourier coefficients $G_k$ and $H_k$ of the N/2-point DFTs are complex, and calculation of the final N-point Fourier coefficients requires complex multiplications by the factors $W_N^k$ for $k > 0$. (b) The 8-point Radix-2 Decimation-in-Frequency algorithm: for real samples $x_k$ the supporting variables $g(k)$ and $h(k)$ require only real additions and subtractions.


The N-point DFT can easily be split into two N/2-point transforms. The outputs from the DFT procedures are complex, so a calculation of the final DFT coefficients using the DiT algorithm requires complex multiplications for the final merging of data from the lower-order parallel DFT procedures (i.e. multiplication by the twiddle factors $W_N^k$):

$$W_N^k = e^{-i\frac{2\pi k}{N}} \tag{5}$$

The outputs of the two N/2-point procedures are denoted by $G[k]$ and $H[k]$ in Figure 1. For the DiF algorithm the 1st stage requires additions and subtractions only. Odd indices require additional multiplications; even indices, however, pass without modification to the next N/2-point DFT procedure (compare Figures 1a and 1b).

#### **2.2. Radix-4 algorithm**

The Radix-4 algorithm operates on four inputs and four outputs. The FFT length is $4^p$, where p is the number of stages; the number of stages is half that of Radix-2. The Radix-4 DiF FFT divides an N-point DFT into four N/4-point DFTs, then into 16 N/16-point DFTs, and so on.

For Radix-4 DiF, we get :

$$X_k = \sum_{n=0}^{N-1} x_n e^{\frac{-2i\pi kn}{N}} = \sum_{n=0}^{N/4-1} x_n e^{\frac{-2i\pi kn}{N}} + \sum_{n=N/4}^{N/2-1} x_n e^{\frac{-2i\pi kn}{N}} + \sum_{n=N/2}^{3N/4-1} x_n e^{\frac{-2i\pi kn}{N}} + \sum_{n=3N/4}^{N-1} x_n e^{\frac{-2i\pi kn}{N}} \tag{6}$$

$$= \sum_{n=0}^{N/4-1} e^{\frac{-2i\pi kn}{N}} \left[ x_n + (-i)^k x_{n+N/4} + (-1)^k x_{n+N/2} + (i)^k x_{n+3N/4} \right] \tag{7}$$
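The grouping of Eq. (7) can also be verified numerically; this sketch (illustrative, numpy assumed) combines the four quarter-length segments with the factors $(-i)^k$, $(-1)^k$ and $(i)^k$:

```python
# Numerical check of the Radix-4 DiF grouping of Eqs. (6)-(7).
import numpy as np

x = np.random.randn(16) + 1j * np.random.randn(16)
N = len(x)
X = np.fft.fft(x)
q = x.reshape(4, N // 4)   # quarters x_n, x_{n+N/4}, x_{n+N/2}, x_{n+3N/4}
n = np.arange(N // 4)
for k in range(N):
    combo = q[0] + (-1j) ** k * q[1] + (-1) ** k * q[2] + (1j) ** k * q[3]
    assert np.isclose(X[k], np.sum(np.exp(-2j * np.pi * k * n / N) * combo))
```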

This algorithm is widely used; however, as shown in the next section, the simple application of the DiT or DiF algorithms in all sequential steps still leaves room for further optimization.

#### **2.3. Architectures of the Altera® FFT MegaCore® functions**


The Radix-4 decomposition, which divides the input sequence recursively to form four-point sequences, has the advantage that it requires only trivial multiplications in the 4-point DFT; for this reason it is the Radix algorithm chosen in the Altera® FFT MegaCore® function. This results in the highest-throughput decomposition, requiring non-trivial complex multiplications only in the post-butterfly twiddle-factor rotations. In cases where N is an odd power of two, the FFT MegaCore automatically implements a Radix-2 pass as the last pass to complete the transform.

#### *2.3.1. Streaming architecture*

To maintain a high signal-to-noise ratio throughout the transform computation, the FFT MegaCore function uses a block-floating-point architecture, which is a compromise between fixed-point and full floating-point architectures. In the fixed-point architecture, the data precision needs to be large enough to correctly represent all intermediate values throughout the transform computation. For large FFT transform sizes, a fixed-point FFT implementation that allows for word growth can either make the data width excessive or lead to a loss of precision.

In the floating-point architecture each number is represented as a mantissa with an individual exponent. While this leads to greatly improved precision, floating-point operations tend to demand increased device resources.

In the block-floating point architecture, all of the values have an independent mantissa but share a common exponent in each data block. Data is input to the FFT function as fixed point complex numbers (Figure 2).

**Figure 2.** A simulation of the Fourier transform for the Altera® library routine of 1024 points and for streaming architecture. Each block of 1024 Fourier coefficients (Fc) is scaled by the factor FFT.exp. Fourier coefficients are provided in a serial way, each pair of real and imaginary parts of a single Fc in a single time bin. All Fc are calculated in 1024 time bins. FFT.sop (start of package) and FFT.eop (end of package) indicate begin and end of each 1024-point block.

The block-floating-point architecture ensures full use of the data width within the FFT function and throughout the transform. After every pass through the Radix-4 FFT, the data width may grow by up to $\log_2(4\sqrt{2}) = 2.5$ bits. The data is scaled according to a measure of the block dynamic range on the output of the previous pass. The number of shifts is accumulated and then output as an exponent for the entire block. This shifting ensures that a minimum number of least significant bits (LSBs) is discarded prior to the rounding of the post-multiplication output. In effect, the block-floating-point representation acts as a digital automatic gain control. To yield uniform scaling across successive output blocks, you must scale the FFT function output by the final exponent [5].
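To make the shared-exponent idea concrete, here is a much-simplified Python sketch of block-floating-point quantization; the helper bfp_quantize is hypothetical and is not the Altera® MegaCore implementation.

```python
# Simplified block-floating-point quantization: one exponent per block,
# chosen from the block's dynamic range, so the mantissa width is fully used.
import numpy as np

def bfp_quantize(block, mantissa_bits=12):
    """Return integer mantissas and a single shared exponent for the block."""
    peak = np.max(np.abs(block))
    # smallest exponent e such that |x| / 2^e fits in a signed mantissa
    exp = max(0, int(np.ceil(np.log2(peak + 1e-12))) - (mantissa_bits - 1))
    mant = np.round(block / 2.0 ** exp).astype(int)
    return mant, exp

block = np.array([37000.0, -1200.0, 5.0, 800.0])
mant, exp = bfp_quantize(block)
print(mant, exp)             # mantissas fit in 12 signed bits, exponent 5
print(mant * 2.0 ** exp)     # approximate reconstruction of the block
```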

#### *2.3.2. Variable streaming architecture*

The variable streaming architecture uses two different types of architecture, depending on whether the fixed-point or the floating-point representation is selected. For the fixed-point data representation, the FFT variation uses a Radix-2<sup>2</sup> single delay feedback architecture, which is fully pipelined. For the floating-point representation, the FFT variation uses a mixed Radix-4/2 architecture: for a length-N transform, $\log_4(N)$ stages are concatenated together. The Radix-2<sup>2</sup> algorithm has the same multiplicative complexity as the fully pipelined Radix-4 architecture, but the butterfly unit retains the Radix-2 architecture. In the Radix-4/2 algorithm, a combination of Radix-4 and Radix-2 architectures is implemented to achieve the computational advantage of the Radix-4 architecture while supporting FFT computation with a wider range of transform lengths. The butterfly units use the DiF decomposition.


The fixed-point representation allows for natural word growth through the pipeline. The maximum growth of each stage is 2 bits. After the complex multiplication the data is rounded to the expanded data size using convergent rounding. The overall bit growth is less than or equal to $\log_2(N) + 1$. The internal floating-point data representation is single-precision floating point (32-bit). Floating-point operations provide more precise computation results but are costly in hardware resources. To reduce the amount of logic required for floating-point operations, the variable streaming FFT uses "fused" floating-point kernels. The reduction in logic occurs by fusing together several floating-point operations and reducing the number of normalizations that need to occur [5].
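The convergent rounding mentioned above is round-half-to-even; a small sketch for non-negative fixed-point words (a hypothetical helper, not Altera® code) follows.

```python
# Convergent (round-half-to-even) rounding while dropping `shift` LSBs.
def convergent_round_shift(value, shift):
    """Shift a non-negative integer right by `shift` bits, ties to even."""
    whole = value >> shift
    frac = value & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if frac > half or (frac == half and (whole & 1)):
        whole += 1
    return whole

assert convergent_round_shift(0b10101, 1) == 10   # 10.5 -> 10 (even)
assert convergent_round_shift(0b10111, 1) == 12   # 11.5 -> 12 (even)
```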

#### **3. An FPGA based RFI filter for radio detection of cosmic rays**

#### **3.1. A physical background**

The energy threshold of radio detection of cosmic rays is limited by the considerable radio background and noise. The very high level of radio frequency interference (RFI) in the FM and short-wave bands has to be eliminated by a band-pass filter amplifier. Within the remaining receiver bandwidth of 30 to 80 MHz, the noise in the quiet rural environment of cosmic-ray experiments is dominated by the frequency-dependent galactic noise [6], with noise temperatures of 5000 K at 60 MHz.

In addition to galactic noise, there is still a human-made background. This background consists of continuous signals, such as from a few radio and TV stations, and transients produced by machines. Without an effective trigger, a stable and low-level energy threshold is not guaranteed. Furthermore, the data rate for communication of the triggered data to the central DAQ would exceed the available power budget.

For self-triggered measurements, the data will be digitized and processed in real time by a powerful FPGA chip. The narrow peaks in the frequency domain due to radio frequency interferences have to be strongly suppressed before building a trigger. These peaks are removed by a median filter. The filter works in the frequency domain using the Fast Fourier Transform (FFT) routine provided by Altera®. Furthermore, the phase of the signal deformed by the steep band pass filter is reconstructed by a deconvolution in the frequency domain.

The median FPGA filter eliminates mono-frequent carriers, but broadband radio pulses from cosmic showers are not affected. After a second inverse FFT, signals are converted back to the time domain. This chain of the digital signal processing strongly enhances the signal to noise ratio, and thus improves the radio pulse detection sensitivity (Figure 3).
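A high-level software model of this (FFT + median filter + iFFT) chain might look as follows; the block length, kernel size, clipping threshold and test signal are illustrative assumptions, and scipy.signal.medfilt merely stands in for the FPGA median filter.

```python
# Sketch of the RFI-cleaning chain: FFT -> median filter on the magnitude
# spectrum -> clip narrow-band peaks -> zero the DC coefficient -> iFFT.
import numpy as np
from scipy.signal import medfilt

fs = 180e6                                    # 180 MSPS sampling
t = np.arange(1024) / fs
block = (np.random.randn(1024)                # broadband noise
         + 3 * np.sin(2 * np.pi * 55e6 * t)   # narrow-band RFI carrier
         + 2300)                              # ADC pedestal (offset)

X = np.fft.rfft(block)
mag = np.abs(X)
smooth = medfilt(mag, kernel_size=31)         # median estimate of the spectrum
peaks = mag > 3 * smooth                      # narrow-band outliers
X_clean = np.where(peaks, X * smooth / (mag + 1e-12), X)
X_clean[0] = 0                                # zero the 0th (DC) coefficient
cleaned = np.fft.irfft(X_clean, n=1024)       # signed time-domain data, no offset
```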

Due to the Nyquist theorem, the 80 MHz band in use should be sampled at a rate of at least 160 MHz. An application of 16-bit ADCs with such a sampling rate would be a challenge for the price, the power consumption and the PCB routing needed to keep a reasonable noise level. The practical option used is a 12-bit ADC at 180 MSPS, leaving sufficient space for the anti-aliasing filter and implementing a high- and low-gain channel to obtain the required dynamic range.

**Figure 3.** A diagram showing a (FFT + Median filter + iFFT) chain cleaning the signal from the RFI contamination. The 1*st* graph shows the ADC input as unsigned data with an offset of ca. 2300 ADC-counts, the 2*nd* - the absolute values of FFT coefficients in the frequency domain, the 3*rd* - FFT coefficients "decontaminated" by the median filter and 4*th* - signal converted back to the time domain. Additionally, the 0*th* FFT coefficient has been zeroed. Thus, the cleaned signal in the time domain is represented as signed data without the offset. The amplitude of the signal remains roughly the same and the noise is considerably reduced.

The necessary filtering accuracy requires at least 1024-point Fourier transforms. For the 180 MHz sampling, this corresponds to a 360 kHz resolution in the frequency domain. Shorter transformation blocks give too coarse a filtering and may affect real signals from showers. For these parameters, the RFI filter has been developed and optimized [7].

#### **3.2. Selection of the FFT architecture**

The Altera® FFT MegaCore offers 4 types of FFT engines with various architectures:

• streaming
• burst
• variable streaming
• buffered burst

calculating the FFT and iFFT in real time. All architectures can be implemented as a fixed-point FFT, whereas the variable streaming architecture can also be configured in a floating-point data representation. A comparison of the resource occupancy of the different architectures is given in Table 1. Parameters are shown for 12-bit and 16-bit data processing.




**Table 1.** Utilization of resources for various FFT architectures at 12-bit and 16-bit data processing, as reported by the Altera® wizard. The 2*nd* column shows the Transform Calculation Cycles (TCC), the 3*rd* the Block Throughput Cycles (BTC), the 4*th* the required Logic Elements (LE), the 5*th* the required memory bits, and the 6*th* the required Digital Signal Processing (DSP) blocks. Columns 4 to 6 correspond to the 12-bit data processing, columns 7 to 9 to the 16-bit processing, respectively.

| Architecture | TCC | BTC | LE (12-bit) | Memory bits (12-bit) | DSP (12-bit) | LE (16-bit) | Memory bits (16-bit) | DSP (16-bit) |
|---|---|---|---|---|---|---|---|---|
| streaming | 1024 | 1024 | 3723 | 155 648 | 24 | 4952 | 155 648 | 24 |
| variable streaming (fixed, bit-reversed) | 1024 | 1024 | 6139 | 31 792 | 48 | 7175 | 39 976 | 56 |
| variable streaming (floating, bit-reversed) | 1024 | 1024 | 23 000 | 73 568 | 128 | 23 000 | 73 568 | 128 |
| variable streaming (fixed, natural order) | 1024 | 1024 | 6139 | 82 992 | 48 | 7175 | 100 380 | 56 |
| variable streaming (floating, natural order) | 1024 | 1024 | 23 000 | 139 104 | 128 | 23 000 | 139 104 | 128 |
| burst (single engine) | 1113 | 3162 | 2814 | 57 344 | 24 | 3804 | 57 344 | 24 |
| burst (4 engines) | 345 | 2394 | 7864 | 114 688 | 96 | 11 136 | 114 688 | 96 |
| buffered burst (single engine) | 1103 | 1291 | 3202 | 122 880 | 24 | 4197 | 122 880 | 24 |
| buffered burst (4 engines) | 335 | 1099 | 8517 | 245 760 | 96 | 11 885 | 245 760 | 96 |

For the RFI filtering scheme shown in Figure 3, the sampled ADC data have to be processed continuously in real time. "Continuously" means that no dead time is acceptable. Data can be processed in blocks of fixed length, but no sample may be ignored. This requirement eliminates two architectures, burst and buffered burst, because for a 1024-point block (and the 1024 clock cycles during which samples appear at the ADC output) these architectures require more than 1024 clock cycles for processing (BTC = 3162, 2394, 1103 and 1099 for burst and buffered burst with single and 4 engines, respectively). In every configuration the fundamental requirement of no dead time is violated.

The floating-point representation for the variable streaming architecture requires a huge amount of logic elements and DSP blocks. With two cascaded FFT engines for the two polarization channels, almost all resources would be utilized for the FFT engines alone, leaving no resources for other tasks. Additionally, the Altera® documentation shows that the registered performance for this architecture is much below our expectations (on the level of 110 MHz, while we need at least 180 MHz for the signal processing).

Some FFT applications require the FFT + the user operation + the iFFT chain. In this case, a careful selection of the input and output order can save significant memory and latency. If the input to the first FFT is in the natural order and the output is in the bit-reversed order, the FFT engine operates in a mode with minimal resource utilization (called Engine-only mode). Thus, if the iFFT operation is configured to accept bit-reversed inputs and to produce natural-order outputs (the iFFT again operating in Engine-only mode), only the minimum amount of memory is required, which provides a saving of N complex memory words and a latency saving of N clock cycles, where N is the size of the current transform.

However, in the case of RFI filtering by the median filter, the sequence of FFT coefficients in the frequency domain has to be in natural order in order to eliminate/suppress the narrow-band peaks. The FFT routines therefore have to work in the Engine with bit-reversal mode only. Two architectures survived the selection: (a) streaming and (b) variable streaming with natural order and the fixed-point data representation.
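For reference, the bit-reversed order mentioned above is the following index permutation (a small illustrative sketch; the function name is ours):

```python
# Bit-reversal permutation relating Engine-only FFT output order to the
# natural coefficient order, for a power-of-two transform size n.
def bit_reversed_indices(n):
    bits = n.bit_length() - 1
    return [int(format(i, f'0{bits}b')[::-1], 2) for i in range(n)]

print(bit_reversed_indices(8))   # [0, 4, 2, 6, 1, 5, 3, 7]
# natural[k] = engine_output[bit_reversed_indices(n)[k]] undoes the permutation
```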


**Figure 4.** Histograms of reconstruction errors for the streaming architecture (differences between the original ADC data and the data after application of the 12-bit (a) and 16-bit (b) wide FFT and inverse FFT). "T" denotes the ideal configuration with zeroed offset. However, for a small shift of only 2%, the error distributions become wider and flatter.

#### **3.3. Streaming architecture**


The streaming architecture accepts as input a two's complement format with a complex data vector of length N, where N is the desired transformation block length. The function output is given as a complex vector in the natural order. An accumulated block exponent is given to indicate any data scaling that has occurred during the transformation to maintain precision and maximize the internal numerical signal-to-noise ratio.

The signed block exponent, used for scaling of internal signal values, remains constant for a full data block. For relatively small variations of the signal samples $x_n$ (typical for noise background) but with a non-negligible pedestal, the Fourier component $\bar{X}_0$ may be relatively large, whereas the $\bar{X}_{n \neq 0}$ components are rounded off to relatively small values. This may cause large errors in the reconstructed signals after going through the FFT/iFFT chain. Hence, the pedestal has to be subtracted carefully from the input signal. Reconstruction errors for 1024-point transforms of an event signal recorded in a real cosmic-ray experiment are shown in Figure 4.
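A small numerical illustration of why the pedestal matters (a sketch with assumed values, not data from the experiment):

```python
# With a pedestal, X_0 dwarfs every other coefficient, so a shared block
# exponent sacrifices the precision of all X_{n != 0}.
import numpy as np

noise = np.random.randn(1024)
X_raw = np.fft.fft(noise + 2300)             # ~2300 ADC-count pedestal
X_sub = np.fft.fft(noise - np.mean(noise))   # pedestal subtracted first
print(abs(X_raw[0]) / abs(X_raw[1:]).max())  # X_0 larger by orders of magnitude
print(abs(X_sub[0]))                         # essentially zero after subtraction
```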

The streaming architecture unfortunately introduces significant distortions of signals in the data processing of the FFT+iFFT cascade chain. The reconstruction errors for the 12-bit processing are at an unacceptable level of 10 and more ADC counts. The 16-bit configuration introduces smaller reconstruction errors and may be used for real data processing; however, the influence of the data-processing errors has to be carefully taken into account for the final trigger and the recorded data.

Figure 5 shows a possible optimization, where 12-bit data is processed in a 14-bit FFT engine and the 2 least significant bits are grounded and treated as a potentially fractional part.

#### **3.4. Variable streaming architecture**

The 12-bit input FFT routine with the variable streaming architecture yields 25-bit Re/Im Fourier coefficients. Processing both buses at full width in the iFFT procedure would be too wasteful and would slow down the processing significantly. A reasonable compromise for the selection of the input lines driving the iFFT routine is required.

Figure 6 shows that cropping the output FFT bus to 12 bits already provides a good reconstruction. The error is on the level of one ADC count. This is achieved at the expense of 2000 additional LEs and 24 additional DSP blocks. However, this architecture's maximum clock frequency of roughly 200 MHz (for the selected FPGA from the Cyclone® III family) is too low.


#### **3.5. Aliasing and leakage removal**

The incoming data stream must be chopped into blocks to be processed by the FFT routine. If signal pulses are located close to the border of a block, aliasing occurs. It manifests itself as a spurious contribution at the opposite border of the block and in the neighboring block as well. This effect may cause spurious pulses and has to be eliminated. The leakage effect is caused by the finite length of the blocks, acting like an applied rectangular window function: a signal amplitude leaks from one frequency bin to another. By using a suitable window function, the leakage effect can be reduced. To keep the algorithmic costs low, we use a window function with a constant middle part, such as a trapezoidal shape or a Tukey window.
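For example, a Tukey window with a constant middle part can be generated as below (an illustrative sketch; the value of alpha is an assumption, not the chapter's parameter).

```python
# Tukey window: flat over the central (1 - alpha) fraction of the block,
# cosine tapers at both ends, so most samples keep their full amplitude.
import numpy as np
from scipy.signal.windows import tukey

N, alpha = 1024, 0.125
w = tukey(N, alpha)
windowed_block = w * np.random.randn(N)   # apply before the FFT
```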

**Figure 5.** Histograms of reconstruction errors for the streaming architecture (differences between the original ADC data and the data after application of the 14-bit wide FFT and inverse FFT). The 12-bit input data is connected either to the lower 12 bits (starting from the LSB, "low") or to the higher 12 bits (starting from the MSB, "high"). For the "low" configuration the 13*th* and 12*th* input bits are connected to the sign (11*th*) bit; the distribution of the reconstruction errors is rather wide. For the "high" configuration the 0*th* and 1*st* bits are grounded and play the role of a zeroed fractional input part. With only this modification of the input connection, the error distribution is significantly narrower.

**Figure 6.** Histogram (a) of reconstruction errors for the variable streaming architecture (differences between the original ADC data and data after application of the 12-bit wide FFT and inverse FFT). The right plot (b) shows differences for raw data.

Both problems can only be solved, without introducing dead time between the blocks, by using an overlapping routine [7]. Therefore the filter engine must run in another clock domain with higher frequency. Preliminary estimation shows that for an overlapping of N = 32 errors due to an aliasing contribution is acceptable, however for a better safety margin N = 64 is preferred. N = 128, allows a total removal of aliasing effect, however this option requires too high over-clocking according to Table III. An odd value like N = 73 seems to be a valid compromise, although requiring some special modules to assure a seamless hand over of the data stream between the different clock domains.

10 Design and Architectures for Digital Signal Processing

**3.5. Aliasing and leakage removal**

distribution is significantly narrower.

Figure 6 shows that cropping the output FFT bus to 12 bits provides already a good reconstruction. The error is on the level of one ADC-count. This is achieved at the expenses of 2000 additional LEs and 24 additional DSP blocks. However, this architecture's maximum clock frequency of roughly 200 MHz (for selected FPGA from Cyclone® III family) is too low.

The incoming data stream must be chopped into blocks to be processed by the FFT routine. If signal pulses are located close to the border of a block, aliasing occurs. It manifests by a spurious contribution in the opposite border of the block and in the neighboring block as well. This effect may cause spurious pulses and has to be eliminated. The leakage effect is caused by the finite length of the blocks, acting like an applied rectangular window function. Thus, a signal amplitude leaks from one frequency bin to another. By using a suitable window function, the leakage effect can be reduced. To keep algorithmic costs low, we use a window function with a constant middle part like a trapezoidal shape or a Tukey-window.

**Figure 5.** Histograms of reconstruction errors for the streaming architecture (differences between the original ADC data and data after application of the 14-bit wide FFT and inverse FFT). The width of input data is 12 bits connected to low 12 bits (starting from LSB) ("low") or to higher 12 bits (starting from MSB (high). For "low" configuration 13*th* and 12*th* input bits are connected to the sign (11*th*) bit. A distribution of the reconstruction errors is rather wide. For the "high" configuration 0*th* and 1*st* are grounded and they play role of a fractional zeroed input part. For a such modification of input connection only, the error

(a) (b)

**Figure 6.** Histogram (a) of reconstruction errors for the variable streaming architecture (differences between the original ADC data and data after application of the 12-bit wide FFT and inverse FFT). The right plot (b) shows differences for raw data.

**Figure 7.** An example of spurious envelopes due to aliasing, when a signal appears close to the border of the converted blocks: 128, 32 and 8 time bins from the border, and exactly on the border, respectively.

Figure 7 shows the potential danger if aliasing were ignored. If the signal appears relatively far from the block border (i.e. 128 time bins for a 1024-point conversion), the envelope of the signal is reconstructed rather well (Figure 7a): there are no false peaks which could be recognized as spurious triggers. If the signal appears relatively close to the block border (Figure 7b), some spurious wings can be observed at the borders of the neighboring blocks. However, if a relatively strong signal appears close to the block border (Figure 7c), spurious peaks are created at both borders, and there is a very high danger that these spurious peaks are mistakenly taken as a trigger. If the signal appears exactly on the border of two blocks (Figure 7d), the spurious peaks can reach an amplitude of more than 30 % of the real signal. An additional procedure removing the spectral leakage absolutely has to be used to keep the reliability of the system high.

#### **3.6. Simultaneous processing of two signals with perpendicular polarizations**

Each antenna station measures radio signals in two opposite polarization channels. Thus, it would be straightforward to use two FFT engines for calculating the frequency-domain signals, while setting their imaginary inputs to zero. A more efficient way is to exploit the symmetries of the FFT: the data streams of both antenna channels (*N* windowed signal samples $f_j$ and $g_j$) are connected to the real and imaginary component inputs of one FFT engine, respectively. The resulting output components, $H_n$, are given in (7).

$$H_n = \sum_{j} e^{2\pi i j n / N} (f_j + i g_j) \tag{7}$$


The *Hn* can then easily be disentangled into the Fourier components, *Fn* and *Gn*, by the following equations (8)

$$H\_{\rm n} + H\_{\rm N-n}^\* = 2F\_{\rm n} \quad H\_{\rm n} - H\_{\rm N-n}^\* = 2iG\_{\rm n} \tag{8}$$

The (N−n) indices in (8) correspond in a real-time system to a time-reversed order. The $H_n$ and $H^*_{N-n}$ are synchronized by a routine inverting the order of the $H^*_n$, like a First-In-Last-Out (FILO) buffer, and by a delay routine for the $H_n$ running in parallel. In this way, the number of needed FFT engines is reduced from two to one.
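This packing and disentangling is easy to check off-line. A minimal numpy sketch (numpy's FFT uses the kernel $e^{-2\pi i jn/N}$, i.e. the conjugate of eq. (7); the relations (8) hold for real inputs in either convention):

```python
import numpy as np

# Two real channels f and g are packed into one complex input, a single FFT
# is run, and the spectra F and G are recovered via eq. (8).
N = 1024
rng = np.random.default_rng(0)
f = rng.normal(size=N)            # polarization channel 1 (real samples)
g = rng.normal(size=N)            # polarization channel 2 (real samples)

H = np.fft.fft(f + 1j * g)        # one FFT engine instead of two

# H_{N-n}, with n = 0 mapped onto itself (index arithmetic modulo N)
H_rev = H[(-np.arange(N)) % N]

F = 0.5 * (H + np.conj(H_rev))    # 2 F_n  = H_n + H*_{N-n}
G = -0.5j * (H - np.conj(H_rev))  # 2i G_n = H_n - H*_{N-n}

assert np.allclose(F, np.fft.fft(f))
assert np.allclose(G, np.fft.fft(g))
```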

After the iFFT, the envelopes *fenv*(*t*) and *genv*(*t*) (Figure 8) of the output signal *x*(*t*) have to be created to allow the following trigger algorithms to discriminate specific pulse shapes in each channel.

**Figure 8.** Schematic view of the resource-optimized implementation of the two antenna chains with opposite polarizations, each consisting of an FFT, a median filter, deconvolution, a Hilbert transform (Im(f) and Im(g)) and FIR filters.
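The envelope step itself can be illustrated off-line with a standard Hilbert transform (a sketch only; the pulse parameters below are arbitrary, and the FPGA chain of Figure 8 builds the imaginary part in the FFT domain instead):

```python
import numpy as np
from scipy.signal import hilbert

# The analytic signal f + i*H{f} has a magnitude that rides on top of the
# oscillating pulse - this magnitude is the envelope f_env(t).
t = np.arange(1024) * 5e-9                       # 5 ns sampling (200 MHz)
f = np.exp(-((t - 2.5e-6) / 2e-7) ** 2) * np.cos(2 * np.pi * 55e6 * t)

f_env = np.abs(hilbert(f))                       # envelope f_env(t)
assert np.all(f_env >= np.abs(f) - 1e-12)        # envelope bounds the signal
```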

#### **4. Wavelets**

Let us investigate a time series X, with values of *xn*, at time index n. Each value is separated in time by a constant time interval ∆t. The wavelet transform *Wn*(*s*) is just the inner product (or convolution) of the wavelet function with our original time series:

$$W_n(s) = \sum_{m=0}^{N-1} x_m \psi^* \left[ \frac{(m-n)\Delta t}{s} \right] \tag{9}$$

where the asterisk (\*) denotes complex conjugate. The above sum can be evaluated for various values of the scale s (usually taken to be multiples of the lowest possible frequency), as well as all values of n between the start and end dates.

It is possible to compute the wavelet transform in the time domain according to (9). However, it is much simpler to use the fact that the wavelet transform is the convolution of the two functions X and *ψ*, and to carry out the wavelet transform in Fourier space using the Fast Fourier Transform (FFT). In the Fourier domain, the wavelet transform is:


$$W_n(s) = \sum_{k=0}^{N-1} \bar{X}_k \bar{\Psi}^*(s\omega_k) e^{i\omega_k n \Delta t} \tag{10}$$

Unlike the convolution, the FFT method allows the computation of all n points simultaneously, and can be efficiently coded using any standard FFT package.

Wavelet coefficients allow an estimation of the signal power. The global wavelet spectrum, defined as the time average over a series of p-wavelet powers, can be expressed as [8]:

$$\bar{\mathcal{W}}^2(p) = \frac{1}{N} \sum\_{k=0}^{N-1} |\mathcal{W}\_k(p)|^2 = \frac{1}{N} \sum\_{k=0}^{N-1} |\bar{X}\_k \times \bar{\Psi}\_k(p)|^2 \tag{11}$$

A sum of products of the Fourier coefficients calculated by an FFT32 routine for the ADC data ($x_n$) in each clock cycle with the pre-calculated Fourier coefficients of a reference wavelet gives an estimation of the signal power for the selected type of wavelet. Only a single FFT32 routine for the on-line calculation of the Fourier coefficients of the data is needed. The Fourier coefficients of the various wavelets can be calculated beforehand and be available as constants for the final power estimation.
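A minimal off-line sketch of this power estimation follows. The Morlet mother wavelet, its parameter omega0 and the single scale s are illustrative assumptions (the chapter does not fix the wavelet type here); Psi_bar would be a table of constants in the FPGA, and only X_bar is computed on-line:

```python
import numpy as np

# Global wavelet power of eq. (11) for one 32-sample ADC window.
N, dt = 32, 5e-9                                  # 200 MHz sampling
t = np.arange(N) * dt
x = np.exp(-((t - 80e-9) / 30e-9) ** 2) * np.cos(2 * np.pi * 50e6 * t)

omega_k = 2 * np.pi * np.fft.fftfreq(N, dt)
omega0, s = 6.0, 1.0 / 50e6                       # assumed Morlet parameters
Psi_bar = np.pi ** -0.25 * np.exp(-0.5 * (s * omega_k - omega0) ** 2) * (omega_k > 0)

X_bar = np.fft.fft(x)                             # the only on-line FFT
W2 = np.mean(np.abs(X_bar * Psi_bar) ** 2)        # global wavelet power, eq. (11)
print(W2)
```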

A fundamental limitation for on-line wavelet analysis in an FPGA is the amount of embedded DSP multipliers; a multiplication built from logic elements is rather inefficient. The Quartus® II environment for Altera® FPGA programming provides parametrized FFT routines with various architectures: streaming, variable streaming, burst and buffered burst. However, all of these routines deliver the FFT coefficients in serial form (Figure 4); no Altera® routine allows calculating all FFT coefficients simultaneously.

If the FFT coefficients are spread in time, the wavelet transform can also be calculated in a serial way (in a single clock cycle only a single pair of $\bar{X}_n$ is multiplied by a single pair of $\psi^*$); however, the product then depends strongly on the relative position of $\bar{X}_n$ and $\psi^*$. If the variables are shifted with respect to each other, even a strong signal may give a negligible final contribution. An additional procedure would be needed to tune the wavelet transform with respect to the Fourier transform of the ADC samples.

This problem is automatically solved if all Fourier coefficients are provided simultaneously in each clock cycle. A synchronous multiplication with the Fourier coefficients of the wavelets then gives the required power estimation independently of any relative configuration of these variables. The Fourier coefficients of the selected wavelets are fixed, and a sliding window of N ADC samples gives all Fourier coefficients in each clock cycle. This assures that, for some set of samples (if a signal appears), the product of both transforms gives a significant contribution and can be used as a trigger.

The radio signal is spread over a time interval of the order of a couple of hundred nanoseconds; most of the registered samples span a time interval below 200 ns. The frequency window in the atmosphere in which the signal suppression is at an acceptable level (the atmosphere is relatively transparent) is ca. 30-80 MHz. According to the Nyquist theorem, the sampling frequency should be at least twice the maximal frequency in the investigated spectrum. The anti-aliasing filter should have a cut-off frequency of ca. 85 MHz. Taking into account some width of the transition range of the filter (from pass-band to stop-band), the final sampling frequency should not be lower than 180 MHz (200 MHz in our considerations). This frequency corresponds to 5 ns between rising edges of the clock.


The interval of 160 ns (estimated as a sufficient time interval for the radio signals) requires a 32-point Fourier transform calculated in each clock cycle.

#### **5. General algorithm**

Let us consider a DFT *X*¯ of dimension N

$$X\_k = \sum\_{n=0}^{N-1} x\_n \mathcal{W}^{nk} \qquad \text{where} \qquad \mathcal{W} = e^{-2i\pi/N} \qquad \text{and} \qquad k = 0, \ldots N-1 \tag{12}$$

If N is the product of two factors, $N = N_1 N_2$, the indices $n$ and $k$ can be redefined as follows: $n = N_1 n_2 + n_1$, where $n_2 = 0,\ldots,N_2-1$ and $n_1 = 0,\ldots,N_1-1$; and $k = N_2 k_1 + k_2$, where $k_2 = 0,\ldots,N_2-1$ and $k_1 = 0,\ldots,N_1-1$. Then

$$\bar{X}_{N_2 k_1 + k_2} = \sum_{n_1=0}^{N_1-1} W^{N_2 n_1 k_1} W^{n_1 k_2} \times \sum_{m_2=0}^{N_2-1} x_{N_1 m_2 + n_1} W^{N_1 m_2 k_2} \tag{13}$$

For the Radix-2 algorithm: N = 2*<sup>t</sup>* , *N*<sup>1</sup> = 2 and *N*<sup>2</sup> = 2*t*−<sup>1</sup> = N/2 . Hence,

$$X\_k = \sum\_{n=0}^{N/2-1} (x\_{2n} + \mathcal{W}^k x\_{2n+1}) \mathcal{W}^{2nk} \tag{14}$$

If we split the sum as follows

$$\bar{X}_k = \sum_{n=0}^{N/4-1} x_{2n} W^{2nk} + \sum_{n=N/4}^{N/2-1} x_{2n} W^{2nk} + W^k \left( \sum_{n=0}^{N/4-1} x_{2n+1} W^{2nk} + \sum_{n=N/4}^{N/2-1} x_{2n+1} W^{2nk} \right) \tag{15}$$

and afterwards, if we redefine indices and group the sums, we get

$$\bar{X}_k = \sum_{n=0}^{N/4-1} (x_{2n} + (-1)^k x_{2(n+N/4)}) W^{2nk} + W^k \sum_{n=0}^{N/4-1} (x_{2n+1} + (-1)^k x_{2(n+N/4)+1}) W^{2nk} \tag{16}$$

We can introduce the new set of variables defined for n = 0,. . . ,N/4-1 as follows:

$$A\_{2n} = \mathbf{x}\_{2n} + \mathbf{x}\_{2n+N/2} \qquad A\_{2n+1} \qquad = \mathbf{x}\_{2n+1} + \mathbf{x}\_{2n+1+N/2} \tag{17}$$

$$A\_{2n+N/2} = \mathbf{x}\_{2n} - \mathbf{x}\_{2n+N/2} \qquad A\_{2n+1+N/2} = \mathbf{x}\_{2n+1} - \mathbf{x}\_{2n+1+N/2} \tag{18}$$

we get


$$X\_k = \sum\_{n=0}^{N/4-1} (A\_{2n} + \mathcal{W}^k A\_{2n+1}) \mathcal{W}^{2nk} \tag{19}$$

$$X\_k = \sum\_{n=0}^{N/4-1} (A\_{2n+N/2} + \mathcal{W}^k A\_{2n+1+N/2}) \mathcal{W}^{2nk} \tag{20}$$

for k even and odd respectively.
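The regrouping (17)-(20) is easy to verify numerically; a short sketch assuming numpy's FFT convention $W = e^{-2\pi i/N}$, which matches eq. (12):

```python
import numpy as np

# Numerical check of eqs. (17)-(20) against the full DFT.
N = 32
x = np.random.default_rng(1).integers(0, 4096, N).astype(float)  # 12-bit-like samples

A = np.empty(N)
for n in range(N // 4):
    A[2 * n] = x[2 * n] + x[2 * n + N // 2]                  # eq. (17)
    A[2 * n + 1] = x[2 * n + 1] + x[2 * n + 1 + N // 2]
    A[2 * n + N // 2] = x[2 * n] - x[2 * n + N // 2]         # eq. (18)
    A[2 * n + 1 + N // 2] = x[2 * n + 1] - x[2 * n + 1 + N // 2]

W = np.exp(-2j * np.pi / N)
X = np.empty(N, dtype=complex)
n = np.arange(N // 4)
for k in range(N):
    if k % 2 == 0:                                            # eq. (19)
        X[k] = np.sum((A[2 * n] + W**k * A[2 * n + 1]) * W**(2 * n * k))
    else:                                                     # eq. (20)
        X[k] = np.sum((A[2 * n + N // 2] + W**k * A[2 * n + 1 + N // 2]) * W**(2 * n * k))

assert np.allclose(X, np.fft.fft(x))                          # matches the full DFT
```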

The $x_n$ represent signals in the time domain. They are easily available from the outputs of shift registers clocked synchronously with the ADC. The DFT coefficients $\bar{X}_k$ can be expressed by the new set of variables $A_m$. Because the $A_m$ are simple linear combinations of the $x_m$, they can be calculated by typical adders (eq. (17)) and subtractors (eq. (18)) in a single clock cycle. The input values $x_n$ are real and positive, since they represent the signal in real time.

The DFT coefficients of a real input simplify additionally due to the following symmetry:

$$\operatorname{Re}(\tilde{X}\_k) = +\operatorname{Re}(\tilde{X}\_{N-k}) \qquad \operatorname{Im}(\tilde{X}\_k) = -\operatorname{Im}(\tilde{X}\_{N-k}) \tag{21}$$

The Radix-2 algorithm allows a regrouping of the input elements in the DFT expression in order to utilize some symmetries of the Fourier coefficients. In a single step of the Radix-2 algorithm we can redefine a "new" set of variables by some mathematical expression of the "old" ones. This step corresponds to an elementary process in the pipeline chain. The redefinition of variables in eq. (17) corresponds to the 1*st* stage of the pipeline. Splitting the sum (14) reduces the set of coefficients $W^k$ from $0,\ldots,N-1$ for the input $x_n$ to $0,\ldots,\frac{N}{2}-1$ for (19)-(20). The 1*st* stage utilizes the feature of the twiddle factor related to the 1*st* stage of the pipeline:

$$W\_A = W^{N/2} = e^{-i\pi} = -1\tag{22}$$

So, the 1*st* stage can be implemented in a very simple way. The implementation of the multi-point algorithm requires multiple pipeline stages and, apart from adders and subtractors, also requires multipliers, which correspond to the $W^k$ coefficients relating to the fractional "angle" $e^{-2ik\pi/N}$. The Radix-2 algorithm used in the next stage again reduces the abundance of $W^k$ coefficients due to the twiddle factor related to the 2*nd* stage of the pipeline:

$$\mathcal{W}\_{\mathcal{B}} = \mathcal{W}^{N/4} = e^{-i\pi/2} = -i \tag{23}$$

The $W_B$ suggests a similar splitting structure in the 2*nd* pipeline stage as in the 1*st* one (the minus sign in (23) as in (22)); however, the imaginary unit imposes the DFT calculation separately for the real and imaginary parts. If we split the sum in (19) similarly as in (15), we get for k = 0,2,. . . ,N−2


$$\bar{X}_k = \sum_{n=0}^{N/8-1} \left[ (A_{2n} + (-i)^k A_{2n+N/4}) W^{2nk} + (A_{2n+1} + (-i)^k A_{2n+1+N/4}) W^{(2n+1)k} \right] \tag{24}$$

Let us consider separately two subsets of the even indices: k = 4p and k = 4p+2 (p = 0,. . . ,N/4−1)

$$X\_{4p} = \sum\_{n=0}^{N/8-1} \left[ (A\_{2n} + A\_{2n+N/4}) \mathcal{W}^{8np} + (A\_{2n+1} + A\_{2n+1+N/4}) \mathcal{W}^{(2n+1)4p} \right] \tag{25}$$

Notice that *X*¯ <sup>0</sup> and *X*¯ *<sup>N</sup>*/2 are real.

$$\bar{X}_{4p+2} = \sum_{n=0}^{N/8-1} \left[ (A_{2n} - A_{2n+N/4}) W^{2n(4p+2)} + (A_{2n+1} - A_{2n+1+N/4}) W^{(2n+1)(4p+2)} \right] \tag{26}$$

If we introduce new variables

$$B\_{2n} = A\_{2n} + A\_{2n+N/4} \qquad B\_{2n+1} \qquad = A\_{2n+1} + A\_{2n+1+N/4} \tag{27}$$

$$B\_{2n+N/2} = A\_{2n} - A\_{2n+N/4} \qquad B\_{2n+1+N/2} = A\_{2n+1} - A\_{2n+1+N/4} \tag{28}$$

we get

$$X\_{4p} = \sum\_{n=0}^{N/8-1} (B\_{2n} + B\_{2n+1} \mathcal{W}^{4p}) \mathcal{W}^{8np} \tag{29}$$

$$X\_{4p+2} = \sum\_{n=0}^{N/8-1} (\mathcal{B}\_{2n+N/4} + \mathcal{B}\_{2n+1+N/4} \mathcal{W}^{4p+2}) \mathcal{W}^{8np+4n} \tag{30}$$
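A corresponding numpy check of eqs. (27) and (29): after this second regrouping, each coefficient $\bar{X}_{4p}$ needs only an N/8-term sum over the B variables (same FFT convention as the previous sketch):

```python
import numpy as np

N = 32
x = np.random.default_rng(2).normal(size=N)
W = np.exp(-2j * np.pi / N)

m = np.arange(N // 2)
A = x[m] + x[m + N // 2]                      # eq. (17): the even-index path
n = np.arange(N // 8)
B_even = A[2 * n] + A[2 * n + N // 4]         # eq. (27): B_{2n}
B_odd = A[2 * n + 1] + A[2 * n + 1 + N // 4]  # eq. (27): B_{2n+1}

X = np.fft.fft(x)
for p in range(N // 4):
    X4p = np.sum((B_even + B_odd * W**(4 * p)) * W**(8 * n * p))   # eq. (29)
    assert np.isclose(X4p, X[4 * p])
```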

However, repeating the above procedure for the odd indices related to eq. (20) gives more complicated formulas, which cannot be simplified due to the complex coefficients $W^{4(n+p)}$ (eq. (31)).

$$X\_{4p+q} = \sum\_{n=0}^{N/8-1} \mathcal{W}^{2n(4p+q)} [(A\_{2n+N/2} + A\_{2n+1+N/2} \mathcal{W}^{4p+q}) \mp j(A\_{2n+3N/4} + A\_{2n+1+3N/4} \mathcal{W}^{4p+q})] \tag{31}$$

where ∓ corresponds to q = 1,3, respectively. A further simplification is possible due to the symmetries of the trigonometric functions. However, the general considerations give relatively complicated formulas, which seem to be unnecessary here.

#### **6. 16-point algorithm**


For N = 16 and odd indices we get

$$\bar{X}_{4p+q} = (A_8 \mp jA_{12}) + (i)^p (A_9 (-1)^p W^q - iA_{15} W^{4-q}) + (-i)^{\frac{q-1}{2}} (-1)^p W^2 (A_{10} \mp iA_{14}) \pm (i)^p (A_{11} W^{4-q} - iA_{13} (-1)^p W^q) \tag{32}$$

Since $W^4 = -i$, all coefficients can be expressed as linear combinations of the complex base $W^1$, $W^2$, $W^3$:

$$W^1 = e^{-i\frac{\pi}{8}} = \cos(\tfrac{\pi}{8}) - i \cdot \sin(\tfrac{\pi}{8}) = \alpha - i \cdot \beta \tag{33}$$

$$W^2 = e^{-i\frac{\pi}{4}} = \cos(\tfrac{\pi}{4}) - i \cdot \sin(\tfrac{\pi}{4}) = \gamma(1 - i) \tag{34}$$

$$W^3 = e^{-i\frac{3\pi}{8}} = \cos(\tfrac{3\pi}{8}) - i \cdot \sin(\tfrac{3\pi}{8}) = \beta - i \cdot \alpha \tag{35}$$

Symmetries in (33-35) allow the following simplification. Notice that

$$\mathcal{W}^2(A\_{10} \mp iA\_{14}) = \gamma(A\_{10} \mp A\_{14})(1 - i) \tag{36}$$

$$A_9 (-1)^p W^q - iA_{15} W^{4-q} = X[(-1)^p A_9 - A_{15}] - iY[(-1)^p A_9 + A_{15}] \tag{37}$$

$$A_{11} W^q - iA_{13} (-1)^p W^{4-q} = Y[A_{11} - (-1)^p A_{13}] - iX[A_{11} + (-1)^p A_{13}] \tag{38}$$

where (X, Y) = (*α*, *β*) for q = 1 and (*β*, *α*) for q = 3.

We can extend the set of variables (27 − 28) also to odd indices of *X*¯

$$B_{8,12} = A_{8,12} \qquad B_{9,15} = A_9 \pm A_{15} \qquad B_{10,14} = A_{10} \pm A_{14} \qquad B_{11,13} = A_{11} \pm A_{13} \tag{39}$$

Formulae (39) show that the entire 2*nd* pipeline stage can also be built from only adders and subtractors. The signals $A_{8,12}$ have to be delayed in parallel shift registers in order to assure synchronization with the adjacent ones.

For N = 16 the DFT coefficients can be expressed by the *Bn* variables as follows

$$\operatorname{Re}(\tilde{X}\_0) = \quad B\_0 + B\_1 + B\_2 + B\_3 \quad \quad \operatorname{Re}(\tilde{X}\_8) = B\_0 - B\_1 + B\_2 - B\_3 \tag{40}$$

$$\operatorname{Re}(X\_4) = \quad B\_0 - B\_2 \tag{41}$$

$$\operatorname{Re}(\tilde{X}\_2) = \quad B\_4 + \gamma \cdot (B\_5 - B\_7) \quad \quad \quad \operatorname{Re}(\tilde{X}\_6) = B\_4 - \gamma \cdot (B\_5 - B\_7) \tag{42}$$

$$Im(\tilde{X}\_2) = -B\_6 - \gamma \cdot (B\_5 + B\_7) \qquad \qquad Im(\tilde{X}\_6) = B\_6 - \gamma \cdot (B\_5 + B\_7) \tag{43}$$
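These relations can be verified directly; the sketch below assumes the difference variables $B_4,\ldots,B_7$ are indexed as used in eq. (30), i.e. $B_{2n+N/4} = A_{2n} - A_{2n+N/4}$, which is what (42)-(43) require:

```python
import numpy as np

# Numerical sanity check of eqs. (40)-(43) for N = 16.
N = 16
x = np.random.default_rng(3).normal(size=N)
gamma = np.cos(np.pi / 4)

m = np.arange(N // 2)
A = x[m] + x[m + N // 2]                              # eq. (17), even-index path
n = np.arange(N // 8)                                 # n = 0, 1
B = np.empty(N // 2)
B[2 * n] = A[2 * n] + A[2 * n + N // 4]               # eq. (27): B0, B2
B[2 * n + 1] = A[2 * n + 1] + A[2 * n + 1 + N // 4]   # B1, B3
B[2 * n + 4] = A[2 * n] - A[2 * n + N // 4]           # B4, B6 (differences)
B[2 * n + 5] = A[2 * n + 1] - A[2 * n + 1 + N // 4]   # B5, B7

X = np.fft.fft(x)
assert np.isclose(X[0].real, B[0] + B[1] + B[2] + B[3])       # eq. (40)
assert np.isclose(X[8].real, B[0] - B[1] + B[2] - B[3])       # eq. (40)
assert np.isclose(X[4].real, B[0] - B[2])                     # eq. (41)
assert np.isclose(X[2].real, B[4] + gamma * (B[5] - B[7]))    # eq. (42)
assert np.isclose(X[6].real, B[4] - gamma * (B[5] - B[7]))    # eq. (42)
assert np.isclose(X[2].imag, -B[6] - gamma * (B[5] + B[7]))   # eq. (43)
assert np.isclose(X[6].imag, B[6] - gamma * (B[5] + B[7]))    # eq. (43)
```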

The coefficients with odd indices are expressed by the following formulae:

$$\begin{aligned} Re(\bar{X}_{1,7}) &= B_8 \pm \alpha B_{15} + \gamma B_{14} \pm \beta B_{13} \qquad & Re(\bar{X}_{3,5}) &= B_8 \pm \beta B_{15} - \gamma B_{14} \mp \alpha B_{13} \qquad &\text{(44)} \\ Im(\bar{X}_{1,7}) &= \mp B_{12} - \beta B_9 \mp \gamma B_{10} - \alpha B_{11} \qquad & Im(\bar{X}_{3,5}) &= \pm B_{12} - \alpha B_9 \mp \gamma B_{10} + \beta B_{11} \qquad &\text{(45)} \end{aligned}$$

The next, 3*rd* pipeline stage requires the implementation of 10 multipliers calculating the products in (44)-(45), 3 adders, 3 subtractors and 4 shift registers, according to the following formulae:

$$\mathbf{C}\_0 = B\_0 + B\_2 \qquad \mathbf{C}\_1 = B\_1 + B\_3 \qquad \mathbf{C}\_2 = B\_0 - B\_2 \qquad \mathbf{C}\_3 = B\_1 - B\_3 \tag{46}$$

$$\mathbf{C\_{5}} = B\_{5} + B\_{7} \qquad \mathbf{C\_{7}} = B\_{5} - B\_{7} \tag{47}$$

$$\mathsf{C}\_{4} = B\_{4} \qquad \qquad \mathsf{C}\_{6} = B\_{6} \qquad \qquad \qquad \mathsf{C}\_{8} = B\_{8} \qquad \qquad \mathsf{C}\_{12} = B\_{12} \tag{48}$$

$$C_{9A} = \alpha \cdot B_9 \qquad C_{11A} = \alpha \cdot B_{11} \qquad C_{13A} = \alpha \cdot B_{13} \qquad C_{15A} = \alpha \cdot B_{15} \tag{49}$$

$$\mathbf{C\_{9B}} = \boldsymbol{\beta} \cdot \mathbf{B\_{9}} \qquad \mathbf{C\_{11B}} = \boldsymbol{\beta} \cdot \mathbf{B\_{11}} \qquad \mathbf{C\_{13B}} = \boldsymbol{\beta} \cdot \mathbf{B\_{13}} \qquad \mathbf{C\_{15B}} = \boldsymbol{\beta} \cdot \mathbf{B\_{15}} \tag{50}$$

$$\mathbf{C}\_{10} = \boldsymbol{\gamma} \cdot \mathbf{B}\_{10} \qquad \mathbf{C}\_{14} = \boldsymbol{\gamma} \cdot \mathbf{B}\_{14} \tag{51}$$


The 4*th* stage utilizes 2 multipliers, 5 adders, 5 subtractors and 4 shift registers:

$$D\_{5,7} = \gamma \cdot \mathbb{C}\_{5,7} \qquad D\_{0,1} = \mathbb{C}\_{0} \pm \mathbb{C}\_{1} \qquad D\_{8,14} = \mathbb{C}\_{8} \pm \mathbb{C}\_{14} \qquad D\_{10,12} = \mathbb{C}\_{10} \pm \mathbb{C}\_{12} \tag{52}$$

$$D\_9 = \mathcal{C}\_{9A} - \mathcal{C}\_{11B} \quad D\_{11} = \mathcal{C}\_{11A} + \mathcal{C}\_{9B} \quad D\_{15} = \mathcal{C}\_{13A} - \mathcal{C}\_{15B} \qquad D\_{13} = \mathcal{C}\_{13B} + \mathcal{C}\_{15A} \tag{53}$$

$$D\_{2,3,4,6} = C\_{2,3,4,6} \tag{54}$$

Finally, the set of DFT coefficients $\bar{X}_k$ is calculated in the 5*th* stage by 6 adders and 6 subtractors supported by 4 shift registers:

$$\begin{aligned} Re\bar{X}_{0,4,8} &= D_{0,2,1} & Re\bar{X}_{1,7} &= D_8 \pm D_{15} & Re\bar{X}_{2,6} &= D_4 \pm D_7 & Re\bar{X}_{3,5} &= D_{14} \mp D_{13} \qquad &\text{(55)} \\ Im\bar{X}_4 &= -D_3 & Im\bar{X}_{1,7} &= \mp D_{10} - D_{11} & Im\bar{X}_{2,6} &= \mp D_6 - D_5 & Im\bar{X}_{3,5} &= \mp D_{12} - D_9 \qquad &\text{(56)} \end{aligned}$$

Figure 9 shows the internal structure of the 16-point FFT algorithm. As shown later (compare Figure 15), this algorithm is more highly optimized than a pure DiF approach.

**Figure 9.** A global pipeline internal structure of FFT\_16 [11].

The algorithm with the 16-point FFT was tested on the 3*rd* generation of the Auger surface detector Front-End Board (Figure 10) [9], [10]. The 1*st* [12] and 2*nd* [13] generations of the Front-End Boards could not support the FFT algorithms due to a lack of FPGA resources. However, the FFT algorithm seems to be less efficient than the DCT approach. The DCT algorithm implemented in the 4*th*-generation Front-End with the CycloneIII® EP3C40F324C7 (Figure 11) successfully passed field tests, recognizing short peaks with exponentially attenuated tails, characteristic of signals generated by very inclined showers.

**Figure 10.** The 3*rd* generation of the Front-End Board with the Cyclone® FPGA EP1C12Q240I7, used in more than 800 surface detectors in the Pierre Auger Observatory on the Argentinean pampas. The EP1C12Q240I7 does not contain DSP blocks; the multipliers had to be implemented from logic elements according to the scheme in Figure 9.

**Figure 11.** The 4*th* generation of the Front-End Board with the CycloneIII® FPGA EP3C40F324I7. The EP3C40F324I7 contains DSP blocks, and it is possible to implement even a sophisticated algorithm like DCT engines for the recognition of horizontal or very inclined showers. This board has also been used for preliminary testing of the wavelet trigger and of the signal filtering based on the chain FFT + median filter + iFFT.

#### **7. 32-point FFT algorithm**


For the 32-point Discrete Fourier Transform $\bar{X}$

$$X\_{k=0,\ldots,31} = \sum\_{n=0}^{31} \varkappa\_n e^{-i\pi kn/16} \tag{57}$$

where the $x_n$, as samples from an ADC chip, are real. Formula (57) can be split into two or more parts by rearranging the sum and the indices. The standard approaches to such a simplification are the Radix-2 Decimation-in-Time (DiT) (Figure 1a) and Decimation-in-Frequency (DiF) (Figure 1b) algorithms.

For the Radix-2 DiT, we get formula (3). An N-point DFT can easily be split into two N/2-point transforms. The outputs of the DFT procedures are complex, so the calculation of the final DFT coefficients using the DiT algorithm requires complex multiplications for the final merging of the data from the parallel lower-order DFT procedures, i.e. the multiplication of the twiddle factors $W_N^k$:


$$\mathcal{W}\_N^k = e^{-i\frac{2\pi k}{N}}\tag{58}$$

by $G[k]$ and $H[k]$ in Figure 1. Altera® provides a library routine for complex multiplication in the FPGA (Figure 12a); however, e.g. a 16×16-bit operation requires 6 embedded DSP 9×9 multipliers even in the most economical (canonical) mode. Generally, complex multiplication in the FPGA is rather resource-hungry and, if possible, it should be replaced by multiplications of real variables.
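One standard way to do this (a generic identity, not a feature of the Altera® library): a complex product can be formed with 3 real multiplications and 5 additions/subtractions instead of 4 multiplications and 2 additions:

```python
# (a + ib)(c + id) with 3 real multipliers - often a good trade in
# multiplier-poor FPGAs, at the cost of extra adders/subtractors.
def cmul3(a, b, c, d):
    k1 = c * (a + b)          # three real multiplications ...
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2   # (real, imag) ... five additions/subtractions

# check: (1 + 2j)(3 + 4j) = -5 + 10j
assert cmul3(1, 2, 3, 4) == (-5, 10)
```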

**Figure 12.** The ALTMULT\_COMPLEX and ALTMULT\_ADD procedures provided by Altera®. For a calculation of |*Wk* | <sup>2</sup>, dataa\_0 = datab\_0 and dataa\_1 = datab\_1. The ALTMULT\_ADD routine requires 4 DSP 9 × 9 multipliers. It is used in E\_bin pipeline stage for odd FFT indices (Figure 17). Inputs *dataa*\_0, 1 are used for *Ck* , *datab*\_0, 1 for constants *α*, *β*, *ξ*, *η*, *σ* and *ρ*. The routine requires two clock cycles. Sub-products are registered in MULT0 and MULT1 DSP blocks, respectively. Thus, the sum appears in the next register stage.

For the Radix-2 DiF, we get formula (4). The standard Radix-2 Decimation-in-Frequency (DiF) algorithm rearranges the DFT equation (57) into two parts: the computation of the even-numbered discrete-frequency indices $\bar{X}(k)$ for k=[0,2,4,. . .,30] and the computation of the odd-numbered indices k=[1,3,5,. . .,31]. This corresponds to splitting the N-point DFT into two N/2-point routines. The first corresponding twiddle factor is $e^{-i\frac{2\pi}{N}\frac{N}{2}} = -1$. The first operations are simple sums and differences of real variables (see Figure 1b). Each operation related to the consecutive twiddle factor is performed in a single clock cycle.

The algorithm of Decimation in Frequency used for the 32-point DFT allows splitting eq. (57) as follows:

$$\bar{X}_{k=2p} = \sum_{n=0}^{15} A_n e^{-i\pi p n / 8} \quad \Rightarrow \quad FFT16_{even} \tag{59}$$

$$X\_{k=2p+1} = \sum\_{n=0}^{15} A\_{n+16} e^{-i\pi(2p+1)n/16} \tag{60}$$

$$A_n = x_n + x_{n+16} \qquad A_{n+16} = x_n - x_{n+16} \qquad n = 0, 1, \ldots, 15 \tag{61}$$
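A numerical illustration of this split (a numpy sketch; in the FPGA the sums and differences of eq. (61) are formed by adders feeding the FFT16 blocks):

```python
import numpy as np

# Even 32-point coefficients = 16-point FFT of the sums A_n; odd coefficients
# = 16-point FFT of the differences pre-rotated by the eq. (60) kernel.
x = np.random.default_rng(4).normal(size=32)
A_sum = x[:16] + x[16:]                      # eq. (61): A_n
A_diff = x[:16] - x[16:]                     # eq. (61): A_{n+16}

X = np.fft.fft(x)
assert np.allclose(X[0::2], np.fft.fft(A_sum))            # eq. (59)

tw = np.exp(-1j * np.pi * np.arange(16) / 16)             # eq. (60) kernel
assert np.allclose(X[1::2], np.fft.fft(A_diff * tw))      # eq. (60)
```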

The next twiddle factors are:

$$W_B = e^{-i\pi/2} = -i \qquad W_C = e^{-i\pi/4} = \gamma(1 - i) \qquad W_D = e^{-i\pi/8} = \alpha - i\beta \tag{62}$$

$$W_E = e^{-i\pi/16} = \xi - i\eta \qquad W_F = e^{-3i\pi/16} = \sigma - i\rho \tag{63}$$


$$\gamma = \cos(\pi/4) \qquad \alpha = \cos(\pi/8) \qquad \xi = \cos(\pi/16) \qquad \sigma = \cos(3\pi/16) \tag{64}$$

$$\beta = \sin(\pi/8) \qquad \eta = \sin(\pi/16) \qquad \rho = \sin(3\pi/16) \qquad \text{(65)}$$

The scheme developed from the pure Radix-2 Decimation-in-Frequency algorithm is presented in Figure 15. The algorithm takes into account only the FFT coefficients with indices k = 0,...,15. Due to the real input data ($x_{0,\ldots,31}$), the higher FFT coefficients have the well-known symmetry: $Re\bar{X}_{32-n} = Re\bar{X}_n$ and $Im\bar{X}_{32-n} = -Im\bar{X}_n$ ($n > 0$). The calculation of $\bar{X}_{0,\ldots,15}$ according to the pure Radix-2 DiF algorithm requires 8 pipeline stages. For $\bar{X}_{0,4,8,12,16}$, 2 pipeline stages are necessary only for synchronization.

According to eq. (59), all $\bar{X}_{0,2,4,\ldots,14}$ with even indices can be calculated by the algorithm presented in [11]. The variables $x_n$ in Figure 2 of [11] were replaced by the variables $A_n$ according to eq. (61). The application of the modified algorithm reduces the amount of 9 × 9 multipliers from 12 to 10 and shortens the pipeline chain by 2 stages (the last 2 stages are simple registers for synchronization) (see Figure 16).

Let us notice that for the odd indices the stages *B* and *C* for k = 16,...,19 and k = 24,...,27 are pure delay lines, while for the neighboring indices k = 20,...,23 and k = 28,...,31 mathematical operations are performed in a cascade. Let us multiply $A_{16,\ldots,19}$ and $A_{24,\ldots,27}$ by the factor $\lambda = \gamma^{-1}$. Then, to adjust the variables in the *C* stage for the odd FFT coefficients (for k = 20,21,22,23 and k = 28,29,30,31),

$$C_k = \lambda \times \gamma \times B_k = B_k \tag{66}$$

Thus, by such a redefinition, the *C* stage for the odd FFT indices is a pure pipeline stage. It can be removed together with one of the pipeline stages for the even FFT indices. In order to come back to the correct values, the coefficients in the *F* stage can simply be redefined:

$$\alpha' = \gamma \times \alpha \qquad \beta' = \gamma \times \beta \qquad \xi' = \gamma \times \xi \qquad \eta' = \gamma \times \eta \qquad \sigma' = \gamma \times \sigma \qquad \rho' = \gamma \times \rho \tag{67}$$

but for the indices k = 16, 20, 24 and 28 we have to use 4 additional multipliers. Nevertheless, at this cost we save one pipeline stage and, depending on the width of the buses for the final FFT coefficients, we save at least 1000 logic elements.
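To make the bookkeeping explicit (this only restates eqs. (66)-(67), under the assumption, implied by eq. (66), that the *C* stage multiplies these paths by $\gamma$):

$$\tilde{A}\_k = \lambda A\_k \;\Rightarrow\; \tilde{C}\_k = \gamma \lambda B\_k = B\_k \qquad (\lambda = \gamma^{-1},\ \lambda\gamma = 1),$$

$$F\ \text{stage:} \quad c \mapsto c' = \gamma \times c, \qquad c \in \{\alpha, \beta, \xi, \eta, \sigma, \rho\}.$$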

We can save the next pipeline stage and a further ca. 1000 logic elements, but again at the cost of additional multipliers. The algorithm used for the indices k = 2, 6, 10, 14 is neither Decimation in Time nor Decimation in Frequency. Eq. (60) can be rewritten as follows:

$$\begin{aligned} \bar{X}\_{k=2p+1} &= A\_{16} + A\_{24} + \sum\_{n=1}^{7} (\cos \phi B\_{n+24} - i \sin \phi B\_{n+16}) & \phi &= \frac{\pi (2p+1)n}{16} \\ B\_{16,24} &= A\_{16,24} \\ B\_{n+16} &= A\_{n+16} + A\_{32-n} & n = 1, \ldots, 7 \\ B\_{n+24} &= A\_{n+16} - A\_{32-n} & n = 1, \ldots, 7 \end{aligned} \tag{68}$$

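For reference, eq. (68) transcribes directly into a bit-exact software model, a common step before HDL coding; the sketch below assumes $A\_{16},\ldots,A\_{31}$ hold the real outputs of the preceding butterfly stage (the function name is illustrative):

```c
#include <complex.h>
#include <math.h>

/* Direct transcription of eq. (68): returns X_{2p+1} for p = 0..7,
   assuming A[16..31] are the real outputs of the preceding stage. */
double complex odd_coefficient(const double A[32], int p)
{
    const double PI = 3.14159265358979323846;
    double B[32];

    B[16] = A[16];
    B[24] = A[24];
    for (int n = 1; n <= 7; ++n) {
        B[n + 16] = A[n + 16] + A[32 - n];
        B[n + 24] = A[n + 16] - A[32 - n];
    }

    double complex X = A[16] + A[24];
    for (int n = 1; n <= 7; ++n) {
        double phi = PI * (2 * p + 1) * n / 16.0;
        X += cos(phi) * B[n + 24] - I * sin(phi) * B[n + 16];
    }
    return X;
}
```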
Developing the algorithm further according to eq. (22) would allow removing the next pipeline stage, but unfortunately at the cost of an additional 16 ALTMULT\_ADD routines (64 DSP blocks) (see Figure 12b).

If speed is not a factor, the sums of products in the E\_bin routine can be performed in a single clock cycle instead of two cycles, as shown in Figure 17. Thus, the $D\_{16,20,24,28}$ shift registers are no longer necessary and can be removed. The shorter chain for the odd indices also allows removing the last pipeline stage for the even indices, saving in total more than 1000 logic elements without the cost of additional multipliers. However, we should be aware that the registered performance significantly decreases, from ca. 220 MHz to only 158 MHz for the EP3C120F780C7.

#### **8. Wavelet power calculation**


The reference wavelets are real; however, their Fourier transforms are complex. An elementary product in eq. (11) is a product of two complex numbers: a Fourier coefficient of the data and a Fourier coefficient of a reference wavelet. The simplest way is to use the Altera® routine from Figure 12. However, since the wavelet Fourier coefficients are predefined constants, and since we are ultimately going to calculate the modulus of the complex product, with $|W \times \Psi|^2 = |W|^2 \times |\Psi|^2$, we can calculate only $|W|^2$ and then multiply it, as a real number, by the real constant $|\Psi|^2$.
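A behavioral sketch of this factorization (a software model with illustrative names and floating-point arithmetic; the FPGA implementation uses fixed-point DSP blocks):

```c
#define NBINS 16  /* independent bins of a 32-point real-input FFT, k = 0..15 */

/* power[k] = |W_k|^2 * |Psi_k|^2; psi_sq holds the precomputed real
   constants |Psi_k|^2 of one reference wavelet, so no complex multiply
   and no square root are needed per bin. */
void wavelet_power(const double re[NBINS], const double im[NBINS],
                   const double psi_sq[NBINS], double power[NBINS])
{
    for (int k = 0; k < NBINS; ++k) {
        double w_sq = re[k] * re[k] + im[k] * im[k];  /* |W_k|^2 */
        power[k] = w_sq * psi_sq[k];
    }
}
```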

The FFT32 routine from Figure 17 utilizes 96 DSP 9 × 9 multipliers. For the calculation of $|W\_k|^2$, the ALTMULT\_ADD routine utilizes 4 DSP 9 × 9 multipliers for each index k, 60 in total ($|W\_0|$ is trivial). The $|W\_k|^2 \times |\Psi\_k|^2$ products use a further 30 DSP 9 × 9 multipliers.

This algorithm can be implemented only in very powerful modern FPGA chips. The FPGA families ACEX® and Cyclone®, currently used in the surface detectors, do not contain DSP blocks at all. Even the CycloneIII® EP3C40F324I7 [14] used for the DCT trigger tests ([15], [16]) does not contain a sufficient number of DSP blocks to implement the wavelet trigger.

The biggest FPGAs from the CycloneIII® EP3C120F780C7 (Figure 13) and CycloneIV® EP4CE115F29C7 (Figure 14) families, with 576 and 532 DSP multipliers, respectively, allow the simultaneous implementation of the FFT32 routine (96 DSP blocks) + the "Module" block (60 DSP blocks) + 14 or 11 "engines" (30 DSP blocks each), i.e. a power estimation for 14 or 11 various reference wavelets, respectively.
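The engine counts follow from simple bookkeeping: 96 multipliers for FFT32, 60 for the "Module" block and 30 per engine. A hypothetical helper illustrating the estimate; the plain division is only an upper bound, and fitter or multiplier-pairing constraints may lower it, which may explain why the EP4CE115 accommodates 11 rather than 12 engines:

```c
#include <stdio.h>

/* Upper bound on the number of 30-multiplier wavelet engines that fit
   next to FFT32 (96) and the "Module" block (60); illustrative only. */
int max_engines(int total_9x9)
{
    const int fft32 = 96, module = 60, per_engine = 30;
    int left = total_9x9 - fft32 - module;
    return left > 0 ? left / per_engine : 0;
}

int main(void)
{
    printf("EP3C120F780C7 (576 DSP 9x9): %d engines\n", max_engines(576)); /* 14 */
    printf("EP4CE115F29C7 (532 DSP 9x9): %d engines\n", max_engines(532)); /* 12 by division */
    return 0;
}
```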

Table 2 shows results calculated and measured in the Altera® development kit DK-DSP-3C120N for various variants for the Cyclone® III EP3C120F780C7 (the heart of this development kit). The results do not fully agree with our expectations. A reduction of a single pipeline stage decreases the resource occupancy by ca. 410 (not 640) logic elements. This may be due to optimization processes performed by the Quartus® II compiler to achieve the maximal registered performance. Nevertheless, for all comparisons the speed of the "optimized" design is higher than that of the "pure DiF" one. For the development of the wavelet engines the "optimized" variant has been selected as potentially faster.


**Figure 13.** The test system based on a development kit with the Altera® CycloneIII® FPGA EP3C120F780C7, supported by two daughter boards: the AD/DA Data Conversion Card (left) with two ADCs (150 MHz sampling) and two DACs (250 MHz), as well as the Industrial Communication Board (ICB-HSMC) (right) allowing a connection via the galvanically isolated RS485 ports.

**Figure 14.** Test system based on a development kit with Altera® CycloneIV® EP4CE115F29C7 supported by ICB-HSMC daughter board.

**Figure 15.** The internal structure of the FFT32 FPGA procedure. The algorithm uses 14 single-clock-cycle multipliers (e.g. $F\_7 = \gamma D\_7$; each utilizes two 9 × 9 DSP multipliers) and 16 two-clock-cycle multipliers (e.g. $N\_7 = \beta G\_7 - \alpha H\_7$; each utilizes four 9 × 9 DSP multipliers). In total, the algorithm needs 92 9 × 9 DSP multipliers.


**Figure 16.** A modified structure for $\bar{X}\_{2,6,10,14}$ allowing a reduction of two 9 × 9 multipliers and shortening the pipeline chain by two stages (shift registers are still used for synchronization).

The Quartus® II compiler estimated the power consumption for the core, for the static mode and for the I/O sector. Where possible, the register outputs were multiplexed to reduce the number of output pins (all pins were routed to HSMC connectors on the development board). As expected, the I/O power increases approximately linearly with the number of used pins. The static power consumption is at a level of ∼100 mW, which is reasonable. In comparison, the Stratix® III chips have a huge static power consumption of ∼600 mW, which significantly limits their application in systems supplied from solar panels. The power consumption of the "optimized" variant is ∼35 mW higher than that of the "pure DiF" solution. The additional 35 mW is not a factor if it allows an improvement of the safety margin for the registered performance. The EP3C120F780C7 allows the implementation of 14 wavelet engines.


**Figure 17.** An optimized structure with a single pipeline stage removed at the cost of only 4 additional multipliers (8 DSP 9 × 9 blocks).


**Table 2.** Resource occupancy and power consumption for the Cyclone III FPGA EP3C120F780C7 for a 200 MHz PLL global clock.

| config | variant | logic elements | DSP | pins | fmax (MHz) | core power sim. (mW) | I/O power sim. (mW) | core power mea. (mW) | I/O power mea. (mW) |
|---|---|---|---|---|---|---|---|---|---|
| pure FFT32 | pure DiF | 4712 - 4% | 92 - 16% | 25 - 5% | 236 | 557 | 65 | 580 | 170 |
| pure FFT32 | optimized | 4301 - 4% | 96 - 16% | 25 - 5% | 241 | 589 | 65 | 588 | 170 |
| plus Module | pure DiF | 4990 - 4% | 152 - 26% | 25 - 5% | 245 | 750 | 68 | 779 | 170 |
| plus Module | opt | 4541 - 4% | 156 - 27% | 25 - 5% | 246 | 787 | 68 | 783 | 170 |
| 1 wavelet - 24-bit | opt | 4726 - 4% | 186 - 32% | 29 - 5% | 235 | 861 | 88 | 840 | 240 |
| 1 wavelet - 16-bit | opt | 4265 - 4% | 186 - 32% | 21 - 4% | 228 | 814 | 66 | 790 | 170 |
| 4 wavelets - 16-bit | opt | 5478 - 5% | 276 - 48% | 81 - 15% | 212 | 1134 | 215 | 1040 | 240 |
| 8 wavelets - 16-bit | opt | 5967 - 5% | 396 - 69% | 161 - 30% | 204 | 1591 | 413 | 1363 | 360 |
| 12 wavelets - 16-bit | opt | 7060 - 6% | 516 - 90% | 241 - 45% | 208 | 1980 | 612 | 1691 | 478 |


**Table 3.** Resource occupancy and timing for the Cyclone® IV and Cyclone® V FPGAs for a 200 MHz PLL global clock.

| Family | FPGA | config | logic elements | DSP | Fast slack (ns) | Slow Fmax at 0°C (MHz) | Slow Fmax at 85°C (MHz) |
|---|---|---|---|---|---|---|---|
| Cyclone IV | EP4CE115F29C7 | 12 wavelets | 7120 - 6% | 516 - 97% | 2.594 | 234 | 214 |
| Cyclone V | 5CGXFC7D6F31C6 | 12 wavelets | 6933 - 6% | 156 - 100% | 2.111 | **195** | **196** |
| Cyclone V | 5CGXFC7D6F31C6 | 4 wavelets | 3177 - 3% | 111 - 71% | 2.169 | 227 | 228 |

A design with 12 engines has been tested. The power consumption is at a level of ∼100-110 mW per wavelet engine, which gives ∼2 W in total for the design with 12 engines. This may be a challenge for an autonomous system supplied from solar panels.

Measurements of the power consumption for all considered variants show some discrepancies with the simulations. The measured power consumption for the core increases more slowly with additional wavelet engines than the simulations predict. The almost 300 mW lower power taken by the FPGA (in comparison to the simulations) for 12 engines gives optimistic predictions for future applications. The power consumption for the core seems to be overestimated by ca. 15% in the simulations. On the other hand, the power consumption for the I/O section is unpredictably much higher than in the simulations. However, the differences decrease with a higher number of active pins. This is actually not a problem: the I/O pins have been attached for tests only, and in real applications almost all variables are utilized as internal nodes. Power optimization is highly recommended.

Designs have also been implemented into the EP4CE115F29C7 from the Cyclone® IV family of Altera®, used in the development kit DE2-115 (Terasic). According to the Altera® specification, the power consumption of the Cyclone® IV family is 30% lower than that of the Cyclone® III one. However, the Terasic development kit does not contain any circuitry allowing a measurement of the power consumption on the board.

For the Cyclone® IV EP4CE115F29C7 the timing shows a pretty good safety margin.

#### **9. Spectral leakage**


For the serial FFT processing the input data have to be chopped into blocks to be processed by the FFT routine. If signal pulses are located close to the border of a block, aliasing occurs. It manifests itself as a spurious contribution at the opposite border of the block and in the neighboring block as well. This effect may cause spurious pulses and has to be eliminated. The problem can only be solved, without introducing dead time between the blocks, by using an overlapping routine; therefore the FFT engines have to be over-clocked. In practice, for 1024-sample blocks aliasing is reduced to a negligible level when two consecutive blocks overlap by 64 time bins [7]. For parallel data processing, when the full set of Fourier coefficients is available in each clock cycle, aliasing can be eliminated by selecting a subset of coefficients that is not significantly affected. If a reduced set of Fourier coefficients is taken for the data analysis, the number of wavelet engines can be increased for simultaneous analysis of more reference wavelets.
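A minimal host-side sketch of the overlapping scheme for the serial case (`process_block()` stands in for the FFT + trigger chain and is hypothetical; the block length and overlap follow the figures quoted above):

```c
#include <stddef.h>

#define BLOCK   1024
#define OVERLAP 64
#define STRIDE  (BLOCK - OVERLAP)   /* 960 new samples per FFT call */

void process_block(const short *samples);   /* hypothetical FFT + trigger chain */

/* Each block repeats the last OVERLAP samples of its predecessor, so a
   pulse near a block border is seen away from the border in at least one
   block. The FFT engine must therefore run BLOCK/STRIDE = 1024/960 times
   faster than real time, i.e. be over-clocked by about 6.7%. */
void run_overlapped(const short *stream, size_t n)
{
    for (size_t start = 0; start + BLOCK <= n; start += STRIDE)
        process_block(stream + start);
}
```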

#### **10. Design improvement**

The new Altera® FPGA family, Cyclone® V, provides the industry's lowest system cost and power, along with performance levels that make the device family ideal for high-volume applications. The total power consumption is reduced by up to 40% compared with the previous generation (Cyclone® IV).

The biggest FPGA from the Cyclone® V E family, the 5CEA9 (logic only, without the ARM-based hard processor system (HPS)), contains 342 variable-precision DSP blocks, equivalent to 684 DSP 18 × 18 multipliers (each DSP block can be configured as three 9 × 9, two 18 × 19, or one 27 × 27 multiplier). Assuming roughly that a single 18 × 18 multiplier is equivalent to two 9 × 9 ones, the 5CEA9 could implement FFT32 + 18 engines for 18 various reference wavelets. However, the 5CEA9 FPGA is not yet available even for compilation (latest Quartus® II version 12.0). An estimation for 12 wavelet engines in the 5CGXFC7 FPGA shows a scarcity of DSP blocks: fast multipliers are replaced by logic elements, which significantly reduces the registered performance for the slow models, below our requirements. Nevertheless, if all multiplications are implemented in the fast DSP blocks (see Table 3, Cyclone® V with 4 wavelet engines only), the timing is perfect. This allows anticipating perfect timing for the 5CEA9 chip as well. The expected total power reduction of 58% (a 30% reduction from Cyclone® III to Cyclone® IV and a further 40% from Cyclone® IV to Cyclone® V) gives estimates of 840 mW for 12 and 1260 mW for 18 wavelet engines, respectively. This is an acceptable level of power consumption for the supply systems currently used in cosmic ray experiments.

#### **11. Conclusions**

The FFT32 routine has been successfully and cost-effectively implemented in the powerful FPGA EP3C120F780C7 from the Cyclone® III family, used in the development kit DK-DSP-3C120N (Altera®), and in the EP4CE115F29C7 from the Cyclone® IV family of Altera®, used in the development kit DE2-115 (Terasic).

Nevertheless, both FPGAs from the Cyclone® III and IV families were treated as an engineering test platform for the development of the algorithm and the timing verification. The prototype targeted at real detection of radio signals coming from air showers developing in the atmosphere will be built on the basis of the Cyclone® V family.

The Pierre Auger Observatory is the largest cosmic ray experiment worldwide and has operated its southern observatory since 2004. Results from Auger South have shown that the spectrum of cosmic rays has a characteristic cut-off at ca. 50 EeV, that events with higher energies arrive anisotropically, and that cosmic rays at the highest energies are probably built from heavy nuclei. These results define the requirements for the next-generation experiment: it needs to be considerably increased in size, it needs a better sensitivity to composition, and it should cover the full sky. Such a facility, AugerNext, will be specified within the next 3-5 years.


Innovative research studies are needed in order to prepare an AugerNext proposal fulfilling these demands. The requested resources are primarily focused on the areas: consolidation of the detection of cosmic rays using MHz radio antennas; proof-of-principle of cosmic ray microwave detection; testing the large-scale application of new-generation photo sensors; generalization of data communication techniques; and development of a new technique of muon detection with surface arrays. Studies for such a next-generation cosmic ray experiment and the utilization of detection methods are principal elements of the ASPERA/ApPEC roadmaps.

These efforts are supported within ASPERA-2 [18] by the project "The Innovative Research Studies for the Next Generation Ground-Based Ultra-High Energy Cosmic-Ray Experiment: AugerNext".

#### **Acknowledgements**

This chapter has been supported by the National Centre for Research and Development (Poland) under Grant No. ERA-NET-ASPERA/02/11.

#### **Author details**

Zbigniew Szadkowski

<sup>⋆</sup> Address all correspondence to: zszadkow@kfd2.phys.uni.lodz.pl

University of Łódź, Faculty of Physics and Applied Informatics, Department of High-Energy Astrophysics, Łódź, Poland

The author has been a member of the Pierre Auger Collaboration since 1999.

#### **References**


[1] H. R. Allan, "Radio emission from extensive air showers", *Prog. in Elem. Part. and Cos. Ray Phys.*, vol. 10, pp. 171, 1971.

[2] H. Falcke, P. W. Gorham, "Detecting radio emission from cosmic ray air showers and neutrinos with a digital radio telescope", *Astropart. Phys.*, vol. 19, pp. 477-494, July 2003, ISSN: 0927-6505.

[3] J. Abraham et al. [Pierre Auger Collaboration], "Properties and Performance of the Prototype Instrument for the Pierre Auger Observatory", *Nucl. Instr. Meth.*, ser. A, vol. 523, pp. 50-95, May 2004, ISSN: 0168-9002.

[4] X. Bertou, P. Billoir, O. Deligny, C. Lachaud, A. Letessier-Selvon, "Tau Neutrinos in the Auger Observatory: A New Window to UHECR Sources", astro-ph/0104452.

[5] http://www.altera.com/products/ip/dsp/transforms/m-ham-fft.html

[6] G. A. Dulk, W. C. Erickson, R. Manning, and J.-L. Bougeret, "Calibration of low-frequency radio telescopes using the galactic background radiation", *A&A*, vol. 365, pp. 294-300, Jan. 2001, ISSN (Print Edition): 0004-6361, ISSN (Electronic Edition): 1432-0746.

[7] A. Schmidt, H. Gemmeke, A. Haungs, K-H. Kampert, C. Rühle, Z. Szadkowski, "An FPGA Based Trigger and RFI Filter for Radio Detection of Cosmic Rays", *IEEE Trans. Nucl. Science*, vol. 58, no. 4, pp. 1621-1627, Aug. 2011, ISSN: 0018-9499.

[8] Z. Ge, "Significance tests for the wavelet power and the wavelet power spectrum", *Ann. Geophys.*, vol. 25, pp. 2259-2269, 2007, ISSN: 1593-5213.

[9] Z. Szadkowski, K-H. Becker, K-H. Kampert, "Development of a new first level trigger for the surface array in the Pierre Auger Observatory based on the Cyclone™ Altera® FPGA", *Nucl. Instr. Meth.*, vol. A545, pp. 793-802, June 2005, ISSN: 0168-9002.

[10] Z. Szadkowski et al., "The 3rd Generation Front-End Cards of the Pierre Auger Surface Detectors: Test Results and Performance in the Field", *Nucl. Instr. Meth.*, vol. A606, pp. 439-445, July 2009, ISSN: 0168-9002.

[11] Z. Szadkowski, "16-point Discrete Fourier Transform based on the Radix-2 FFT algorithm implemented into Cyclone™ FPGA as the UHECR trigger for horizontal air showers", *Nucl. Instr. Meth.*, vol. A560, pp. 309-316, May 2006, ISSN: 0168-9002.

[12] Z. Szadkowski, D. Nitz, "Implementation of the first level surface detector trigger for the Pierre Auger Observatory Engineering Array", *Nucl. Instr. Meth.*, vol. A545, pp. 624-631, June 2005, ISSN: 0168-9002.

[13] Z. Szadkowski, "The concept of an ACEX® cost-effective first level surface detector trigger in the Pierre Auger Observatory", *Nucl. Instr. Meth.*, vol. A551, pp. 477-486, Oct. 2005, ISSN: 0168-9002.

[14] Z. Szadkowski, "Trigger Board for the Auger Surface Detector with 100 MHz Sampling and Discrete Cosine Transform", *IEEE Trans. Nucl. Science*, vol. 58, no. 4, pp. 1692-1700, Aug. 2011, ISSN: 0018-9499.

[15] Z. Szadkowski, "A spectral 1st level FPGA trigger for detection of very inclined showers based on a 16-point Discrete Cosine Transform for the Pierre Auger Experiments", *Nucl. Instr. Meth.*, vol. A606, pp. 330-343, July 2009, ISSN: 0168-9002.

[16] Z. Szadkowski, "An optimization of 16-point Discrete Cosine Transform Implemented into a FPGA as a Design for a Spectral First Level Surface Detector Trigger in Extensive Air Shower Experiments", *Applications of Digital Signal Processing*, InTech, ISBN 978-953-307-406-1, Croatia, 2011.

[17] Z. Szadkowski, "FPGA implementation of the 32-point DFT for a wavelet trigger in cosmic rays experiments", Real Time Conference, Berkeley, CA, June 2012.

[18] http://www.aspera-eu.org/

### *Edited by Gustavo Ruiz and Juan A. Michell*

Digital signal processing (DSP) covers a wide range of applications in which the implementation of high-performance systems to meet stringent requirements and performance constraints is receiving increasing attention both in the industrial and academic contexts. Conceived to be available to a wide audience, the aim of this book is to provide students, researchers, engineers and the industrial community with a guide to the latest advances in emerging issues in the design and implementation of DSP systems for application-specific circuits and programmable devices. The book is divided into different sections including real-time audio applications, optical signal processing, image and video processing and advanced architectures and implementations. It will enable early-stage researchers and developers to deal with the important gap in knowledge in the transition from algorithm specification to the design of architectures for VLSI implementations.
