## 1. Introduction

The architecture of the embedded platform plays a significant role in ensuring that real-time systems meet their performance requirements. Moreover, software development suffers from increased implementation complexity and a lack of standard methodology for partitioning the implementation of signal-processing functionality across heterogeneous hardware platforms. For instance, a digital signal processor (DSP) is cheaper, consumes less power, and is easier to develop software for, but it has considerably higher latency and lower throughput than a field-programmable gate array (FPGA) [1]. For high-speed signal-processing (HSP) communication systems, such as cognitive radio (CR) [2, 3] and software-defined radio (SDR) [4], a DSP may fail to capture and process the received data due to data loss. In addition, implementing applications such as finite impulse response (FIR) filtering, the discrete wavelet transform (DWT), or the fast Fourier transform (FFT) in software limits the throughput, which is not sufficient to meet the requirements of high-bandwidth, high-performance applications. As a result, HSP systems are enhanced by off-loading complex signal-processing operations to hardware platforms.

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


A Comparative Performance of Discrete Wavelet Transform Implementations Using Multiplierless
http://dx.doi.org/10.5772/intechopen.76522


Although FPGAs exhibit increased development time and design complexity, they are preferred for meeting high-performance requirements for two reasons. First, they efficiently address signal-processing tasks that can be pipelined. Second, they provide a programmable circuit architecture with flexibility in computation, memory, speed, and power. However, an FPGA has its own resources, such as memory, configurable logic blocks (CLBs), and multipliers, that influence the performance and the selected algorithm. As a consequence, the choice of algorithm is determined by hardware resource availability and performance requirements. These factors affect each other and create many challenges that must be jointly optimized.

As an example, the discrete wavelet transform (DWT) [5–9], a linear signal-processing technique that transforms a signal from the time domain to the wavelet domain [10], employs various techniques for decomposing a signal into an orthonormal time series with different frequency bands. The decomposition is performed using a pyramid algorithm (PA) [10, 11] or a recursive pyramid transform (RPT) [12]. While the PA is based on convolutions with quadrature mirror filters, which is infeasible for hardware implementation, the RPT decomposes the signal x[n] into two parts using high- and low-pass filters, which can be implemented as FIR filters [13]. Figure 1 shows a four-tap FIR filter with four multipliers, known as a multiplier-accumulator (MAC) structure. In the MAC structure, the multipliers multiply an input with the filter coefficients $b_i$. It is clear that the direct implementation of an N-tap filter requires N multipliers.

This work focuses exclusively on implementing a one-level multiplierless DWT for a pattern-based cognitive communication system receiver (PBCCS) [8] by means of an FPGA. The DWT is required to extract the received signal's features. Then, the extracted features are fed into a multilayer perceptron (MLP) neural network (NN) to identify the received symbol. The most challenging part is that the NN could consume most of the available multipliers inside the FPGA. As an example, Ntoune et al. [14] implemented real-valued time-delay neural network (RVTDNN) and real-valued recurrent neural network (RVRNN) architectures with 600 and 720 multipliers, respectively, while the ML605 [15], ZC706 [16], and VC707 [17] boards offer 768, 900, and 2800 multipliers (DSP48Es), respectively.

Figure 1. Four-tap finite impulse response filter.
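As an illustration of the MAC structure of Figure 1, a direct-form FIR filter can be written in a few lines; each output sample costs one multiply per tap, which is why an N-tap filter needs N multipliers in hardware. This is a behavioral sketch with our own names and toy data:

```python
def fir_mac(x, b):
    """Direct-form N-tap FIR: each output needs one multiply per tap,
    mirroring the N-multiplier MAC structure of Figure 1."""
    N = len(b)
    return [sum(b[k] * x[n - k] for k in range(N) if n - k >= 0)
            for n in range(len(x))]

# Two-tap moving average as a minimal demonstration.
print(fir_mac([1, 2, 3, 4], [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```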

Although modern FPGAs come with a reasonable number of multipliers, designers prefer multiplier-free DWT architectures for several reasons. First, some of the multipliers can be reserved for tasks such as the pulse-shaping filter and the digital up- and down-converters used at SDR front ends. Second, in contrast to the DWT coefficients, the MLP weights depend on the learning step. Third, the MLP weights may change frequently at runtime in an adaptive manner, whereas the DWT coefficients are fixed and known in advance. Therefore, a multiplier-free DWT architecture simplifies the design process and allows designers to focus on the MLP design.

In this work, we present 1-D DWT implementations on an FPGA using memory-based approaches. The aim is to compare different implementations in terms of system performance and resource consumption. We demonstrate the implementation of Daubechies wavelets (DB2, DB4, and DB5) using the DAA and RNS approaches, which do not employ explicit multipliers in the design. Because the main focus of this work is on extracting the key features of a signal via the DWT, the inverse DWT (IDWT) and the high-pass filter coefficients are not considered in this work.
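For reference, the DB2 (four-tap) low-pass coefficients used in the sequel have a simple closed form, which a short script can verify against the standard orthonormality identities (variable names here are ours):

```python
import math

# Closed-form DB2 (Daubechies four-tap) low-pass analysis coefficients.
s3 = math.sqrt(3)
r2 = math.sqrt(2)
h = [(1 + s3) / (4 * r2),
     (3 + s3) / (4 * r2),
     (3 - s3) / (4 * r2),
     (1 - s3) / (4 * r2)]

# Orthonormal scaling filters satisfy sum(h) = sqrt(2) and sum(h^2) = 1.
print(h)
print(sum(h), sum(c * c for c in h))
```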

#### 1.1. Related work


Wavelet Theory and Its Applications

Implementations of the 1-D DWT for signal de-noising, feature extraction, pattern recognition, and compression can be found in [8, 9, 18, 19]. The conventional convolution-based DWT requires massive computation and consumes considerable area and power, which can be mitigated by the lifting-based DWT scheme introduced by Sweldens [20]. Although the lifting scheme computes the low- and high-pass outputs with fewer components, it may not be well suited to our application, owing to the PBCCS's nature, where the low-frequency components are far more important than the higher ones. Therefore, in this study, 1-D DWT decomposition implemented by means of filter banks is considered. Another advantage of convolution-based DWT over the lifting approach is that it does not require temporary registers to store intermediate results, and with an appropriate design strategy it can achieve better area and power efficiency [21].

Rather than the simplest implementation of an FIR filter via multipliers and an adder tree, a multiplier-free architecture is used because it results in a low-complexity system with high-throughput processing capability [22]. Fundamentally, there are two techniques for facilitating parallel processing: the distributed arithmetic algorithm (DAA) and the residue number system (RNS). DAA performs the inner product bit-serially with the aid of a lookup table (LUT) followed by shift-accumulate operations [23, 24]. Several techniques have been proposed to improve the design, such as the partial-sum technique [25], multiple memory banks [26, 27], and an LUT-less adder-based approach [28]. The DAA approach has been adopted in many applications, such as the least-mean-square (LMS) adaptive filter [29] and the square-root raised-cosine filter [30].

On the other hand, RNS is an integer number system in which operations are performed on the residues of division [31–33]. Eventually, the RNS results are converted back to the equivalent binary format using the Chinese remainder theorem (CRT) [34]. The key advantage of RNS is that it reduces one arithmetic operation to a set of concurrent but simple operations. Several applications, such as digital filters, benefit from RNS implementations, for example [35–37]. In addition, RNS has been combined with DAA in a single architecture, called RNS-DA [38, 39], which inherits the advantages of both approaches.


In this chapter, three major 1-D DWT approaches are implemented on FPGA-based platforms and compared in terms of performance and energy requirements. The implementations are compared for different numbers of multipliers, memory consumption, numbers of taps (N), and levels (L) of the transform to show their advantages. To the best of our knowledge, no detailed comparison of hardware implementations of the three major 1-D DWT designs exists in the literature. This comparison gives significant insight into which implementation is the most suitable for given values of the relevant algorithmic parameters. Although there are many efficient designs in the literature, we did not optimize the number of memories in any approach, so that the comparison remains fair.

The remainder of this chapter is organized as follows. Section 2 presents the preliminary information needed to understand the DWT and reviews the theoretical background of DAA and RNS. Section 3 describes the implementation of the discrete wavelet transform and provides an analytical comparison between the approaches. Section 4 presents the performance results. Finally, Section 5 concludes the chapter.

## 2. Fundamentals and basic concepts

#### 2.1. Discrete wavelet transform

The wavelet decomposition mainly depends on orthonormal filter banks. Figure 2 shows a two-channel wavelet structure for decomposition, where x[n] is the input signal, g[n] is the high-pass filter, h[n] is the low-pass filter, and ↓2 denotes down-sampling by a factor of two. The output of each low-pass filter is fed to the next level, so that each filter creates a series of coefficients ($a_i$ and $d_i$) that represent and compact the original signal information.

Figure 2. Multi-resolution wavelet decomposition. The block diagram of the two-channel four-level discrete wavelet transform decomposition (J = 3), which decomposes a discrete signal into two parts at each level. Note that ↓2 keeps one sample out of two, and $a_i$ and $d_i$ are the approximation and detail coefficients at level i, respectively.

Mathematically, a signal y[n] consists of high- and low-frequency components, as shown in Eq. (1). It shows that the signal can be represented using half of the coefficients, because they are decimated by 2:

$$y[n] = y_{high}[n-1] + y_{low}[n-1] \tag{1}$$

The decimated low-pass-filtered output is recursively passed through identical filter banks to add a varying-resolution dimension at every stage. Eqs. (2) and (3) represent the filtering process through a digital low-pass filter h[k] and a high-pass filter g[k], corresponding to convolution with the filters' impulse responses:

$$y_{low}[n] = \sum_{k} h[k] \cdot x[2n - k] \tag{2}$$

$$y_{high}[n] = \sum_{k} g[k] \cdot x[2n - k] \tag{3}$$

where the index 2n accounts for the down-sampling. The outputs $y_{low}[n]$ and $y_{high}[n]$ are the approximation and detail signals, respectively [40].
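The decomposition of Eqs. (2) and (3) can be sketched behaviorally as a decimated convolution. The Haar filters below are chosen only to keep the numbers small; they are not the DB2/DB4/DB5 filters implemented later:

```python
import math

def dwt_level(x, h, g):
    """One decomposition level per Eqs. (2)-(3):
    y_low[n] = sum_k h[k]*x[2n-k], y_high[n] = sum_k g[k]*x[2n-k]."""
    y_low, y_high = [], []
    for n in range(len(x) // 2):
        # Valid (k, 2n-k) index pairs inside the signal support.
        idx = [(k, 2 * n - k) for k in range(len(h)) if 0 <= 2 * n - k < len(x)]
        y_low.append(sum(h[k] * x[i] for k, i in idx))
        y_high.append(sum(g[k] * x[i] for k, i in idx))
    return y_low, y_high

h = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # Haar low-pass
g = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # Haar high-pass
a, d = dwt_level([4.0, 6.0, 10.0, 12.0], h, g)
print(a)   # approximation coefficients
print(d)   # detail coefficients
```

Feeding `a` back into `dwt_level` would produce the next stage of the pyramid in Figure 2.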

#### 2.2. Distributed arithmetic algorithm


The distributed arithmetic algorithm (DAA) eliminates multipliers by performing the arithmetic as a bit-serial computation [13]. Because the down-sampling process follows each filter (as shown in Figure 2), Eq. (2) can be rewritten without the decimation factor as

$$y_{low}[n] = \sum_{k=0}^{N-1} x[k] \cdot h[k] \tag{4}$$

Obviously, Eq. (4) is computationally intensive because the real-valued inputs must be multiplied by the filter coefficients. Eq. (4) can be simplified by representing x[k] in fixed-point arithmetic of length L:

$$x[k] = -x[k]_0 + \sum_{l=1}^{L-1} x[k]_l \, 2^{-l} \tag{5}$$

where $x[k]_l$ is the $l$th bit of x[k] and $x[k]_0$ is the sign bit. Substituting Eq. (5) into Eq. (4), the output of the filter becomes

$$y[n] = \left[ \sum_{l=1}^{L-1} 2^{-l} \cdot \sum_{k=0}^{N-1} h[k] \cdot x[k]_l \right] + \sum_{k=0}^{N-1} h[k] \cdot \left( -x[k]_0 \right) \tag{6}$$

Since $x[k]_l$ takes the value of either 0 or 1, $\sum_{k=0}^{N-1} h[k] \cdot x[k]_l$ can take only $2^N$ possible values. That is, rather than computing the summation online at each iteration, these values can be precomputed and stored in a ROM indexed by the bits $x[k]_l$. In other words, Eq. (6) realizes the sum-of-products computation using only memory (LUT), adders, and a shift register.
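A behavioral model of this LUT-plus-shift-accumulate scheme may clarify the data flow. The sketch below stores floating-point LUT contents for readability, whereas the hardware design in Section 3 stores Q5,16 fixed-point words; all names here are ours:

```python
def daa_lut(h):
    """Precompute the 2^N possible values of sum_k h[k]*b_k, indexed by
    the bit vector (b_0, ..., b_{N-1}) packed into the ROM address."""
    N = len(h)
    return [sum(h[k] for k in range(N) if (addr >> k) & 1)
            for addr in range(1 << N)]

def to_bits(v, L):
    """L-bit two's-complement fractional bits of v in [-1, 1), sign bit first,
    so that v = -b_0 + sum_{l>=1} b_l * 2^-l as in Eq. (5)."""
    q = round(v * (1 << (L - 1))) & ((1 << L) - 1)
    return [(q >> (L - 1 - i)) & 1 for i in range(L)]

def daa_inner_product(xs, h, L):
    """Multiplier-free inner product per Eq. (6): one LUT access per bit
    plane, then shift-accumulate; the sign-bit plane enters with weight -1."""
    lut = daa_lut(h)
    bits = [to_bits(v, L) for v in xs]
    addr = lambda l: sum(bits[k][l] << k for k in range(len(h)))
    y = -lut[addr(0)]                      # sign-bit term of Eq. (6)
    for l in range(1, L):
        y += lut[addr(l)] * 2.0 ** (-l)    # shift-accumulate term
    return y

# Inputs exactly representable in 8 fractional bits reproduce a plain MAC.
h = [2.0, 4.0, -1.0, 0.5]                  # illustrative "coefficients"
xs = [0.5, -0.25, 0.75, -0.5]
print(daa_inner_product(xs, h, 8))         # equals sum(h[k]*xs[k]) = -1.0
```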

#### 2.3. Residue number system

The RNS is a non-weighted number system that performs carry-free parallel addition and multiplication. In DSP applications requiring intensive computation, the carry-free propagation allows concurrent computation in each residue channel. The RNS moduli set, P = {m1, m2, …, mq}, consists of q channels. Each $m_i$ is a positive integer, and the moduli are pairwise relatively prime: the greatest common divisor gcd($m_i$, $m_j$) = 1 for i ≠ j.

Any number $X \in \mathbb{Z}_M = \{0, 1, \dots, M-1\}$ is uniquely represented in RNS by its residues $|X|_{m_i}$, the remainders of dividing X by each $m_i$, where M is defined in Eq. (7) as

$$M = \prod_{i=1}^{q} m_i = m_1 \cdot m_2 \cdots m_q \tag{7}$$

where M determines the range of unsigned numbers, [0, M − 1], and should be greater than the largest result to be computed. In addition, M can uniquely represent signed numbers. The implementation of the RNS-based DWT obtained from Eq. (4) is given by Eq. (8) as follows:

$$|y[n]|_{m_i} = \left| \sum_{k=0}^{N-1} \Big|\, |h[k]|_{m_i} \cdot |x[n-k]|_{m_i} \Big|_{m_i} \right|_{m_i} \tag{8}$$

for each $m_i \in P$. This implies that a q-channel DWT is implemented by q FIR filters that work in parallel.

Mapping from the RNS representation back to integers is performed by the Chinese remainder theorem (CRT) [34, 41, 42]. The CRT states that the binary/decimal representation of a number can be recovered from its RNS representation if all elements of the moduli set are pairwise relatively prime.

Designing a robust RNS-based DWT requires selecting a moduli set and implementing the residue-to-binary conversion hardware. The most widely studied moduli sets are built from powers of two, owing to their attractive arithmetic properties. For example, $\{2^n - 1,\ 2^n,\ 2^{n+1} - 1\}$ [43], $\{2^n - 1,\ 2^n,\ 2^n + 1\}$ [39], and $\{2^n,\ 2^{2n} - 1,\ 2^{2n} + 1\}$ [44] have been investigated.

For illustration, the moduli set $P_n = \{2^n - 1,\ 2^n,\ 2^{n+1} - 1\}$ is used here, for three reasons. First, the modulo adder (MA) is simple and identical for $m_1 = 2^n - 1$ and $m_3 = 2^{n+1} - 1$. Second, even for small n (n = 7), the dynamic range of $P_7$ is large, M = 4,145,280, which can efficiently express real numbers in the range [−2.5, 2.5] using a 16-bit fixed-point representation, provided scaling and rounding are done properly. We assume that this interval is sufficient to map the input values, whose magnitude does not exceed 2. Third, the reverse converter unit is simple and regular [42] owing to its simple circuit design.

## 3. DWT implementation methodology

#### 3.1. DWT implementation using DA

DAA hides the explicit multiplications behind a ROM lookup table. The memory stores all possible values of the inner product of a fixed w-bit input with every possible combination of the DWT filter coefficients. The input data, x[n], are signed fixed-point numbers of 22-bit width, with 16 binary-point bits (Q5,16). We assume that the memory contents have the same precision as the input, which is reasonable to give sufficiently high accuracy for the fixed-point implementation. As a consequence, 22 ROMs, each consisting of 16 words, are required. Each ROM stores every possible combination of the four DWT filter coefficients, and the final result is a 22-bit signed fixed-point number (Q5,16). To decrease the number of memories, the input width would have to be reduced, which would have an impact on the output precision.

Figure 3 shows the block diagram of the 1-bit DAA at position l. This block contains one ROM (4 × 22) and one shift register. Because the word length w of the input x is 22 bits, the actual design contains 22 memory blocks and 21 adders for summing up the partial results.

Figure 3. The block diagram of the DAA-based architecture of the DB2. For simplicity, one ROM and one shift register are shown; the actual design contains 22 ROMs and shift registers. >> denotes a shift by 16 − l bit positions, where 16 is the number of bits after the binary point.

#### 3.2. DWT implementation using RNS

The RNS-based DWT implementation has mainly three components: the forward converter, the modulo adders (MAs), and the reverse converter. The forward converter, also known as the binary-to-residue converter (BRC), converts a binary input number into residue numbers. By contrast, the reverse converter, or residue-to-binary converter (RBC), obtains the result in binary format from the residue numbers. We refer to the RNS system that does not include an RBC as a forward converter and modulo adders (FCMA), as illustrated in Figure 4.

3.2.1. The forward converter

The forward converter converts the result of multiplying an input number by a wavelet coefficient into q residue numbers via LUTs, shifts, and modulo adders, where q is the number of channels.
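Putting the RNS pieces together, the sketch below chains forward conversion, per-channel modular filtering in the spirit of Eq. (8), and CRT-based reverse conversion, using the $P_7$ moduli set. It is a behavioral model with our own names and toy integer data, not the hardware design:

```python
from math import prod

MODULI = [127, 128, 255]   # P7 = {2^7 - 1, 2^7, 2^(7+1) - 1}, M = 4,145,280

def forward(x, moduli=MODULI):
    """Binary-to-residue conversion: |x|_mi for every channel."""
    return [x % m for m in moduli]

def fir_channel(xs, h, m):
    """One residue channel of Eq. (8): modular multiply-accumulate."""
    N = len(h)
    return [sum((h[k] % m) * (xs[n - k] % m) for k in range(N) if n - k >= 0) % m
            for n in range(len(xs))]

def reverse(residues, moduli=MODULI):
    """Residue-to-binary conversion via the Chinese remainder theorem."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)   # pow(.., -1, m): mod inverse
               for r, m in zip(residues, moduli)) % M

# Round-trip check of forward/reverse conversion.
assert reverse(forward(123456)) == 123456

# q parallel channels, recombined sample by sample with the CRT.
xs = [3, 1, 4, 1, 5]        # integer (pre-scaled fixed-point) input samples
h = [2, 7, 1]               # toy integer "coefficients"
channels = [fir_channel(xs, h, m) for m in MODULI]
y_rns = [reverse([ch[n] for ch in channels]) for n in range(len(xs))]
print(y_rns)
```

Because every intermediate result stays below M = 4,145,280, the recombined output matches an ordinary FIR filter exactly; this is the dynamic-range condition stated above.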
