#### 3.2.3. Modulo m<sub>i</sub> multiplier

The multiplication of the received sample, X[i], by the filter coefficients, which are constants, is performed by indexing a ROM. As the word length, w, of the received sample X[i] increases, the memory size grows to 2<sup>w</sup> locations. In addition, q ROMs, one per modulus, are required to perform the modulo multiplication.

We propose a few improvements to this design. First, instead of reserving a dedicated memory for each modulus m<sub>i</sub>, a single ROM that contains all modulo results is used. Thus, each word at location j contains the q residues of h<sub>k</sub> ∗ j ∗ 2<sup>11</sup>. Figure 5 shows the internal BRC block design of the three-channel moduli set P<sup>7</sup> = {127, 128, 255}, with its memory map at the top right corner. It shows that, for a location j, the least significant 8 bits contain |h<sub>k</sub> ∗ j ∗ 2<sup>11</sup>|<sub>m<sub>3</sub></sub>, the next 7 bits contain |h<sub>k</sub> ∗ j ∗ 2<sup>11</sup>|<sub>m<sub>2</sub></sub>, and the most significant 7 bits contain |h<sub>k</sub> ∗ j ∗ 2<sup>11</sup>|<sub>m<sub>1</sub></sub>, which is generalized by Eq. (9). The advantage of this method is that no extra hardware is required to separate the individual residue values.

$$ROM(j) = \left|h_k \ast j \ast 2^{11}\right|_{m_1} \ast 2^{2n+1} + \left|h_k \ast j \ast 2^{11}\right|_{m_2} \ast 2^{n+1} + \left|h_k \ast j \ast 2^{11}\right|_{m_3}, \quad j \in [0, 2^w - 1] \tag{9}$$

Figure 5. The block diagram of the binary-to-residue converter for the three-channel RNS-based DWT, P<sup>7</sup> = {127, 128, 255}. Four identical memories are used for each tap. The upper corner shows the memory content at location j ∈ [0, 15].

A Comparative Performance of Discrete Wavelet Transform Implementations Using Multiplierless http://dx.doi.org/10.5772/intechopen.76522 119

As with the DAA-based approach, if the input word length is 16 bits, the ROM should contain 2<sup>16</sup> locations. One way to reduce the size of the memory is to divide it into four ROMs of 4 × 22 each. Figure 4 shows the block diagram of the binary-to-residue converter with four ROMs; each is indexed by four bits of x. However, the outputs of the ROMs must be combined to obtain the correct final result. It is worth noting that this division comes at a cost in terms of adders and registers.

With these improvements, the RNS-based multiplier works as follows. The 16-bit input X = (x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub>, x<sub>4</sub>) is divided into four segments. Each 4-bit segment x<sub>l</sub> is fed into one ROM, so that three outputs, corresponding to |h<sub>k</sub> ∗ x<sub>l</sub> ∗ 2<sup>11</sup>|<sub>m<sub>i</sub></sub>, are produced.

To obtain the final multiplication result, each m<sub>i</sub> output should be shifted by l positions, where l is the index of the lowest input bit of the segment (4, 8, or 12). The modular multiplication and shift for (2<sup>n</sup> – 1) and (2<sup>n+1</sup> – 1) can be achieved by a left circular shift (left rotate) by l positions, whereas the modular multiplication and shift for 2<sup>n</sup> can be achieved by a left shift by l positions [17]. Finally, the modulo adder adds the corresponding outputs (Figure 6).

Figure 6. The block diagram of the (2<sup>n</sup> – 1) modulo adder.
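The lookup-and-rotate scheme of this section can be modelled in software. The following Python sketch is only an illustration under stated assumptions: the names `rotl`, `build_rom`, and `brc_multiply` are hypothetical, the coefficient is assumed to be already scaled to an integer (for example 459 for h<sub>1</sub>), and the final accumulation uses Python's `%` operator in place of the hardware modulo adders.

```python
def rotl(value, shift, bits):
    """Rotate a `bits`-wide word left by `shift` positions (circular shift)."""
    shift %= bits
    mask = (1 << bits) - 1
    return ((value << shift) | (value >> (bits - shift))) & mask


def build_rom(coeff, n):
    """Build a 16-word ROM; word j packs the three residues of coeff*j as in Eq. (9)."""
    m1, m2, m3 = (1 << n) - 1, 1 << n, (1 << (n + 1)) - 1
    rom = []
    for j in range(16):
        p = coeff * j
        rom.append(((p % m1) << (2 * n + 1)) | ((p % m2) << (n + 1)) | (p % m3))
    return rom


def brc_multiply(coeff, x, n=7):
    """Return the residues of coeff*x for a 16-bit x using four 4-bit ROM lookups."""
    m1, m2, m3 = (1 << n) - 1, 1 << n, (1 << (n + 1)) - 1
    rom = build_rom(coeff, n)  # the same content is used by all four ROMs of a tap
    r1 = r2 = r3 = 0
    for seg in range(4):
        shift = 4 * seg                      # index of the segment's lowest bit
        word = rom[(x >> shift) & 0xF]       # one lookup per 4-bit segment
        w1 = word >> (2 * n + 1)             # most significant n bits: residue mod m1
        w2 = (word >> (n + 1)) & (m2 - 1)    # middle n bits: residue mod m2
        w3 = word & m3                       # least significant n+1 bits: residue mod m3
        r1 = (r1 + rotl(w1, shift, n)) % m1            # rotate for 2^n - 1
        r2 = (r2 + ((w2 << shift) & (m2 - 1))) % m2    # plain shift for 2^n
        r3 = (r3 + rotl(w3, shift, n + 1)) % m3        # rotate for 2^(n+1) - 1
    return r1, r2, r3
```

For the modulus 2<sup>n</sup>, segments whose shift is n or larger contribute zero, which matches the observation that shifting an n-bit number left by n or more positions always yields zero.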

#### 3.2.2. RNS-system number conversion

The received samples and wavelet coefficients are real-valued and might take small values. One of the main drawbacks of the RNS number representation is that it only operates on non-negative integers in [0, M – 1]. The DWT coefficients generally lie between −1 and 1. As a possible solution, we have divided the range of the RNS, [0, M – 1], to handle those numbers. In addition, the received sample X[i] is scaled up by shifting it y positions to the left (multiplying by 2<sup>y</sup>), which ensures that X[i] is a y-bit fixed-point integer. In a similar manner, the wavelet coefficients are scaled by shifting them z positions to the left. In our design, we set the filter scaling factor z to 11. Table 1 presents the low-pass filter of DB2 before and after scaling.

| Coefficient (h<sub>k</sub>) | Real value | RNS-system value |
|---|---|---|
| h<sub>0</sub> | −0.129409522550921 | −266 |
| h<sub>1</sub> | 0.224143868041857 | 459 |
| h<sub>2</sub> | 0.836516303737469 | 1713 |
| h<sub>3</sub> | 0.482962913144690 | 989 |

Table 1. The DB2 low-pass real and RNS-system number equivalent, multiplied by 2<sup>11</sup>.

Figure 4. The block diagram of DB2 RNS-based architecture. BRC is an abbreviation for binary-to-residue converter, RBC for residue-to-binary converter, and MA for modulo adder.

118 Wavelet Theory and Its Applications
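The scaling step can be checked numerically. The sketch below is a minimal illustration, not the authors' tool flow: it assumes that the scaled value is obtained by rounding toward negative infinity (`math.floor`), which happens to reproduce the integers of Table 1, and that a negative value is represented by its non-negative residue. The names `scale_coefficient` and `to_residues` are hypothetical.

```python
import math

# DB2 low-pass coefficients (real values from Table 1)
DB2_LOWPASS = [
    -0.129409522550921,
    0.224143868041857,
    0.836516303737469,
    0.482962913144690,
]
Z = 11  # filter scaling factor used in the text


def scale_coefficient(h, z=Z):
    """Scale a real coefficient by 2^z and round down (assumed rounding mode)."""
    return math.floor(h * 2 ** z)


def to_residues(value, moduli=(127, 128, 255)):
    """Residues of a (possibly negative) integer; Python's % is non-negative here."""
    return tuple(value % m for m in moduli)


scaled = [scale_coefficient(h) for h in DB2_LOWPASS]
residues = [to_residues(v) for v in scaled]
```

With the moduli set P<sup>7</sup>, the negative coefficient −266 maps to the residues (115, 118, 244), so no separate sign handling is needed inside the residue channels.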


Figure 7. The block diagram of the two-level RNS-based DWT design. FCMA represents the FIR-filtering process in RNS.

#### 3.2.4. Modulo adder (MA)

The modulo adders are required for adding the results from a modular multiplier as well as for a reverse converter. In this work, we have two MAs: one is based on 2<sup>n</sup> and the other is based on 2<sup>n</sup> – 1. The modulo 2<sup>n</sup> adder simply keeps the lowest n bits of the sum of two integers and ignores the carry. Figure 6 shows the block diagram of the 2<sup>n</sup> – 1 modulo adder.
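The behaviour of the two adders can be sketched as follows. This is an illustrative model only: the modulo 2<sup>n</sup> – 1 adder is written with an end-around carry, a common construction for this modulus, and it is an assumption that Figure 6 uses the same structure. The function names are hypothetical, and both operands are assumed to be valid residues.

```python
def add_mod_pow2(a, b, n):
    """Modulo 2^n addition: keep the lowest n bits and drop the carry."""
    return (a + b) & ((1 << n) - 1)


def add_mod_mersenne(a, b, n):
    """Modulo (2^n - 1) addition for operands in [0, 2^n - 2]."""
    mask = (1 << n) - 1
    s = a + b
    s = (s & mask) + (s >> n)       # end-around carry: add the carry-out back in
    return 0 if s == mask else s    # the all-ones pattern is a second encoding of zero
```

The same `add_mod_mersenne` function covers the 2<sup>n+1</sup> – 1 channel when it is called with n + 1 as the width.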

#### 3.2.5. The reverse converter

The Chinese remainder theorem (CRT) [34] provides the theoretical basis for converting a residue number into a natural integer. The reverse converter for the moduli set P<sup>n</sup> = {2<sup>n</sup> – 1, 2<sup>n</sup>, 2<sup>n+1</sup> – 1} can be efficiently implemented by four modulo adders and two multiplexers [42]. The output of the RBC is an unsigned (3n + 1)-bit integer. The actual signed number can be found by shifting the result y + z positions to the right, which is equivalent to dividing by 2<sup>(y + z)</sup>, where y and z are the scaling factors of the input and wavelet coefficients, respectively. Generally, the word length of one-level DWT is bounded by Eq. (10) and should not exceed (3n – 2) bits:

$$3n + 1 \geq y + z + 3 \tag{10}$$

| | DB2 | DB4 | DB5 |
|---|---|---|---|
| Number of filter taps | 4 | 8 | 10 |
| DA memory usage | 22 ∗ (4 × 22) | 22 ∗ (8 × 22) | 22 ∗ (10 × 22) |
| RNS memory usage | 16 ∗ (4 × 22) | 32 ∗ (4 × 22) | 40 ∗ (4 × 22) |

Table 2. Occupied memories when DA- and RNS-based approaches are used. The word length, w, is 22 and 16 bits for DA- and RNS-based, respectively.

As a consequence, the range of the moduli set should be greater than the maximum output, th<sub>o</sub>, which can be computed as follows:

$$th_o = \left(\sum_{k} h_k\right)^2 \ast \max(x[n]) \ast (2^z)^2 \ast 2^y \le M - 1 \tag{11}$$

where h<sub>k</sub> is the kth filter coefficient, x[n] is the input, y and z are the input and filter scaling factors, respectively, and M is the dynamic range of the moduli set.
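The reverse conversion and the range condition of Eq. (11) can be illustrated with a short software model. The sketch below uses the textbook CRT reconstruction rather than the adder-and-multiplexer converter of [42]. It assumes that negative values occupy the upper half of [0, M – 1], and it takes max(x[n]) = 1 as a placeholder for a normalized input. All function names are hypothetical.

```python
from math import prod


def moduli_set(n):
    """Three-channel moduli set {2^n - 1, 2^n, 2^(n+1) - 1}."""
    return ((1 << n) - 1, 1 << n, (1 << (n + 1)) - 1)


def crt_decode(residues, moduli):
    """Generic CRT reconstruction of the unsigned integer in [0, M - 1]."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        x += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return x % big_m


def to_signed(x, big_m):
    """Map the upper half of [0, M - 1] to negative values (assumed convention)."""
    return x - big_m if x > big_m // 2 else x


def max_output(sum_h, max_x, y, z):
    """Left-hand side of Eq. (11)."""
    return sum_h ** 2 * max_x * (2 ** z) ** 2 * 2 ** y


p10 = moduli_set(10)
M = prod(p10)
value = -266 * 1000                              # example signed product
residues = tuple(value % m for m in p10)
decoded = to_signed(crt_decode(residues, p10), M)
fits = max_output(1.5436, 1.0, 6, 11) <= M - 1   # values of y, z, and the sum from the text
```

With these inputs the bound holds for P<sup>10</sup> but not for P<sup>7</sup>, which is consistent with the choice of the larger moduli set.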

#### 3.3. Two-level DWT implementation

The two-level discrete wavelet transform comprises two one-level DWTs, where the output of the first level is fed into the second level (as shown in Figure 7). The one-level DWT at each level is identical, but the number of output samples is halved at each level. For example, if a signal of 1800 samples is applied to the input, then 900 and 450 samples are produced by the first and second levels, respectively.

Figure 7 shows the design of the two-level RNS-based DWT, which involves two FCMA blocks and two RBC blocks. Feeding the second FCMA requires converting the result of the first stage to binary, shifting the number by 11 positions, and converting it to a residue number again.
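The halving of the sample count across levels can be reproduced with a small floating-point model. The sketch below is not the RNS datapath: it only shows the low-pass (approximation) branch, and it assumes a periodic extension of the signal at the boundaries. The name `dwt_level` and the sine test signal are illustrative choices.

```python
import math

# DB2 low-pass coefficients (real values from Table 1)
H = [-0.129409522550921, 0.224143868041857, 0.836516303737469, 0.482962913144690]


def dwt_level(x, h=H):
    """Filter x with h (periodic extension) and keep every second output sample."""
    n = len(x)
    return [sum(h[k] * x[(i - k) % n] for k in range(len(h)))
            for i in range(0, n, 2)]


signal = [math.sin(2 * math.pi * i / 100) for i in range(1800)]
level1 = dwt_level(signal)   # 900 approximation samples
level2 = dwt_level(level1)   # 450 approximation samples
print(len(level1), len(level2))
```

In the RNS-based design, the output of the first level would additionally pass through the reverse converter, the rescaling shift, and a second binary-to-residue conversion before the second FCMA block.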

#### 3.4. Hardware complexity

#### 3.4.1. Memory usage


DAA and RNS techniques employ memory as a key resource to avoid multiplying two input variables. In each approach, as the number of filter taps increases, either the size or the number of memories changes. Assuming that the length of the received word is w bits and there are N filter taps, the size of a memory element can be expressed as a × b, where a and b are the word lengths in bits of the input and output, respectively. The value of a determines the size of the memory, 2<sup>a</sup>.

The total number of memory elements occupied by the DAA-based filter is w ∗ (N × 22). The output is a 16-bit fixed-point number and the word length is 22 bits. The number of memory elements remains constant as the number of filter taps increases, whereas the size of each memory increases exponentially to 2<sup>N</sup>.

On the other hand, the total number of memory elements occupied by an RNS-based filter is N ∗ log<sub>2</sub>(w) ∗ (4 × 22). This expression shows that the number of memory elements increases linearly with the number of filter taps, while the memory size remains constant (4 × 22). Table 2 shows a comparison of the memory usage with w = 16 for different DWT families.
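The two memory expressions can be evaluated for the filters of Table 2. The sketch below is illustrative only. It interprets a memory of size a × b as 2<sup>a</sup> words of b bits, as stated in the text, and derives a total bit count from that interpretation; the total is not a figure reported in the chapter. The function names are hypothetical.

```python
from math import ceil, log2

OUT_BITS = 22  # output word length b used in the text


def da_memory(w, taps):
    """DA-based filter: w memories, each addressed by `taps` bits."""
    return {"count": w, "addr_bits": taps, "total_bits": w * (2 ** taps) * OUT_BITS}


def rns_memory(w, taps, addr_bits=4):
    """RNS-based filter: N * log2(w) memories, each addressed by 4 bits."""
    count = taps * ceil(log2(w))
    return {"count": count, "addr_bits": addr_bits,
            "total_bits": count * (2 ** addr_bits) * OUT_BITS}


for name, taps in (("DB2", 4), ("DB4", 8), ("DB5", 10)):
    print(name, da_memory(22, taps), rns_memory(16, taps))
```

Under this interpretation, the DA total grows with 2<sup>N</sup> while the RNS total grows linearly with N, which is the trade-off described above.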

#### 3.4.2. Shift register and adder counts

The DAA-based implementation employs shift registers and adders to sum the result at each bit level (Figure 3). For a word length w with m magnitude bits, we need (w – 1) shift registers and (w – 1) two-input adders (data combined by a tree adder architecture). To handle negative numbers, the two's complement operation requires an additional (m – 1) shift registers and (m – 1) adders. Thus, for an l-level DA-based implementation, a total of l ∗ (w + m – 2) shift registers and two-input adders is required.

On the other hand, for a word length w and an N-tap filter, the q-channel FCMA implementation requires N BRC blocks and q ∗ (N – 1) MA blocks to compute the final result. Each BRC block has ⌈log<sub>2</sub> w⌉ – 1, ⌈log<sub>2</sub> n⌉ – 1, and ⌈log<sub>2</sub> w⌉ – 1 MA blocks for the 2<sup>n</sup> – 1, 2<sup>n</sup>, and 2<sup>n+1</sup> – 1 moduli, respectively. The count for modulo 2<sup>n</sup> depends on log<sub>2</sub>(n) because the shift operation is not circular, and shifting an n-bit number to the left by n positions or more always yields zero. Likewise, the RBC has four MA blocks (for 2<sup>n+1</sup> – 1), two multiplexers, and two subtractors. Thus, the total number of MA blocks in a one-level RNS-based implementation is

$$MA_t = 2N \ast \left(\lceil \log_2 w \rceil - 1\right)_{2^n - 1} + N \ast \left(\lceil \log_2 n \rceil - 1\right)_{2^n} + q \ast (N - 1) + 4 \tag{12}$$

For instance, a three-channel DB2 implementation requires nine MA blocks to sum the results, and the P<sup>7</sup> RNS-based implementation has a total of 45 MA blocks when w = 16.

Meanwhile, the number of RNS-based adders depends on the design of the MA block. For example, each MA block of (2<sup>n</sup> – 1) and (2<sup>n+1</sup> – 1) requires two adders, while each MA block of 2<sup>n</sup> requires one adder. Thus, a<sub>t</sub> = 12N + N(⌈log<sub>2</sub> n⌉ – 1) + 5 ∗ (N – 1) + 10 adders are required, which can be simplified as follows (summarized in Table 3):

$$a_t = 17N + N\left(\lceil \log_2 n \rceil - 1\right) + 5 \tag{13}$$
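The block counts of this section can be checked with a few lines of code. The sketch below follows the per-block counts given in the text: each BRC contributes ⌈log<sub>2</sub> w⌉ – 1 MA blocks for each of the two rotating channels and ⌈log<sub>2</sub> n⌉ – 1 for the 2<sup>n</sup> channel, the summation needs q ∗ (N – 1) MA blocks, and the RBC adds four. The adder count uses the unsimplified expression with the 12N term as given for w = 16. The function names are hypothetical.

```python
from math import ceil, log2


def ma_blocks(taps, w, n, q=3):
    """Total MA blocks of a one-level RNS-based filter."""
    per_brc = 2 * (ceil(log2(w)) - 1) + (ceil(log2(n)) - 1)
    return taps * per_brc + q * (taps - 1) + 4


def adders(taps, n):
    """Total adders: BRC terms, per-channel summation, and the RBC (w = 16)."""
    return 12 * taps + taps * (ceil(log2(n)) - 1) + 5 * (taps - 1) + 10


print(ma_blocks(taps=4, w=16, n=7))  # DB2 with the P7 moduli set
print(adders(taps=4, n=7))
```

For DB2 with P<sup>7</sup> and w = 16, `ma_blocks` returns the 45 MA blocks quoted in the text, of which 3 ∗ (4 – 1) = 9 belong to the summation stage.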

Table 3. Memory usage and adders for 1-L N-tap DA and RNS-based approaches DWT.

| Resources | RNS-based (n = 7), FCMA | RNS-based (n = 7), RBC | RNS-based (n = 10), FCMA | RNS-based (n = 10), RBC |
|---|---|---|---|---|
| Number of slice registers | 656 | 157 | 883 | 190 |
| Number of slice LUTs | 591 | 138 | 854 | 180 |
| Number of RAMB18E1 | 16 | 0 | 16 | 0 |
| Max. operating freq. (MHz) | 291.2 | 311.62 | 283.85 | 298.67 |
| Min. period (ns) | 3.434 | 3.21 | 3.523 | 3.348 |
| Estimated total power (mW) | 40.5 | 6.59 | 43.08 | 7.33 |
| Latency (clock cycle (CC)) | 6 | 6 | 6 | 6 |

Table 4. The resource use and system performance of the RNS components, that is, FCMA and RBC.

#### 4. Performance analysis and validation

Hardware analysis was carried out using the Xilinx System Generator for DSP (SysGen) [45], a high-level software tool that enables the use of the MathWorks MATLAB/Simulink model-based design environment to create and verify hardware designs for Xilinx FPGAs. Furthermore, the hardware-software co-simulation design was synthesized and implemented on the Xilinx ML605 Virtex 6 board [15].

The RNS and DA implementations are compared with the multiply-accumulate-based (MAC) DWT structure, as shown in Figure 8. We also consider the direct DWT implementation using an IP FIR Compiler 6.3 (FIR6.3) block [46], which provides a common interface to generate highly area-efficient and high-performance FIR filters (Figure 9).

Figure 8. The Simulink model of MAC-based one-level DB2 discrete wavelet transform. Filter coefficients are stored as constants.

Figure 9. The Simulink model of FIR-based one-level DB2 discrete wavelet transform. The IP FIR compiler 6.3 of the system generator is used.

For the RNS implementation, the moduli sets P<sup>7</sup> = {127, 128, 255} and P<sup>10</sup> = {1023, 1024, 2047} were used. The dynamic ranges of these sets are M = 4,145,280 and 2,144,338,944, respectively. The moduli set P<sup>10</sup> is selected because its dynamic range is greater than th<sub>o</sub> (Eq. (11) with y = 6, z = 11, and ∑h<sub>i</sub> = 1.5436). In all RNS-based implementations, the word length was set to 16 bits.

#### 4.1. Resource utilization and system performance


Table 4 summarizes the resource use of the RNS-based components, that is, the FCMA and the reverse converter (RBC). The RBC consumes fewer resources and less power. However, the operating frequency is similar for all components and is greater than that of the entire RNS-based filter.

Table 5 summarizes the resource consumption of each filter implementation. It shows that the MAC and IP FIR-based implementations have four multiplier units (DSP48E1s) with maximum frequencies of 296 and 472 MHz, respectively. By contrast, the proposed approaches are more complex than MAC; the DAA- and RNS-based implementations use 22 and 16 memory blocks (BRAMs), respectively, to store the pre-calculated wavelet coefficients. It also shows that the number of slice registers, slice LUTs, and occupied slices of the P<sup>10</sup> RNS-based implementation is greater than that of P<sup>7</sup> because the former has 31 output signals, while the latter has 22 output signals. As a result, the number of flip-flops is increased and the number of resources is approximately increased by one-third, while the maximum frequency in both designs is greater than 235 MHz.

| Resources | MAC | DA | FIR | RNS (n = 7) | RNS (n = 10) |
|---|---|---|---|---|---|
| Number of slice registers | 282 | 661 | 167 | 767 | 1089 |
| Number of slice LUTs | 128 | 520 | 71 | 721 | 1055 |
| Number of occupied slices | 58 | 188 | 60 | 240 | 358 |
| Number of DSP48E1s | 4 | 0 | 4 | 0 | 0 |
| Number of RAMB18E1 | 0 | 22 | 0 | 16 | 16 |
| Max. operating freq. (MHz) | 296.38 | 229.83 | 472.59 | 258.86 | 261.028 |
| Min. period (ns) | 3.374 | 4.351 | 2.030 | 3.863 | 3.831 |
| Estimated total power (mW) | 8.44 | 66.54 | 9.05 | 56.22 | 53.05 |

Table 5. The resource use and system performance of the DWT implementation for one-level DB2 implementation.

Table 6 shows a comparison between the DA- and RNS-based one-level DWT implementations when using larger filter banks, that is, DB4 and DB5. It shows that the DAA-based implementation occupies a fixed number of RAMB18E1 blocks. The number of memory elements of the DAA-based implementation is fixed and depends on the word length (Table 2).

However, as the number of filter taps increases, the memory size increases exponentially to 2<sup>N</sup>. By contrast, the number of memory elements used in the RNS-based implementation increases linearly with the number of filter taps. Similarly, the number of memories used in multilevel DAA-based and RNS-based implementations with l levels would be the aggregate of levels 1 through l.

| Resources | DA-based, DB2 | DA-based, DB4 | DA-based, DB5 | RNS (n = 7), DB2 | RNS (n = 7), DB4 | RNS (n = 7), DB5 |
|---|---|---|---|---|---|---|
| Number of slice registers | 650 | 737 | 780 | 767 | 1441 | 1898 |
| Number of slice LUTs | 521 | 539 | 568 | 721 | 132 | 1677 |
| Number of RAMB18E1 | 22 | 22 | 22 | 16 | 32 | 40 |
| Max. operating freq. (MHz) | 232.72 | 205.55 | 223.31 | 258.87 | 265.32 | 258.80 |

Table 6. Resource use for the DWT implementation of DB2, DB4, and DB5.

#### 4.2. Functionality verification

The discrete wavelet transform was simulated using the ModelSim simulator. Figure 10 shows that the MAC and DAA implementations have lower latency than the other approaches: the FIR-based and the P<sup>7</sup> and P<sup>10</sup> RNS-based implementations lag behind MAC and DAA by four clock cycles.

Figure 10. The output and latency of one-level DWT using a ModelSim simulator when a sine wave is applied. Each clock cycle is 10 ns.

#### 4.3. Precision analysis

We carried out the precision analysis for the DWT levels, and the result is presented in Table 7. The output bit precision is set to Q5,16 for all implementations. Table 7 shows the maximum performance based on the signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR). For P<sup>7</sup>, we could not achieve a better accuracy with the specified scaling factors because y + z = 19 < (3 ∗ 7) + 1 = 22. However, both DAA- and RNS-based approaches offer high signal quality with a PSNR of 73.5 and 56.5 dB, respectively. Figure 11 shows the effect of changing the scaling factors of P<sup>10</sup> for the DB2 RNS-based approach. The input scaling factor is increased from 8 to 13 bits and the filter scaling factor is increased from 11 to 18. As expected, lower scaling factors produce a PSNR of 56 dB, whereas the maximum PSNR of 84 dB is obtained when y = 12 and z = 16.

| Resources | FIR | MAC | DAA-based | RNS-based (P<sup>7</sup>) | RNS-based (P<sup>10</sup>) |
|---|---|---|---|---|---|
| Input precision | Q5,16 | Q5,16 | Q5,16 | y = 8 | y = 8 |
| Coefficients precision | Q0,12 | Q1,15 | Q0,15 | z = 11 | z = 11 |
| Internal word length | 22 bit | 22 bit | NA | 22 bit | 31 bit |
| SNR (dB) | 83.2 | 78.7 | 70.4 | 53.41 | 54.78 |
| PSNR (dB) | 86.3 | 81.8 | 73.5 | 56.5 | 57.9 |

Table 7. The SNR and PSNR values of different DWT implementations.

#### 4.4. Discussion

Hardware availability and system performance requirements are critical for selecting the appropriate architecture of the embedded platform. The number of DWT levels, filter taps, and word length has a substantial influence on the performance and complexity of the design.

Increasing the number of DWT levels has roughly the same effect on the operating frequency. Because the only change between the RNS-based implementations with P<sup>7</sup> and P<sup>10</sup> is the output signal width, the maximum operating frequencies change only slightly. Furthermore, the one-level DB2 filter bank was designed with maximum operating frequencies of 232 and 258 MHz for the DAA- and RNS-based approaches, respectively. However, all high-frequency implementations introduce a latency of at least 10 clock cycles for one-level DA-based DWT.

Another critical parameter that affects the DWT performance is the filter order. The DAA-based implementation outperforms the RNS-based one for filters with at most 10 taps. When the number of taps increases, the number of memory units and binary adders within the RNS-based implementation steadily increases, while the memory size is not affected, as shown in Table 2. The memory requirement of the DAA-based implementation increases exponentially with the number of filter taps.

In addition, the two approaches have different memory content. Whereas the memory content of the DAA-based implementation is consistent and identical, the memory content of the RNS-based implementation varies from tap to tap. This is because each memory stores the products of one filter coefficient, reduced by each modulus of the moduli set.

The word length determines the number of occupied memories in both implementations. As the word length increases, the number of memories within the DAA- and RNS-based approaches increases with w and w ∗ log<sub>2</sub>(w), respectively. Furthermore, the effect of the output word length on the accuracy and the internal structure cannot be neglected. The DAA-based approach requires large memories to achieve high precision. By contrast, the RNS-based approach can achieve high precision by adapting the scaling factors, which does not require any change to the design except updating the memory contents.

Figure 11. The impact of input and wavelet filter scaling factors of one-level RNS-based implementation with respect to P<sup>10</sup> and P<sup>13</sup> moduli sets on PSNR.

MAC-based approach. The former approaches are multiplierless architectures that intensively use memory to speed up the entire processing time.

Given implementation examples for experimental verifications and analysis, the approaches were simulated using Simulink and validated on a Xilinx Virtex 6 FPGA platform. The co-simulation results have also been verified and compared with the simulation environment. The complexity and optimization of multi-level DWT with respect to hardware structure provides a foundation for employing an appropriate algorithm for high-performance applications, such as in cognitive communication when combining the DWT analysis with machine-learning algorithms.

#### Author details

Husam Alzaq\* and Burak Berk Üstündağ

\*Address all correspondence to: alzaq@itu.edu.tr

Faculty of Computer Engineering, Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey

#### References

[1] Sklivanitis G, Gannon A, Batalama SN, Pados DA. Addressing next-generation wireless challenges with commercial software-defined radio platforms. IEEE Communications Magazine. 2016;54(1):59-67. DOI: 10.1109/MCOM.2016.7378427

[2] Mitola J, Maguire JGQ. Cognitive radio: Making software radios more personal. IEEE Personal Communications. 1999;6(4):13-18. DOI: 10.1109/98.788210

[3] Akyildiz IF, Lee WY, Vuran MC, Mohanty S. Next generation/dynamic spectrum access/cognitive radio wireless networks: A survey. Computer Networks. 2006;50(13):2127-2159. DOI: 10.1016/j.comnet.2006.05.001

[4] Mitola J. The software radio architecture. IEEE Communications Magazine. 1995;33(5):26-38. DOI: 10.1109/35.393001

[5] Yang P, Li Q. Wavelet transform-based feature extraction for ultrasonic flaw signal classification. Neural Computing and Applications. 2014;24(3–4):817-826. DOI: 10.1007/s00521-012-1305-7

[6] Madishetty SK, Madanayake A, Cintra RJ, Dimitrov VS. Precise VLSI architecture for AI based 1-D/2-D Daub-6 wavelet filter banks with low adder-count. IEEE Transactions on Circuits and Systems I: Regular Papers. 2014;61(7):1984-1993. DOI: 10.1109/TCSI.2014.2298283
