
Fig. 4. Mask layout of the CMOS one-bit full-adder circuit

Fig. 5. Mask layout of the one-bit multiplexer circuit

#### **2.3 Delay time and area estimations for the compare-swap cell**

To obtain a reference for the switching speed of the one-bit CS circuit, an empirical delay-time estimation supported by SPICE simulations is performed. Because the speed of a CMOS gate is limited by the time taken to charge the load capacitances toward VDD and to discharge them toward GND (Rabaey, 2003), the parasitic capacitances induced by the layout structure are considered. To this end, parasitic-extraction software (e.g., the L-Edit extractor of Tanner EDA) can be used to obtain a circuit netlist file in which all these elements are incorporated. By running the SPICE simulation with the proper test-data fabrication model parameters (AMIS 0.5 microns), an accurate transient response is achieved. The resulting transient responses are analyzed to estimate the switching speed through the delay time (the difference between the 50% level of the input transition and the 50% level of the output). The simulated output voltage obtained for the one-bit CS circuit is shown in Fig. 6. In this simulation, a supply voltage (VDD) of 5 V and an overall frequency of 5 MHz are considered. Hereafter, the simpler notation 0 or 1 is used instead of the logic "0" or logic "1" notation. The SPICE simulation yields the outputs MAX(A,B)={0,1,1,1} and MIN(A,B)={0,0,0,1} when the inputs are A={0,0,1,1} and B={0,1,0,1}. It is important to notice that the signal CARRY\_OUT (Cout) is high only when A=0 and B=1 (the only case in which a swap is needed).
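The logic behaviour reported for the one-bit CS circuit can be reproduced with a short behavioural model. This is only a sketch: the function name `compare_swap_1bit` and the use of Cout as the multiplexer select are illustrative assumptions, not a description of the transistor-level design.

```python
def compare_swap_1bit(a: int, b: int) -> tuple[int, int, int]:
    """Behavioural model of a one-bit CS cell: returns (MAX, MIN, Cout)."""
    cout = (1 - a) & b          # high only when A=0 and B=1 (swap needed)
    max_ab = b if cout else a   # assumed: Cout steers B onto MAX when swapping
    min_ab = a if cout else b
    return max_ab, min_ab, cout


# Input sequences used in the simulation of Fig. 6
A = [0, 0, 1, 1]
B = [0, 1, 0, 1]

results = [compare_swap_1bit(a, b) for a, b in zip(A, B)]
MAX = [r[0] for r in results]
MIN = [r[1] for r in results]
COUT = [r[2] for r in results]
print(MAX, MIN, COUT)
```

For these inputs the model gives MAX = [0, 1, 1, 1] and MIN = [0, 0, 0, 1], with Cout asserted only for the second input pair.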

Fig. 6. Simulated output voltage obtained for the one-bit CS circuit

Fig. 7. Worst-case delay time for the one-bit CS circuit

VLSI Design of Sorting Networks in CMOS Technology 99


As expected, the worst-case delay time occurs in the swapping case. However, the delay time depends not only on the Cout propagation but also on the delay added by the transmission gate. According to the simulation, a delay of 1.3 ns is exhibited. This delay time is shown in Fig. 7, where the dashed line indicates the input B=1 when A=0 and the solid line represents the propagated B datum after the swap operation.
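The 50%-to-50% delay definition used for these measurements can be sketched as follows. The waveform samples are hypothetical placeholders chosen for easy arithmetic, not the SPICE data behind Fig. 7, and the helper names `crossing_time` and `propagation_delay` are assumptions introduced for illustration.

```python
def crossing_time(t, v, level):
    """First time at which waveform v crosses `level` (linear interpolation)."""
    for i in range(1, len(t)):
        lo, hi = v[i - 1], v[i]
        if lo != hi and min(lo, hi) <= level <= max(lo, hi):
            return t[i - 1] + (level - lo) * (t[i] - t[i - 1]) / (hi - lo)
    raise ValueError("waveform never crosses the requested level")


def propagation_delay(t, v_in, v_out, vdd=5.0):
    """Delay between the 50% points of the input and output transitions."""
    half = 0.5 * vdd
    return crossing_time(t, v_out, half) - crossing_time(t, v_in, half)


# Hypothetical samples: time in ns, voltages in V (VDD = 5 V)
t = [0.0, 1.0, 2.0, 3.0, 4.0]
v_in = [0.0, 5.0, 5.0, 5.0, 5.0]    # input reaches 2.5 V at t = 0.5 ns
v_out = [0.0, 0.0, 0.0, 5.0, 5.0]   # output reaches 2.5 V at t = 2.5 ns

delay = propagation_delay(t, v_in, v_out)
print(f"delay = {delay} ns")
```

Applied to waveform points exported from the simulator, the same procedure would yield the delay figure read from the transient plot.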

An accurate silicon area estimation of the CS design can be computed directly in the layout editor by using a ruler tool (usually provided in this software). Figure 8 shows the CS cell layout with its length and width dimensions expressed in terms of lambda. From this figure, the estimated area is 30830 λ² = 0.0037767 mm².
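The conversion from λ² to mm² can be checked with a few lines of arithmetic. The value λ = 0.35 µm is an assumption: it is not stated in the text, but it is the value that reproduces the quoted figure.

```python
LAMBDA_UM = 0.35       # assumed lambda in micrometres (inferred, not stated in the text)
AREA_LAMBDA2 = 30830   # cell area in lambda^2, as read from Fig. 8

area_um2 = AREA_LAMBDA2 * LAMBDA_UM ** 2   # area in square micrometres
area_mm2 = area_um2 * 1e-6                 # 1 mm^2 = 1e6 um^2

print(f"{area_mm2:.7f} mm^2")
```

With this value of λ the result agrees with the 0.0037767 mm² reported above.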

Fig. 8. Silicon area estimation for the one-bit CS layout

#### **2.4 The n-bit compare-swap cell**

The one-bit CS circuit in Fig. 2 can be easily expanded into an *n*-bit structure. To illustrate how this expansion can be performed, the schematic diagram of a 4-bit CS circuit is shown in Fig. 9. Because the overall speed of the CS circuit is limited by the propagation delay of the Cout bits through the *n*-bit chain, an estimate of this time is essential for determining the speed performance. In addition to the delay along the critical path of Cout, the delay added by the multiplexer block is also taken into account.

Fig. 9. Block diagram for the 4-bit CS circuit


In Fig. 10, the simulated output voltage obtained for the 4-bit CS circuit is shown when inputs A and B are given by: [A3:A0]={ 0101 (5₁₀), 1001 (9₁₀), 0011 (3₁₀), 0100 (4₁₀) } and [B3:B0]={ 1010 (10₁₀), 0110 (6₁₀), 1100 (12₁₀), 0100 (4₁₀) }.
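A behavioural sketch of the *n*-bit CS chain helps to see how Cout ripples from the least significant cell to the most significant one. It assumes that each cell forms its carry as the majority of (NOT A, B, Cin) with Cin = 0 at the first cell, which is consistent with the one-bit behaviour (Cout = 1 only for A=0, B=1) but is not taken from the schematic; the names `majority` and `compare_swap` are illustrative.

```python
def majority(x: int, y: int, z: int) -> int:
    """Carry function of a full adder."""
    return (x & y) | (x & z) | (y & z)


def compare_swap(a: int, b: int, n: int = 4):
    """Return (MAX, MIN, carries) for two n-bit unsigned words."""
    carry = 0
    carries = []
    for i in range(n):                     # ripple from LSB to MSB
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        carry = majority(1 - a_i, b_i, carry)
        carries.append(carry)
    swap = carries[-1]                     # final Cout is 1 when A < B
    return (b, a, carries) if swap else (a, b, carries)


# Test vectors quoted for Fig. 10
A = [0b0101, 0b1001, 0b0011, 0b0100]
B = [0b1010, 0b0110, 0b1100, 0b0100]
outputs = [compare_swap(a, b)[:2] for a, b in zip(A, B)]
print(outputs)

# Worst case for carry propagation: every cell passes Cout = 1
worst_case_carries = compare_swap(0b0000, 0b1111)[2]
print(worst_case_carries)
```

Under this model the pair A=0000, B=1111 asserts the carry at every cell, which is the Cout=[1,1,1,1] case of Fig. 11.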

Fig. 10. Simulated output waveforms of the 4-bit CS circuit

Fig. 11. Carry out (Cout) propagation through the 4-bit CS circuit for the Cout=[1,1,1,1] case


Figure 11 depicts the Cout propagation, while Fig. 12 indicates the delay time between a signal and its corresponding output after a swap operation is performed. In these simulations, the delay time was also examined at an overall frequency of 5 MHz, with VDD = 5 V, and considering the worst case of Cout propagation. This case occurs when [A3:A0]=[0000] and [B3:B0]=[1111], which ensures that Cout = 1 is transferred at every one-bit CS basic cell.

Fig. 12. Delay time between B[0] and MAX[A3,B3]

#### **3. Median filtering for image denoising using sorting network**

To illustrate the application of the CS circuit in CMOS design, a digital architecture dedicated to median filtering for image denoising is taken as a reference. This kind of filtering technique is used to reduce impulsive noise in acquired images (Faundez, 2001). Its main advantage is that it reduces the loss of information, because each computed pixel value corresponds to one of the values already present in the image; its main characteristic is the requirement of a sorting operation (Vega et al., 2002).


Before describing this design, it is important to briefly explain the algorithm that serves as the basis for its digital architecture. The following notational conventions will be used: if *I*(*x*,*y*) is a grayscale image divided into (*m*×*n*) pixels (squares) and affected by impulsive noise, then a denoised image *IF*(*x*,*y*) can be obtained by applying a median filter algorithm. To obtain *IF*(*x*,*y*), the value of each output pixel must be computed by iteratively using a (3×3) square array (mask) of 9 pixels centered on *I*(*x*,*y*). The position of this mask is shifted along *I*(*x*,*y*) until the median filtering process is completed. It is worth mentioning that, because the mask operates over the neighbouring pixels, elements (for example, zeros) must be added around *I*(*x*,*y*), increasing its dimensions to (*m*+2)×(*n*+2). At each of these pixels, a sorting procedure is performed in three basic steps within the (3×3) mask: first, the pixels of the mask are sorted column by column, then row by row, and finally along the diagonal elements. After the sorting task is completed, the central element (median) of the mask is picked out and stored in *IF*(*x*,*y*) to construct the filtered image. An illustrative description of this median algorithm is depicted in Fig. 13. A more formal description of this algorithm can be found in (Jimenez et al., 2011).
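The padding and mask-shifting procedure described above can be sketched in software as follows. This is a minimal illustration, not the reference implementation: it uses Python's built-in `sorted` in place of the three-step sorting procedure, and the function name `median_filter_3x3` and the sample image are assumptions.

```python
def median_filter_3x3(image):
    """Apply a 3x3 median filter to a 2-D list of grey levels, with zero padding."""
    m, n = len(image), len(image[0])

    # Surround the image with zeros: the padded size is (m+2) x (n+2)
    padded = [[0] * (n + 2)]
    padded += [[0] + list(row) + [0] for row in image]
    padded += [[0] * (n + 2)]

    filtered = [[0] * n for _ in range(m)]
    for x in range(m):
        for y in range(n):
            # Gather the 9 pixels of the mask centred on image[x][y]
            window = [padded[x + dx][y + dy] for dx in range(3) for dy in range(3)]
            filtered[x][y] = sorted(window)[4]   # central element of the sorted mask
    return filtered


# Hypothetical 4x4 image with two impulsive-noise pixels (255 and 0)
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 0, 10],
    [10, 10, 10, 10],
]
denoised = median_filter_3x3(noisy)
for row in denoised:
    print(row)
```

In this example both impulses are replaced by the surrounding grey level. The four corner pixels become 0 because five of the nine mask elements there are padding zeros, a side effect of zero padding.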


Fig. 13. Graphical description for the median filter algorithm

#### **3.1 The sorting network block in the median filter algorithm**

A Knuth diagram of the sorting network procedure described in the median filter algorithm is shown in Fig. 14.

Notice that the above sorting network exhibits a very regular structure that is hierarchically partitioned into seven three-data blocks for median computing. The first stage of three blocks is dedicated to column-by-column sorting, the second stage of three blocks is devoted to row-by-row sorting, and the last block performs the diagonal sorting. It can also be observed that, after all data have propagated through the entire network, the median datum appears on bus line D4. If the (3×3) mask is defined as follows:

$$\text{MASK}_{I(x,y)} = \begin{vmatrix} \text{D0} & \text{D3} & \text{D6} \\ \text{D1} & \text{D4} & \text{D7} \\ \text{D2} & \text{D5} & \text{D8} \end{vmatrix} \tag{1}$$

Then the median datum (collected in D4) is computed through the following steps:

1. Column-by-column sorting: SORT(D0, D1, D2), SORT(D3, D4, D5), SORT(D6, D7, D8)
2. Row-by-row sorting: SORT(D0, D3, D6), SORT(D1, D4, D7), SORT(D2, D5, D8)
3. Diagonal sorting: SORT(D2, D4, D6)

Fig. 14. Knuth diagram for the sorting network included in the median filter algorithm

#### **3.2 Digital architecture for the image filtering based on sorting network**

In (Jimenez et al., 2011), an FPGA (Field Programmable Gate Array) implementation of image median filtering based on a sorting algorithm is reported. In this architecture two blocks can be distinguished: a nine-data accumulator and a nine-data sorting network module. The accumulator is a memory register in which the data received from the (3×3) mask are temporarily stored. The sorting network, which is in fact the kernel of the median filter architecture, is a nine-input/one-output combinational module. It is constituted by an array of seven three-data comparator modules, as shown in Fig. 14. This interconnection topology is directly related to the median algorithm because it follows the three steps already described: column sorting, row sorting and diagonal sorting. Although this block is able to output the nine data in a sorted sequence, only the datum in D4 is collected, since it represents the median.

To illustrate the correct performance of this architecture, results obtained from the FPGA implementation and from the algorithm coded in Matlab are compared. Figure 15 shows a group of images that have been intentionally corrupted by impulsive noise and then filtered directly by Matlab software and by FPGA hardware.

Fig. 15. Collection of images filtered by software (Matlab) and by the FPGA device

#### **3.3 Floor planning and design at layout level**

The main structural component of the sorting network presented in Section 3.1 is a three-data comparator. As shown in Fig. 14, this element can be built from a set of interconnected one-bit CS cells. Three 8-bit inputs, denoted A, B and C, can be identified. Three 8-bit CS blocks make it possible to collect the median datum on the middle bus, denoted MED(A,B,C), and the corresponding minimum and maximum data on the external buses, denoted MIN(A,B,C) and MAX(A,B,C). To minimize the layout area, the CS modules have been rotated and placed as illustrated in the floorplanning and layout of Fig. 16.

Fig. 16. Floorplanning (on left) and layout (on right) for the 8-bit three-data comparator included in the median filter
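The three-data comparator of Section 3.3 and the seven-block network of Section 3.1 can be modelled behaviourally as shown below. This is a sketch under stated assumptions: the order in which the three CS blocks are chained inside `sort3` is one common arrangement and is not taken from Fig. 16, and the names `cs`, `sort3` and `median9` are illustrative.

```python
def cs(a, b):
    """Compare-swap block: return (MIN, MAX) of two data words."""
    return (a, b) if a <= b else (b, a)


def sort3(a, b, c):
    """Three-data comparator from three CS blocks: returns (MIN, MED, MAX)."""
    a, b = cs(a, b)
    b, c = cs(b, c)   # c now holds the maximum
    a, b = cs(a, b)   # a holds the minimum, b the median
    return a, b, c


def median9(mask):
    """Median of the nine mask elements D0..D8 using seven three-data blocks."""
    d = list(mask)

    def apply(i, j, k):
        d[i], d[j], d[k] = sort3(d[i], d[j], d[k])

    for idx in [(0, 1, 2), (3, 4, 5), (6, 7, 8)]:   # column-by-column sorting
        apply(*idx)
    for idx in [(0, 3, 6), (1, 4, 7), (2, 5, 8)]:   # row-by-row sorting
        apply(*idx)
    apply(2, 4, 6)                                  # diagonal sorting
    return d[4]                                     # median appears on D4


# Hypothetical mask contents D0..D8 with two impulsive values
window = [10, 255, 10, 10, 0, 10, 10, 10, 10]
print(median9(window))
```

Because the network is built only from compare-swap operations, checking it on every combination of zeros and ones is sufficient to confirm that D4 always carries the median.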
