**3. Rank-Order Filtering (ROF) with a maskable memory**

Rank-order filtering (ROF), or order-statistic filtering, has been widely applied in speech and image processing [1]-[6]. Given a sequence of input samples $\{x_{i-k}, x_{i-k+1}, \ldots, x_i, \ldots, x_{i+l}\}$, the basic operation of rank-order filtering is to choose the *r*-th largest sample as the output $y_i$, where *r* is the rank order of the filter. This type of ROF is normally classified as *the non-recursive ROF*. Another type is called *the recursive ROF*. The difference between the two is that the input sequence of the recursive ROF is $\{y_{i-k}, y_{i-k+1}, \ldots, y_{i-1}, x_i, \ldots, x_{i+l}\}$; that is, previously computed outputs take the place of the past input samples. Unlike linear filtering, ROF can remove sharp discontinuities of small duration without blurring the original signal; therefore, ROF is a key component for signal smoothing and impulsive noise elimination. To provide this key component for various signal processing applications, we intend to design a configurable rank-order filter that features low cost and high speed.
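The two definitions can be compared with a small software sketch. It is only an illustration of the definitions, not the hardware proposed in this chapter; the function names, the border handling (outputs are produced only where the window fits) and the seeding of the recursive filter with the first *k* inputs are assumptions made for this example.

```python
def rank_order(window, r):
    """Return the r-th largest sample of window (r = 1 is the maximum)."""
    return sorted(window, reverse=True)[r - 1]


def rof_nonrecursive(x, k, l, r):
    """Window = k past inputs, the current input, and l future inputs."""
    return [rank_order(x[i - k:i + l + 1], r) for i in range(k, len(x) - l)]


def rof_recursive(x, k, l, r):
    """Same window, but the k past positions hold earlier outputs."""
    y = list(x[:k])  # assumption: the first k outputs are seeded with the inputs
    for i in range(k, len(x) - l):
        window = y[i - k:i] + list(x[i:i + l + 1])
        y.append(rank_order(window, r))
    return y[k:]


# One impulse (90) followed by a step edge from 3 to 8.
signal = [3, 3, 3, 90, 3, 3, 8, 8, 8]
print(rof_nonrecursive(signal, k=1, l=1, r=2))  # 3-point median
print(rof_recursive(signal, k=1, l=1, r=2))
```

With a 3-point window and *r* = 2 (the median), both variants remove the impulse and keep the step edge sharp, which is the behaviour described above.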

Study on Low-Power Image Processing for Gastrointestinal Endoscopy 263


Fig. 14. The FPGA prototype of the GICam-II image compressor.

Table 6. The Comparison Results with Previous GICam Works

| | JPEG designed by (19) | GICam image compressor (11) | Proposed GICam-II image compressor |
|---|---|---|---|
| Average PSNRY | 46.37 dB | 41.99 dB | 40.73 dB |
| Average compression rate | 82.20% | 79.65% | 82.28% |
| Average power dissipation | 876 mW | 14.92 mW | 9.17 mW |

Table 7. The Comparison Results with Existing Works

| Design | Area | Frequency (MHz) | Power (mW) | Supply voltage (V) |
|---|---|---|---|---|
| GICam image compressor (11) (evaluated) | 390k | 12.58 | 14.92 | 1.8 |
| X. Xie et al. (12)\* | 12600k | 40.0 | 6.2 | 1.8 |
| K. Wahid et al. (13) | 325k | 150 | 10 | 1.8 |
| X. Chen et al. (14)\* | 11200k | 20 | 1.3 | 0.95 |
| Proposed GICam-II image compressor (evaluated) | 318k | 7.96 | 9.17 | 1.8 |

\* includes analog and transmission circuit and SRAM

Many approaches to the hardware implementation of rank-order filtering have been presented in the past decades [8, 10-24]. Many of them are based on sorting algorithms [11, 22, 23, 25-28] and treat rank-order filtering as two steps: sorting and choosing. Papers [10, 19] proposed systolic architectures for rank-order filtering based on sorting algorithms such as bubble sort and bitonic sort. These architectures are fully pipelined for a high throughput rate at the expense of latency, but they require a large number of compare-swap units and registers. To reduce the hardware complexity, papers [8, 12, 14, 15, 29-32] present linear-array structures that maintain the samples in sorted order. For a sliding window of size *N*, a linear-array architecture consists of *N* processing elements and requires three steps per iteration: finding the proper location for the incoming sample, discarding the eldest sample, and shifting the samples that lie between the position of the newest sample and that of the eldest one. This three-step procedure is called delete-and-insert (DI). Although the hardware complexity is reduced to *O(N)*, the DI steps incur a large latency. Paper [31] further presents a micro-programmable processor for implementing median-type filters. Paper [20] presents a parallel architecture that uses a two-phase design to improve the operating speed. Its authors first modify the traditional content-addressable memory (CAM) into a shiftable CAM (SCAM) processor with shiftable memory cells and comparators, which exploits the CAM to parallelize the DI procedure. They then use the two-phase design to combine the delete and insert operations, so that the SCAM processor can finish the DI operations quickly and in parallel. Although the SCAM processor significantly increases the speed of the linear-array architecture, it can process only one new sample at a time and cannot efficiently process 2-D data: for an *n-by-n* window, it needs *n* DI procedures for each filtering computation. To obtain an efficient 2-D rank-order filter, papers [12, 27] present solutions for 2-D rank-order filtering at the expense of area.
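The delete-and-insert procedure can be illustrated in software. The sketch below is a sequential analogue, not a model of any of the cited linear-array or SCAM circuits; the function name `di_rank_filter` and the use of a FIFO to track the eldest sample are assumptions made for this example.

```python
from bisect import bisect_left, insort
from collections import deque


def di_rank_filter(samples, n, r):
    """Yield the r-th largest sample of each length-n sliding window."""
    fifo = deque()   # arrival order, used to identify the eldest sample
    ordered = []     # the same samples, kept in ascending order
    for x in samples:
        if len(fifo) == n:
            eldest = fifo.popleft()
            del ordered[bisect_left(ordered, eldest)]  # delete the eldest
        fifo.append(x)
        insort(ordered, x)                             # locate and insert
        if len(fifo) == n:
            yield ordered[n - r]                       # r-th largest


print(list(di_rank_filter([7, 5, 11, 14, 2, 8, 3], n=3, r=2)))
```

Each new sample triggers one delete and one insert on the sorted list, which mirrors why a 2-D window that receives *n* new samples per step needs *n* DI procedures.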

In addition to the sorting-based approaches, paper [35] applies the threshold decomposition technique to rank-order filtering. To simplify the VLSI complexity, that approach uses three steps: decomposition, binary filtering, and recombination, and it significantly reduces the area complexity from exponential to linear. Papers [10, 13, 16-18, 21, 33, 34] employ the bit-sliced majority algorithm for median filtering, the most popular type of rank-order filtering. The bit-sliced algorithm [36, 37] selects the ranked candidates bitwise and generates the ranked result one bit at a time. Basically, the bit-sliced algorithm for median filtering recursively executes two steps: majority calculation and polarization. The majority calculation, in general, dominates the execution time of median filtering. Papers [10], [17] and [37] present logic networks that implement the majority calculation. However, these circuits are slow and complex, so they cannot take full advantage of the bit-sliced algorithm. Some papers claim that this type of algorithm is impractical for logic-circuit implementation because of its exponential complexity [31]. Paper [21] uses an inverter as a voter for the majority calculation. It significantly improves both cost and processing speed, but the noise margin narrows as the number of inputs increases. The narrow noise margin makes the implementation impractical and limits the configurability of rank-order filtering.
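The three steps of threshold decomposition (decomposition, binary filtering, recombination) can be written out directly. This is a minimal software sketch of the general technique under the assumption of non-negative integer samples; it does not reproduce the circuit of paper [35], and the name `rank_by_threshold_decomposition` and the parameter `levels` are chosen for this example only.

```python
def rank_by_threshold_decomposition(window, r, levels):
    """Return the r-th largest of non-negative integers smaller than `levels`."""
    result = 0
    for t in range(1, levels):
        binary = [1 if x >= t else 0 for x in window]  # decomposition at level t
        filtered = 1 if sum(binary) >= r else 0        # binary rank-order filter
        result += filtered                             # recombination
    return result


samples = [7, 5, 11, 14, 2, 8, 3]
print(rank_by_threshold_decomposition(samples, r=4, levels=16))  # median
print(rank_by_threshold_decomposition(samples, r=1, levels=16))  # maximum
```

The loop runs once per threshold level, so a direct implementation grows with the number of levels rather than with the bitwidth, which is the cost that the bit-sliced approach avoids.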

Instead of using logic circuits, this paper presents a novel memory architecture for rank-order filtering based on a generic rank-order filtering algorithm. The maskable memory structure, called dual-cell random-access memory (DCRAM), is an extended SRAM structure with maskable registers and dual cells. The maskable registers allow the architecture to selectively read or write bit-slices, and hence speed up the "parallel read" and "parallel polarization" tasks. The maskable registers are controlled by a long-instruction-word (LIW) instruction set, which makes the proposed architecture programmable for various rank-order filtering algorithms, such as recursive and non-recursive ROFs. The proposed architecture has been implemented in TSMC 0.18 μm 1P6M technology and successfully applied to 1-D and 2-D ROF applications. For 9-point 1-D and 3-by-3 2-D ROF applications, the core size is 356.1 × 427.7 μm<sup>2</sup>. Post-layout simulation shows that the DCRAM-based processor can operate at 290 MHz with a 3.3 V supply and at 256 MHz with a 1.8 V supply. For image processing, the proposed processor is fast enough to process SVGA-format video in real time.

#### **3.1 The generic bit-sliced rank-order filtering algorithm**

Let $W_i = \{x_{i-k}, x_{i-k+1}, \ldots, x_i, \ldots, x_{i+l}\}$ be a window of input samples. The binary code of each input $x_j$ is denoted as $u_j^{B-1} \cdots u_j^1 u_j^0$. The output $y_i$ of the *r*-th order filter is the *r*-th largest sample in the input window $W_i$, denoted as $v_i^{B-1} \cdots v_i^1 v_i^0$. The algorithm determines the *r*-th order value sequentially, bit by bit, from the most significant bit (MSB) to the least significant bit (LSB). The *b*-th bit-slice of the input samples is defined as $u_{i-k}^b u_{i-k+1}^b \ldots u_i^b \ldots u_{i+l}^b$. To start with, we count the 1's in the MSB bit-slice of the input samples and use $Z_{B-1}$ to denote the result. If $Z_{B-1}$ is greater than or equal to $r$, then $v_i^{B-1}$ is 1; otherwise, $v_i^{B-1}$ is 0. Any input sample whose MSB has the same value as $v_i^{B-1}$ is considered a candidate for the *r*-th order sample. On the other hand, if the MSB of an input sample is not equal to $v_i^{B-1}$, the input sample is considered a non-candidate. Non-candidates are then polarized to either the largest or the smallest value. If the MSB of an input sample $x_j$ is 1 and $v_i^{B-1}$ is 0, the remaining (lower) bits of $x_j$ are set to 1's. Conversely, if the MSB of $x_j$ is 0 and $v_i^{B-1}$ is 1, the remaining (lower) bits of $x_j$ are set to 0's. After the polarization, the algorithm counts the 1's in the next lower bit-slice and then repeats the polarization procedure. Consequently, the *r*-th order value is obtained by iterating these steps bit by bit. The following pseudo code illustrates the generic bit-sliced rank-order filtering algorithm:

Given the input samples, the window size $N = l + k + 1$, the bitwidth $B$ and the rank $r$, do:

Step 1: Set $b = B - 1$.

Step 2: (Bit counting) Calculate $Z_b$ from $\{u_{i-k}^b, u_{i-k+1}^b, \ldots, u_i^b, \ldots, u_{i+l}^b\}$.

Step 3: (Threshold decomposition) If $Z_b \ge r$, $v_i^b = 1$; otherwise $v_i^b = 0$.

Step 4: (Polarization) If $u_j^b \ne v_i^b$, $u_j^m = u_j^b$ for $0 \le m \le b - 1$ and $i - k \le j \le i + l$.

Step 5: $b = b - 1$.

Step 6: If $b \ge 0$ go to Step 2.

Step 7: Output $y_i$.

Fig. 15 illustrates a bit-sliced ROF example for *N*=7, *B*=4, and *r*=1. Given the input samples 7 (0111<sub>2</sub>), 5 (0101<sub>2</sub>), 11 (1011<sub>2</sub>), 14 (1110<sub>2</sub>), 2 (0010<sub>2</sub>), 8 (1000<sub>2</sub>), and 3 (0011<sub>2</sub>), the generic algorithm produces 14 (1110<sub>2</sub>) as the output. At the beginning, the "Bit counting" step counts the 1's in the MSBs, which gives 3. Since this count is not less than *r*, the "Threshold decomposition" step sets the MSB of $y_i$ to '1'. Then, the "Polarization" step keeps the inputs with $u_j^3 = 1$ as candidates for the ROF output and polarizes the lower bits of the others to all 0's. After repeating the above steps with decreasing *b*, the output $y_i$ is 14 (1110<sub>2</sub>).
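Steps 1 to 7 and the Fig. 15 example can be checked with a short software sketch. It follows the pseudo code step by step on unsigned integers; the function name `bit_sliced_rof` and the use of Python integers in place of memory words are assumptions of this example, not part of the proposed hardware.

```python
def bit_sliced_rof(window, r, bits):
    """Return the r-th largest sample of `window` (unsigned, `bits` wide)."""
    u = list(window)  # working copy; non-candidates are polarized in place
    y = 0
    for b in range(bits - 1, -1, -1):            # Steps 1, 5 and 6
        z = sum((x >> b) & 1 for x in u)         # Step 2: bit counting
        v = 1 if z >= r else 0                   # Step 3: threshold decomposition
        y |= v << b
        low_mask = (1 << b) - 1                  # selects bits b-1 .. 0
        for j, x in enumerate(u):                # Step 4: polarization
            bit = (x >> b) & 1
            if bit != v:
                u[j] = (x & ~low_mask) | (low_mask if bit else 0)
    return y                                     # Step 7


samples = [7, 5, 11, 14, 2, 8, 3]
print(bit_sliced_rof(samples, r=1, bits=4))  # the Fig. 15 case
print(bit_sliced_rof(samples, r=4, bits=4))  # the median of the same window
```

For the Fig. 15 inputs with *r* = 1 the sketch returns 14, and with *r* = 4 it returns the median 7, without sorting the window.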

#### **3.2 The dual-cell RAM architecture for rank-order filtering**

As mentioned above, the generic rank-order filtering algorithm generates the rank-order value bit by bit without complex sorting computations. The main advantage of this algorithm is that rank-order filtering has low computational complexity and can be mapped to a highly parallel architecture. The algorithm has three main tasks: bit counting, threshold decomposition, and polarization. To implement these tasks efficiently, this paper presents an ROF processor based on a novel maskable memory architecture, as shown in Fig. 16. The memory structure scales with the window size simply by adding memory cells. Furthermore, with the instruction decoder and the maskable memory, the proposed architecture is programmable and flexible for different kinds of ROFs.

The dual-cell random-access memory (DCRAM) plays a key role in the proposed ROF architecture. The DCRAM has two fields, which allow the input data to be reused and the filtering process to be pipelined. For the one-dimensional (1-D) ROF, the proposed architecture receives one sample at a time. For the *n-by-n* two-dimensional (2-D) ROF, the architecture reads *n* samples into the input window within a filtering iteration. To speed up rank-order filtering and to pipeline the data loading with the filtering calculation, the data field loads the input data while the computing field performs the bit-sliced operations. Hence, the execution of the architecture has two pipeline stages: data fetching and rank-order calculation. In each iteration, the data-fetching stage first loads the input sample(s) into the data field and then copies the data field to the computing field. Once the input window is in the computing field, the rank-order calculation accesses the computing field bitwise and executes the ROF tasks.
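The role of the two fields can be sketched with a toy software model. It only shows why a separate computing field lets the stored window be reused after a destructive calculation; the class name `DualFieldWindow`, the policy of overwriting the oldest column, and the use of a sort as a stand-in for the bit-sliced calculation are all assumptions of this example, and the two stages run one after the other here rather than concurrently as in the hardware.

```python
class DualFieldWindow:
    """Toy model of the data field and computing field for an n-by-n window."""

    def __init__(self, n):
        self.n = n
        self.data_field = [0] * (n * n)       # receives the new samples
        self.computing_field = [0] * (n * n)  # consumed by the rank calculation
        self.next_column = 0                  # assumption: oldest column is overwritten

    def fetch(self, column):
        """Stage 1: load n new samples, then copy the window across."""
        assert len(column) == self.n
        start = self.next_column * self.n
        self.data_field[start:start + self.n] = column
        self.next_column = (self.next_column + 1) % self.n
        self.computing_field = list(self.data_field)  # the copy step

    def rank(self, r):
        """Stage 2: stand-in for the bit-sliced steps; it alters its own field only."""
        self.computing_field.sort(reverse=True)
        return self.computing_field[r - 1]


w = DualFieldWindow(3)
for col in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
    w.fetch(col)
print(w.rank(5), w.data_field)   # median of the first 3-by-3 window
w.fetch([10, 11, 12])            # only 3 new samples are needed for the next window
print(w.rank(5))
```

Because the calculation modifies only the computing field, the six samples shared by neighbouring windows stay intact in the data field and do not have to be reloaded.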

The computing field is in the maskable part of the DCRAM, which performs parallel reads for bit counting and parallel writes for polarization. The read-mask register (RMR) is configured to mask the unwanted bits of the computing field during a read operation. The value of the RMR is one-hot encoded so that a bit-slice can be read from the memory in parallel. The bit-sliced values then go to the Level-Quantizer for threshold decomposition. When the ROF performs polarization, the write-mask register (WMR) is configured to mask the bits that must stay untouched and to allow the polarization selector (PS) to polarize the lower bits of non-candidate samples. Since the structure of memory circuits is regular and the maskable scheme provides fast logic operations, the maskable memory structure features low cost and high speed. It outperforms logic networks in implementing bit counting and polarization.
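The effect of the two mask registers can be modelled with plain integers. This is a behavioural sketch only, with each memory word held as a Python integer; the function names `masked_read` and `masked_write` and the list-based word selection are assumptions made for the example.

```python
def masked_read(words, rmr):
    """Parallel read: a one-hot RMR selects the same bit of every word."""
    assert rmr != 0 and rmr & (rmr - 1) == 0, "RMR must be one-hot"
    return [1 if w & rmr else 0 for w in words]


def masked_write(words, selected, wmr, value):
    """Parallel write: force the WMR-enabled bits of the selected words to `value`."""
    return [((w | wmr) if value else (w & ~wmr)) if sel else w
            for w, sel in zip(words, selected)]


words = [7, 5, 11, 14, 2, 8, 3]            # the window of the Fig. 15 example
b = 3
bit_slice = masked_read(words, 1 << b)     # MSB bit-slice
v = 1                                      # threshold result for r = 1
non_candidates = [bit != v for bit in bit_slice]
words = masked_write(words, non_candidates, (1 << b) - 1, value=1 - v)
print(bit_slice, words)
```

After the first iteration the four samples whose MSB is 0 have their lower bits cleared, while 11, 14 and 8 remain candidates.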

To start the algorithm, the RMR is one-hot masked according to the value *b* in the generic algorithm, and the DCRAM then outputs the bit-slice $\{u_{i-k}^b, u_{i-k+1}^b, \ldots, u_i^b, \ldots, u_{i+l}^b\}$ on "**c\_d**". The bit-slice goes to both the Level-Quantizer and the PS. The Level-Quantizer performs Step 2 and Step 3 by summing the bits of the bit-slice into $Z_b$ and comparing $Z_b$ with the rank value *r*, which is stored in the rank register (RR). The bitwidth *w* of $Z_b$ is $\lceil \log_2 N \rceil$. Fig. 17 illustrates the block diagram of the Level-Quantizer, where FA denotes a full adder and HA denotes a half adder. The signals "S" and "C" of each FA or HA represent sum and carry, respectively. The circuit in the dash-lined box is a comparator. The comparator is implemented by a carry generator because the comparison result of $Z_b$ and *r* can be obtained from the carry output of $Z_b$ plus the two's complement of *r*. The carry output is the quantized value of the Level-Quantizer.

Fig. 15. An example of the generic bit-sliced ROF algorithm for *N*=7, *B*=4, and *r*=1.

Fig. 16. The proposed rank-order filtering architecture.

Fig. 17. The block diagram of the Level-Quantizer.
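The carry-based comparison used by the Level-Quantizer can be verified with a small behavioural sketch. It assumes that $Z_b$ and *r* both fit in *w* bits and replaces the FA/HA tree by an ordinary sum; the function names `maj` and `level_quantizer` are chosen for this example and are not taken from the design.

```python
def maj(a, b, c):
    """Three-input majority, the carry function of one adder stage."""
    return (a & b) | (b & c) | (a & c)


def level_quantizer(bit_slice, r, w):
    """Return 1 if the number of 1's in bit_slice is at least r, else 0."""
    z = sum(bit_slice)                      # stands in for the FA/HA tree
    assert 0 <= z < (1 << w) and 0 <= r < (1 << w)
    carry = 1                               # the "+1" of the two's complement of r
    for m in range(w):                      # ripple the carry from LSB to MSB
        z_bit = (z >> m) & 1
        not_r_bit = ((r >> m) & 1) ^ 1      # bit m of the one's complement of r
        carry = maj(z_bit, not_r_bit, carry)
    return carry                            # carry out = quantized value


print(level_quantizer([0, 0, 1, 1, 0, 1, 0], r=1, w=3))
```

Only the carry chain is evaluated; no sum bits of the subtraction are produced, which is why a carry generator is enough for the comparison.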

*Q*

*Q*

*Q*

CLR

gated by "rm[j]", to output the complement value of the stored bit to the dataline "c\_d[i]". The datalines of computing cells of each word will be then merged as a single net. Since the RMR is one-hot configured, each word has only a single bit being activated during the *read*

As shown in Fig.20, the dataline "c\_d[i]" finally goes to an inverter to pull up the weak '1', which is generated by the "rm[j]"-gated NMOS, and hence the signal "c d[i]" has the value of the *i*-th bit of each bit-slice. Because the ROF algorithm polarizes the non-candidate words with either all zeros or all ones, the bitline pairs of computing cells are merged as a single pair

Fig.21 illustrates the implementation of DCRAM with the floorplan. Each D*<sup>i</sup>* − C*<sup>i</sup>* pair is a maskable memory cell where D*<sup>i</sup>* denotes D\_cell(*i*) and C*<sup>i</sup>* denotes C\_cell(*i*). Each word is split into higher and lower parts for reducing the memory access time and power dissipation (64). The control block is an interface between control signals and address decoder. It controls wordlines and bitlines of DCRAM. When the write signal "wr" is not asserted, the control

*Q* SET

D

CLR

*Q* SET

CLR

*Q* SET

sr[0]

D

Study on Low-Power Image Processing for Gastrointestinal Endoscopy 269

c\_wl [0]

c\_wl [1]

c\_wl [s-1]

c\_d[s-1]

block will disassert all wordlines by the address decoder.

Fig. 18. The polarization selector (PS).

operation.

of "c in" and "c\_in".

c\_d[1]

c\_d[0]

D

Normally, the comparison can be made by subtracting $r$ from $Z\_b$. Since $Z\_b$ and $r$ are unsigned numbers, both have to be reformatted as two's complement numbers by adding a sign bit before the subtraction. In this paper, the reformatted numbers of $Z\_b$ and $r$ are expressed as $Z\_{b,S}$ and $r\_S$, respectively. Since both numbers are non-negative, their sign bits are equal to '0'. If $Z\_{b,S}$ is less than $r\_S$, the result of the subtraction, $\Delta$, will be negative; that is, the sign bit (or MSB) of $\Delta$ is '1'. Eq.34 shows the inequality of the comparison, where $\overline{r\_S}$ denotes the one's complement of $r\_S$ and **1** denotes $(00 \ldots 01)\_2$. Because the MSB of $Z\_{b,S}$ is '0' and the MSB of $\overline{r\_S}$ is '1', to satisfy Eq.34, the carry of $Z\_{b,S}^{w-1} + \overline{r\_S}^{\,w-1}$ must be equal to '0' so that the sign bit of $\Delta$ becomes '1'. To simplify the comparison circuit, instead of implementing an adder, we use the carry generator to produce the carry of $Z\_{b,S}^{w-1} + \overline{r\_S}^{\,w-1}$. Each cell of the carry generator is a majority (Maj) circuit that performs the Boolean function shown in Eq.35. Furthermore, we use an OR gate at the LSB stage because of Eq.36, where the constant **1** of Eq.34 acts as the carry-in. Thus, the dash-lined box is an optimized solution for the comparison of $Z\_b$ and $r$ without implementing the *bit-summation* and *sign-extension* parts.

$$
\Delta = Z\_{b,S} + \overline{r\_S} + \mathbf{1} < 0. \tag{34}
$$

$$\text{Maj}(A, B, C) = AB + BC + AC.\tag{35}$$

$$Z\_b^0 \cdot \overline{r}^0 + Z\_b^0 \cdot 1 + 1 \cdot \overline{r}^0 = Z\_b^0 + \overline{r}^0. \tag{36}$$
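The comparison of Eq.34 to Eq.36 can be checked with a short behavioural sketch. It models the carry generator as a ripple chain of majority cells fed with the bits of $Z\_b$ and of the one's complement of $r$, with the constant **1** as the carry-in. The function names and the word width `w` are illustrative assumptions, not names from the original design.

```python
def maj(a, b, c):
    """Majority function of Eq.35 on single bits."""
    return (a & b) | (b & c) | (a & c)


def less_than_rank(z, r, w):
    """Return 1 when z < r for unsigned w-bit z and r.

    Only the carry chain of z + ~r + 1 is evaluated; no sum bits are formed.
    """
    carry = 1  # the constant +1 of Eq.34 enters as the carry-in of the LSB stage
    for k in range(w):
        z_k = (z >> k) & 1
        nr_k = ((~r) >> k) & 1          # bit k of the one's complement of r
        carry = maj(z_k, nr_k, carry)   # at k = 0 this reduces to an OR gate
    # Sign bit of delta = 0 xor 1 xor carry, so delta < 0 exactly when carry == 0.
    return 1 - carry
```

The final carry is '1' exactly when $Z\_b \geq r$, so the inverted carry is the "less than" flag that the threshold decomposition needs.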

After the Level-Quantizer finishes the threshold decomposition, the quantized value goes to the LSB of the shift register, "sr[0]". Then, the polarization selector (PS) uses exclusive ORs (XORs) to determine which words should be polarized, as shown in Fig.18. The XORs examine the condition $u\_j^b \neq v\_i^b$ and select the words-under-polarization (WUPs) accordingly. When "**c\_wl**" is '1', the lower bits of the selected words will be polarized; the lower bits are selected by WMR. According to Step 4, the polarized value is $u\_j^b$, which is the inversion of $v\_i^b$. Since $v\_i^b$ is the value of "sr[0]", we invert the value of "sr[0]" to drive "c\_in", as shown in Fig.16.

As seen in the generic algorithm, the basic ROF repeatedly executes *Bit-counting*, *Threshold decomposition*, and *Polarization* until the LSB of the ROF result is generated. After executing the three main tasks *B* times, the ROF holds the result in the Shift Register. One cycle later, the result goes to the output register (OUTR). In this way, the proposed architecture is able to pipeline the iterations for high-performance applications.
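The three tasks can be summarised in a small software model. The sketch below is a behavioural illustration only: it assumes unsigned samples of `bits` bits, a rank `r` counted from the largest sample, and a plain Python list standing in for the computing field; the function name is hypothetical.

```python
def rank_order_filter(window, r, bits=8):
    """Return the r-th largest sample of `window`, processed MSB first."""
    words = list(window)   # stands in for the computing field
    result = 0             # stands in for the shift register
    for b in range(bits - 1, -1, -1):
        bit_slice = [(w >> b) & 1 for w in words]
        z = sum(bit_slice)                 # Bit-counting
        v = 1 if z >= r else 0             # Threshold decomposition
        result = (result << 1) | v
        low_mask = (1 << b) - 1            # bits below the current bit
        for j, u in enumerate(bit_slice):  # Polarization
            if u != v:
                # A non-candidate word gets its lower bits set to its own bit u,
                # i.e. all ones if it is larger, all zeros if it is smaller.
                fill = low_mask if u else 0
                words[j] = (words[j] & ~low_mask) | fill
    return result
```

Because polarized words keep contributing a fixed bit to every later bit-slice, the rank `r` never has to be updated during the iteration.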

#### **3.3 Implementation of dual-cell random-access memory**

Fig.19 illustrates a basic element of DCRAM. Each element has two cells, one for the data field and one for the computing field. The data cell is basically an SRAM cell with a pair of bitlines. The SRAM cell is composed of INV1 and INV2 and stores one bit of an input sample addressed by the wordline "d\_wl[i]". The computing cell performs three tasks: *copy*, *write*, and *read*. When the copy-line "cp" is high, the pair of INV3 and INV4 receives, through INV5 and INV6, a copy of the 1-bit datum in the data cell. The *copy* operation is unidirectional, and the pair of INV5 and INV6 guarantees this directionality. When the one-bit value stored in the computing cell needs to be polarized, "wm[j]" and "c\_wl[i]" are asserted, and the computing cell performs the *write* operation according to the bitline "c\_bl[j]" and its complement. When the ROF reads the bit-sliced value, the computing cell uses an NMOS, gated by "rm[j]", to output the complement of the stored bit to the dataline of "c\_d[i]". The datalines of the computing cells of each word are then merged into a single net. Since the RMR is one-hot configured, each word has only a single bit activated during the *read* operation.

As shown in Fig.20, the merged dataline finally goes to an inverter that pulls up the weak '1' generated by the "rm[j]"-gated NMOS; hence, the signal "c\_d[i]" carries the value of the *i*-th bit of each bit-slice. Because the ROF algorithm polarizes the non-candidate words with either all zeros or all ones, the bitline pairs of the computing cells are merged into a single pair, "c\_in" and its complement.

Fig.21 illustrates the implementation of DCRAM with the floorplan. Each D*<sup>i</sup>* − C*<sup>i</sup>* pair is a maskable memory cell, where D*<sup>i</sup>* denotes D\_cell(*i*) and C*<sup>i</sup>* denotes C\_cell(*i*). Each word is split into higher and lower parts to reduce the memory access time and power dissipation (64). The control block is an interface between the control signals and the address decoder. It controls the wordlines and bitlines of DCRAM. When the write signal "wr" is not asserted, the control block deasserts all wordlines through the address decoder.
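The *copy*, *write*, and *read* behaviour of one DCRAM word can be mimicked in software. The class below is a behavioural sketch under simplifying assumptions: one word of `bits` cells, "rm" and "wm" given as integer bit masks, and "c\_in" given as a single bit. The class and method names are hypothetical and say nothing about the transistor-level design.

```python
class DcramWord:
    """Behavioural model of one DCRAM word (data field + computing field)."""

    def __init__(self, bits=8):
        self.bits = bits
        self.data = 0        # data field (SRAM cells)
        self.computing = 0   # computing field

    def load(self, value):
        """Write a sample into the data field."""
        self.data = value & ((1 << self.bits) - 1)

    def copy(self):
        """Unidirectional copy from the data field to the computing field."""
        self.computing = self.data

    def read(self, rm):
        """Return the bit selected by the one-hot read mask rm."""
        if rm == 0 or rm & (rm - 1):
            raise ValueError("rm must be one-hot")
        return 1 if self.computing & rm else 0

    def write(self, wm, c_in):
        """Polarize the bits selected by the write mask wm to the value c_in."""
        if c_in:
            self.computing |= wm
        else:
            self.computing &= ~wm
```

Polarization only touches the computing field, so the original sample in the data field stays available for later windows.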

Fig. 19. A basic element of DCRAM.

Fig. 20. A DCRAM word mixing data field and computing field. D\_cell(*i*) denotes the data field of *i*-th bit and C\_cell(*i*) denotes the computing field of *i*-th bit.

Fig. 21. The floorplan of DCRAM.

#### **3.4 Instruction set of proposed ROF processor**

The proposed ROF processor is a core for impulsive noise removal and is controlled by an instruction sequencer. Fig.22 illustrates the conceptual diagram of the ROF processor. The instruction sequencer generates the instruction codes and controls the input/output streams. It can be a microprocessor or dedicated hardware.

Fig. 22. The conceptual diagram of the ROF processor.

Fig.23 lists the format of the instruction set. An instruction word contains two subwords: the data field instruction and the computing field instruction. Each instruction cycle can concurrently issue two field instructions to parallelize data preparation and ROF execution; hence, the proposed processor can pipeline ROF iterations. When one of the field instructions performs "no operation", DF\_NULL or CF\_NULL is issued. All registers in the architecture are updated one cycle after the instruction is issued.

The instruction SET resets all registers and sets the rank register RR for a given rank-order *r*. The instruction LOAD loads data from "**d\_in**" by asserting "wr" and setting "addr". The instruction COPY/DONE can perform the "COPY" operation or the "DONE" operation. When the bit value of *c* is '1', the DCRAM copies a window of input samples from the data field to the computing field. When the bit value of *d* is '1', the DCRAM wraps up an iteration by asserting "en" and puts the result into OUTR.

The instruction P\_READ is issued when the ROF algorithm executes bit-sliced operations. The field <*mask*> of P\_READ is one-hot coded. It allows the DCRAM to send a bit-slice to the Level-Quantizer and PS for the *Threshold decomposition* task. The instruction P\_WRITE is issued when the ROF algorithm performs the *Polarization* task. The field <*mask*> of P\_WRITE is used to set a consecutive sequence of 1's. The sequence can mask out the higher bits for polarization.
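To make the long-instruction-word format concrete, the following sketch packs one data field instruction and one computing field instruction into a 16-bit word. It assumes the layout read from Fig.23: bits 15-14 hold d\_mode, bits 13-10 hold the address, rank-order value, or control code, bits 9-8 hold c\_mode, and bits 7-0 hold the mask. All helper names are hypothetical, and the choice of a write mask covering the bits below the one just read is an assumption about how the consecutive 1's are used.

```python
D_SET, D_LOAD, D_COPY_DONE, D_NULL = 0b00, 0b01, 0b10, 0b11
C_READ, C_WRITE, C_NULL = 0b00, 0b01, 0b11


def df_set(rank):
    return (D_SET, rank)


def df_load(addr):
    return (D_LOAD, addr)


def df_copy_done(copy=False, done=False):
    # Control code "11cd": c = 1 requests COPY, d = 1 requests DONE.
    return (D_COPY_DONE, 0b1100 | (int(copy) << 1) | int(done))


def df_null():
    return (D_NULL, 0b1111)


def cf_read(bit):
    return (C_READ, 1 << bit)          # one-hot mask selecting one bit-slice


def cf_write(bit):
    return (C_WRITE, (1 << bit) - 1)   # consecutive 1's below `bit`


def cf_null():
    return (C_NULL, 0xFF)


def instruction(df, cf):
    """Pack a (d_mode, payload) pair and a (c_mode, mask) pair into 16 bits."""
    d_mode, payload = df
    c_mode, mask = cf
    if not (0 <= payload < 16 and 0 <= mask < 256):
        raise ValueError("field out of range")
    return (d_mode << 14) | (payload << 10) | (c_mode << 8) | mask
```

For example, `instruction(df_load(3), cf_read(7))` issues a LOAD to address 3 together with a read of the MSB bit-slice in the same cycle.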


| Data field instruction | d\_mode (bits 15-14) | Address, rank-order, or control code (bits 13-10) |
| --- | --- | --- |
| SET <*rank*> | 00 | rank-order value |
| LOAD <*address*> | 01 | address |
| COPY/DONE | 10 | 11*cd* (*c*=1, copy; *d*=1, done) |
| DF\_NULL | 11 | 1111 |

| Computing field instruction | c\_mode (bits 9-8) | Mask (bits 7-0) |
| --- | --- | --- |
| P\_READ <*mask*> | 00 | mask |
| P\_WRITE <*mask*> | 01 | mask |
| CF\_NULL | 11 | 11111111 |

Fig. 23. The format of the instruction set.

To generate instructions for the ROF processor, the complete 1-D non-recursive ROF circuit includes an instruction sequencer, as shown in Fig.24.

Since the instruction set is in the long-instruction-word (LIW) format, data fetching and ROF computing can be executed in parallel. The generated instruction stream can therefore pipeline the ROF iterations, and the data fetching is hidden in each ROF latency. Fig.25 shows the reservation table of the 1-D ROF example. As seen in the reservation table, the first iteration and the second iteration overlap at the seventeenth, eighteenth, and nineteenth clock steps. At the seventeenth clock step, the second iteration starts by loading a new sample while the first iteration processes the LSB bit-slice. At the eighteenth clock step, the second iteration copies samples from the data field to the computing field and reads the MSB bit-slice. At the same time, the first iteration prepares the first ROF result for OUTR. At the nineteenth clock step, the first iteration sends the result out while the second iteration performs the first polarization. Thus, the iteration period is 15 cycles.

Fig. 24. Block diagram of the 1-D non-recursive ROF.

Fig. 25. Reservation table of the 1-D non-recursive ROF.
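The 15-cycle period can be reproduced with a small scheduling sketch. It is one schedule consistent with the clock steps described above, not the authors' reservation table: it assumes 8-bit samples, one P\_READ per bit, one P\_WRITE per bit except the LSB (which needs no polarization), a COPY issued together with the first P\_READ, and a first LOAD at clock step 2. The function names and operation labels are hypothetical.

```python
def iteration_schedule(start, bits=8):
    """Map clock step -> operations for one iteration whose LOAD is at `start`."""
    ops = {start: ["LOAD"]}
    step = start + 1
    for b in range(bits - 1, -1, -1):
        ops.setdefault(step, []).append(f"P_READ bit {b}")
        step += 1
        if b > 0:  # no polarization is needed after the LSB bit-slice
            ops.setdefault(step, []).append(f"P_WRITE bits {b - 1}..0")
            step += 1
    ops.setdefault(step, []).append("DONE")
    ops[start + 1].insert(0, "COPY")  # COPY shares the cycle of the MSB read
    return ops


def iteration_period(bits=8):
    """Reads plus writes per iteration: bits + (bits - 1)."""
    return 2 * bits - 1
```

With these assumptions, `iteration_schedule(2)` reads the LSB bit-slice at step 17 and issues DONE at step 18, while `iteration_schedule(17)` loads at step 17, copies and reads the MSB at step 18, and polarizes at step 19.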

### **3.5 Application of the proposed ROF processor**

In Section 3.4, we use 1-D non-recursive ROF as an example to show the programming of the proposed ROF processor. Due to the programmable design, the proposed ROF processor can implement a variety of ROF applications. The following subsections will illustrate the optimized programs for three examples: 1-D RMF, 2-D non-recursive ROF, and 2-D RMF.

### **3.6 1-D recursive median filter**

Recursive median filtering (RMF) has been proposed for signal smoothing and impulsive noise elimination. It can effectively remove sharp discontinuities of small duration without blurring the original signal. The RMF recursively searches for the median among the most recent median values and the input samples. The input window of the RMF can thus be denoted as $\{y\_{i-k}, y\_{i-k+1}, \ldots, y\_{i-1}, x\_i, \ldots, x\_{i+l}\}$, where $y\_{i-k}, y\_{i-k+1}, \ldots, y\_{i-1}$ are the most recent median values and $x\_i, \ldots, x\_{i+l}$ are the input samples, and the result $y\_i$ is the $\lceil (l + k + 1)/2 \rceil$-th value of the input window.
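As a software reference for this definition, the sketch below computes a 1-D RMF directly by sorting each window. The boundary handling is an assumption made for illustration (the first *k* and last *l* samples pass through unchanged, and the input is assumed to hold at least *k* + *l* + 1 samples), and the function name is hypothetical.

```python
def recursive_median_filter(x, k, l):
    """1-D recursive median filter over {y[i-k..i-1], x[i..i+l]}."""
    n = len(x)
    y = list(x[:k])  # assumed boundary rule: seed the first k outputs with inputs
    for i in range(k, n - l):
        window = y[i - k:i] + list(x[i:i + l + 1])
        rank = (len(window) + 1) // 2             # ceil((l + k + 1) / 2)
        y.append(sorted(window, reverse=True)[rank - 1])
    y.extend(x[n - l:])  # assumed boundary rule: trailing samples pass through
    return y
```

Each output is appended to `y` before the next window is formed, which is exactly the data dependency that limits the throughput of a hardware RMF.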

Fig.26 demonstrates the implementation of the 1-D RMF. To recursively perform RMF with previous median values, the *i*-th iteration of the 1-D RMF loads two inputs to the DCRAM: one is $x\_{i+l}$ and the other is $y\_{i-1}$. As shown in Fig.26, a 2-to-1 multiplexer, controlled by the instruction sequencer, switches the input stream to the data field; the input stream comes from either "**d\_in**" or "**d\_out**". When the proposed ROF processor receives the input stream, the program arranges the data storage as shown in Fig.26. The data storage shows the data reusability of the proposed ROF processor.

Fig. 26. Block diagram of the 1-D RMF.

As mentioned above, the input stream to the DCRAM comes from either "**d\_in**" or "**d\_out**". The statements "set input\_sel, 0;" and "set input\_sel, 1;" assert the signal "input\_sel" to switch the input source accordingly. The statements "LOAD i, CF\_NULL;" and "LOAD i, CF\_NULL;" are employed for the data stream, as per Fig.27. As seen in Fig.28, the throughput rate is limited by the recursive execution of the 1-D RMF; that is, the second iteration cannot load the newest median value until the first iteration has sent its result to the output. Nevertheless, we optimized the throughput rate as much as possible. At the twentieth clock step, the program overlaps the first iteration and the second iteration so that data fetching and result preparation can run at the same time. As a result, the sample period is 18 cycles.

Fig. 27. The flow for data storage of the 1-D RMF.

Fig. 28. Reservation table of the 1-D RMF.


#### **3.6.1 2-D non-recursive rank-order filter**

Fig.29 illustrates the block diagram for the 2-D non-recursive ROF. From Fig.30, each iteration needs to update three input samples (the pixels in the shadow region) for the 3 × 3 ROF; in general, only *n* input samples need to be updated in each iteration for the *n* × *n* ROF. To reuse the windowing data, the data storage is arranged as shown in Fig.31. The data reusability of our processor is therefore high for the 2-D ROF. Given a 2-D *n* × *n* ROF application with *n*=3 and *r*=5, the optimized reservation table can be scheduled as in Fig.32.
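The data reuse can be illustrated with a software sketch in which the window storage holds *n* column slots and each iteration overwrites only the oldest column. The slot selection by `j % n`, the restart of the storage on every row band, and the function name are assumptions made for illustration; the rank is taken by sorting rather than by the bit-serial hardware.

```python
def rof_2d(image, n=3, r=5):
    """n-by-n rank-order filter that rewrites only n samples per window step."""
    rows, cols = len(image), len(image[0])
    out = []
    for top in range(rows - n + 1):
        store = [0] * (n * n)   # stands in for the n*n words of the data field
        row_out = []
        for j in range(cols):
            slot = j % n        # the oldest column slot is overwritten
            for i in range(n):
                store[slot * n + i] = image[top + i][j]
            if j >= n - 1:      # the window is complete once n columns are loaded
                row_out.append(sorted(store, reverse=True)[r - 1])
        out.append(row_out)
    return out
```

The order of the words inside `store` does not matter for a rank-order result, which is why a circular replacement of columns is sufficient.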

Fig. 29. Block diagram of the 2-D non-recursive ROF with 3-by-3 window.

#### **3.6.2 2-D recursive median filter**

Similar to the 1-D RMF, the two-dimensional (2-D) *n*-by-*n* RMF finds the median value of a window formed by previously calculated median values and input values. Fig.33(a) shows the content of the 3 × 3 window centered at (*i*, *j*). At the end of each iteration, the 2-D 3 × 3 RMF substitutes the central point of the current window with the median value. The renewed point is then used in the next iteration. The windowing for 2-D RMF iterations is shown in Fig.33(b), where the triangles represent the previously calculated median values and the pixels in the shadow region are updated at the beginning of each iteration. According to the windowing, Fig.34 illustrates the data storage for a high degree of data reusability. Finally, we can implement the 2-D RMF as the block diagram illustrated in Fig.35. Given a 2-D 3 × 3 RMF application, the optimized reservation table can be scheduled as in Fig.36.
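A direct software rendering of this substitution rule is a raster scan that writes each median back into a working copy of the image. In the sketch below, leaving the border pixels unchanged is an assumption made for illustration, and the function name is hypothetical.

```python
def recursive_median_2d(image, n=3):
    """2-D recursive median filter with in-place substitution (odd n)."""
    rows, cols = len(image), len(image[0])
    half = n // 2
    work = [list(row) for row in image]   # medians are written back here
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            window = [work[i + di][j + dj]
                      for di in range(-half, half + 1)
                      for dj in range(-half, half + 1)]
            work[i][j] = sorted(window)[len(window) // 2]
    return work
```

When the window is centered at (*i*, *j*), row *i* − 1 and the pixel at (*i*, *j* − 1) already hold medians, while the remaining pixels are still input samples.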

Fig. 30. The windowing of the 3 × 3 non-recursive ROF.

Fig. 31. The data storage of the 2-D non-recursive ROF.

Fig. 32. Reservation table of the 2-D ROF.

Fig. 33. (a) The content of the 3 × 3 window centered at (*i*, *j*). (b) The windowing of the 2-D RMF.

#### **3.7 The fully-pipelined DCRAM-based ROF architecture**

As seen in Section 3.5, the reservation tables are not tightly scheduled because the dependency of bit-slicing read, threshold decomposition, and polarization forms a cycle. The dependency cycle limits the schedulability of ROF tasks. To increase the schedulability, we further extended the ROF architecture to a fully-pipelined version at the expense of area. The

Cycle Time 2 3 4 5 6 7 8 9 10 11

12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41

**c\_d**

42 43

OUTR

en

sen

Shift Register

**d\_out**

**rank**

RR

clock

clk

ren

LQ1 LQ2

L evel Quantizer

PS

sr[0]

...

done


fully-pipelined ROF architecture interleaves three ROF iterations across three computing fields. As shown in Fig. 37, the three computing fields process three tasks alternately. To obtain the tightest schedule, we pipelined the Level-Quantizer into two stages, LQ1 and LQ2, so that the loop (computing field, Level-Quantizer, Shift Register) has three pipeline stages for the highest degree of parallelism. LQ1 is the FA/HA tree and LQ2 is the carry generator.

Fig. 34. The data storage of 2-D RMF. (One snapshot legible in the figure: addresses 0000 to 1000 hold x(3,3), x(2,3), y(1,3), x(3,1), y(2,1), y(1,1), x(3,2), x(2,2), y(1,2).)

Fig. 35. Block diagram of the 2-D RMF with 3-by-3 window.

Fig. 36. Reservation table of optimal pipeline for 2-D recursive median filter.

Fig. 37. The fully-pipelined ROF architecture.
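The interleaving idea can be illustrated with a small occupancy model. This is only a sketch under stated assumptions: the loop is treated as three generic stages (numbered 0 to 2, without asserting which hardware unit sits in which stage), each iteration makes repeated passes around the loop, and a new pass of an iteration can start only when its previous pass has left the last stage. The function name `loop_occupancy` is hypothetical.

```python
def loop_occupancy(num_cycles, num_stages=3):
    """Return table[t][s]: the iteration id in stage s at cycle t.

    Iterations enter stage 0 in round-robin order, so iteration k re-enters
    exactly when its previous pass leaves the last stage. `None` marks stages
    that are still empty while the loop fills.
    """
    table = []
    for t in range(num_cycles):
        row = []
        for s in range(num_stages):
            # The pass now in stage s entered stage 0 at cycle t - s.
            row.append((t - s) % num_stages if t >= s else None)
        table.append(row)
    return table


for cycle, row in enumerate(loop_occupancy(6)):
    print(cycle, row)
```

Once the loop has filled, every stage holds a different iteration in every cycle, which is why three interleaved iterations are needed to keep a three-stage loop fully busy.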

Since three iterations are processed simultaneously, a larger memory is required to hold the samples of the two additional iterations. Hence, we extended the DCRAM to an (*N* + 2*δ*)-word memory, where *N* is the window size of the ROF and *δ* is the number of samples updated in each iteration. The value of *δ* is 1 for a 1-D ROF and *n* for a 2-D *n*-by-*n* ROF. To access the right samples for each iteration, the signal "**cm**" is added to mask the unwanted samples during the copy operation. In each computing field, the unwanted samples are stored as all zeros, so they do not affect the rank-order results. Fig. 38 illustrates the modified computing cell for the fully-pipelined ROF, in which INV5 and INV6 are replaced with GATE1 and GATE2. When "cm[i]" is '0', the computing cell stores '0'; otherwise, the computing cell holds a copy of the selected sample from the data cell. Finally, we use "**cp**", "**w\_cf**", and "**r\_cf**" to selectively perform *copy*, *write*, or *read* on the computing fields. To program the fully-pipelined architecture efficiently, the instruction set is defined as shown in Fig. 39. The fields <*c\_cf*> of COPY, <*w\_cf*> of P\_WRITE, and <*r\_cf*> of P\_READ control "**cp**", "**w\_cf**", and "**r\_cf**", respectively. Given a 1-D non-recursive rank-order filter application with *N*=9 and *r*=3, the reservation table can be optimized as shown in Fig. 40.
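The effect of the "**cm**" mask can be checked with a word-level software model. The sketch below is not the circuit itself. It assumes non-negative samples, a rank counted from the largest sample, and a layout in which each iteration's window is *N* consecutive words of the (*N* + 2*δ*)-word memory. The sample values and all function names are hypothetical.

```python
def memory_words(n_window, delta):
    """Words needed when three iterations are in flight: N + 2*delta."""
    return n_window + 2 * delta


def masked_copy(data_field, cm):
    """Word-level model of COPY: cm[i] == 0 forces word i of the computing field to 0."""
    if len(data_field) != len(cm):
        raise ValueError("cm needs one bit per memory word")
    return [word if bit else 0 for word, bit in zip(data_field, cm)]


def rth_largest(words, r):
    """Reference rank selection: the r-th largest word, with r counted from 1."""
    return sorted(words, reverse=True)[r - 1]


def window_mask(total_words, start, n_window):
    """Assumed layout: a window is n_window consecutive words beginning at `start`."""
    return [1 if start <= i < start + n_window else 0 for i in range(total_words)]


N, DELTA, R = 9, 1, 3  # the 1-D example from the text
DATA_FIELD = [220, 200, 7, 45, 91, 33, 250, 18, 64, 5, 130]  # hypothetical samples

for k in range(3):  # three iterations in flight
    start = k * DELTA
    field = masked_copy(DATA_FIELD, window_mask(len(DATA_FIELD), start, N))
    window = DATA_FIELD[start:start + N]
    # Masked words are zero, so they sort below every valid sample.
    assert rth_largest(field, R) == rth_largest(window, R)
    print(k, field, rth_largest(field, R))
```

Because the masked words are zero, they occupy the lowest ranks, so the *r*-th largest word of the computing field equals the *r*-th largest sample of the window whenever *r* does not exceed *N*.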

Fig. 38. A modified circuit of computing cell for fully-pipelined ROF.

Fig. 39. The format of the extended instruction set for the fully-pipelined ROF architecture.

| Sub instruction | Bits | Instruction | Encoding (bit positions) |
|---|---|---|---|
| 1 | 41-26 | SET <*rank value*> | 000 (41-39), 111111111 (38-30), rank value (29-26) |
| 1 | 41-26 | LOAD <*address*> | 001 (41-39), 111111111 (38-30), address (29-26) |
| 1 | 41-26 | COPY <*c\_cf*> <*cp\_mask*> | 010 (41-39), c\_cf (38-37), cp\_mask (36-26) |
| 1 | 41-26 | SI1\_NULL | 111 (41-39), 1111111111111 (38-26) |
| 2 | 25-24 | DONE | d0 (25-24) |
| 2 | 25-24 | SI2\_NULL | 11 (25-24) |
| 3 | 23-12 | P\_WRITE <*w\_cf*> <*mask*> | 00 (23-22), w\_cf (21-20), mask (19-12) |
| 3 | 23-12 | SI3\_NULL | 11 (23-22), 1111111111 (21-12) |
| 4 | 11-0 | P\_READ <*r\_cf*> <*mask*> | 00 (11-10), r\_cf (9-8), mask (7-0) |
| 4 | 11-0 | SI4\_NULL | 11 (11-10), 1111111111 (9-0) |

Fig. 40. Reservation table of the 1-D non-recursive ROF for fully-pipelined ROF architecture. (Resources: RR, DF of DCRAM, CF1 to CF3 of DCRAM, RMRS, LQ1, LQ2, WMRS, PS, Shift Register, OUTR; cycle times 1 to 38; the first through sixth iterations are marked.)

Fig. 41. Simulation results of a 2-D ROF application. (a) The noisy "Lena" image corrupted by 8% of impulsive noise. (b) The "Lena" image processed by the 3 × 3 4th-order filtering. (c) The "Lena" image processed by the 3 × 3 5th-order filtering. (d) The "Lena" image processed by the 3 × 3 6th-order filtering.
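To make the long-instruction-word format of Fig. 39 concrete, the following sketch packs four sub-instructions into one 42-bit word. The field positions are taken from the bit labels that remain legible in Fig. 39 and should be treated as assumptions. The DONE code is not clearly legible there, so it is left out. All function and constant names are hypothetical.

```python
def _field(value, width):
    """Validate that `value` fits in an unsigned field of `width` bits."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return value


# Sub instruction 1 occupies bits 41-26 (16 bits).
def si1_set(rank):
    return (0b000 << 13) | (0x1FF << 4) | _field(rank, 4)


def si1_load(address):
    return (0b001 << 13) | (0x1FF << 4) | _field(address, 4)


def si1_copy(c_cf, cp_mask):
    return (0b010 << 13) | (_field(c_cf, 2) << 11) | _field(cp_mask, 11)


SI1_NULL = 0xFFFF  # 111 followed by thirteen 1s
SI2_NULL = 0b11    # bits 25-24
SI3_NULL = 0xFFF   # bits 23-12
SI4_NULL = 0xFFF   # bits 11-0


# Sub instructions 3 and 4 share one layout: 00, 2-bit field select, 8-bit mask.
def si3_p_write(w_cf, mask):
    return (_field(w_cf, 2) << 8) | _field(mask, 8)


def si4_p_read(r_cf, mask):
    return (_field(r_cf, 2) << 8) | _field(mask, 8)


def pack(si1=SI1_NULL, si2=SI2_NULL, si3=SI3_NULL, si4=SI4_NULL):
    """Assemble the 42-bit instruction word from its four sub-instructions."""
    return (_field(si1, 16) << 26) | (_field(si2, 2) << 24) | (_field(si3, 12) << 12) | _field(si4, 12)


word = pack(si1=si1_set(3), si4=si4_p_read(1, 0xFF))
print(f"{word:042b}")
```

Note that the 11-bit `cp_mask` field matches the (*N* + 2*δ*) = 11 memory words of the 1-D example with *N*=9, which is consistent with one mask bit per word.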

#### **3.8 Chip design and simulation results**


To exercise the proposed architecture, we implemented the ROF architecture shown in Fig. 16 using TSMC 0.18 *μm* 1P6M technology. First, we verified the hardware with a cycle-accurate behavioral VHDL model. The simulations confirm that the implementations of the above examples are valid. Fig. 41 and Fig. 42 show the results of the VHDL simulations for the 2-D ROF and RMF, respectively. Fig. 41(a) is a noisy "Lena" image corrupted by 8% impulsive noise. After processing by 2-D ROFs with *r*=4, 5, and 6, the denoised results are shown in Fig. 41(b), (c), and (d), respectively. Fig. 42(a) is a noisy "Lena" image corrupted by 9% impulsive noise. After processing by the 2-D 3 × 3 RMF, the denoised result is shown in Fig. 42(b). The results are the same as those of the Matlab simulation.

Fig. 42. Simulation results of a 2-D RMF application. (a) The noisy "Lena" image corrupted by 9% of impulsive noise. (b) The "Lena" image processed by the 3 × 3 RMF.
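A software reference model is useful for checking such simulation results. The sketch below is a plain-Python stand-in, not the authors' VHDL or Matlab code. It assumes images stored as lists of rows, a rank counted from the largest sample, raster-scan processing, and border pixels that are simply copied. The function names are hypothetical.

```python
def _window(img, i, j):
    """Collect the nine samples of the 3x3 window centred at (i, j)."""
    return [img[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)]


def rof_3x3(image, r):
    """Non-recursive ROF: each output is the r-th largest input sample in the window."""
    out = [row[:] for row in image]
    for i in range(1, len(image) - 1):
        for j in range(1, len(image[0]) - 1):
            out[i][j] = sorted(_window(image, i, j), reverse=True)[r - 1]
    return out


def rmf_3x3(image):
    """Recursive median filter: already computed outputs replace inputs in the window."""
    out = [row[:] for row in image]
    for i in range(1, len(out) - 1):
        for j in range(1, len(out[0]) - 1):
            # Reading from `out` mixes previous outputs (row above, left neighbour)
            # with untouched inputs (centre, right neighbour, row below).
            out[i][j] = sorted(_window(out, i, j), reverse=True)[4]
    return out


noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 0, 10],
    [10, 10, 10, 10],
]
print(rof_3x3(noisy, 5))
print(rmf_3x3(noisy))
```

With *r*=5 the 3 × 3 ROF is the median filter, while *r*=4 and *r*=6 bias the output toward brighter or darker samples, respectively.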

After verifying the proposed ROF processor with the cycle-accurate behavioral model, we implemented the processor using a full-custom design methodology. Because memory is highly regular, the proposed memory-based architecture saves routing area compared with logic-based solutions. Fig. 43(a) shows the overall chip layout; the dashed region is the core. The die size is 1063.57 × 1069.21 *μm*<sup>2</sup> and the pin count is 40. Fig. 43(b) shows the detailed layout of the ROF core. The core size is 356.1 × 427.7 *μm*<sup>2</sup> and the total transistor count is 3942. Fig. 43(c) shows the floorplan and placement. The physical implementation has been verified by post-layout simulation. Table 8 shows the results of the timing analysis, obtained from NanoSim. As seen in the table, the critical path is path 3, and the maximum clock rate is 290 MHz at 3.3 V and 256 MHz at 1.8 V. According to the post-layout simulation, the power dissipation of the proposed ROF is quite low: for the 1-D/2-D ROFs, the average power consumption of the core is 29 mW at 290 MHz or 7 mW at 256 MHz. This performance satisfies the real-time requirements of video applications in the QCIF, CIF, VGA, and SVGA formats. The chip is being submitted to the Chip Implementation Center (CIC), Taiwan, for fabrication.
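The real-time claim can be put in perspective with a rough cycle budget. The following sketch only computes how many clock cycles are available per pixel and the average energy per clock cycle. The frame rate of 30 frames per second is an assumption, the format resolutions are the standard ones, and the number of cycles the processor actually needs per output is not stated here, so no pass or fail verdict is derived.

```python
FORMATS = {  # standard resolutions (width, height)
    "QCIF": (176, 144),
    "CIF": (352, 288),
    "VGA": (640, 480),
    "SVGA": (800, 600),
}


def cycles_per_pixel(clock_hz, width, height, fps):
    """Clock cycles available for each pixel at the given frame rate."""
    return clock_hz / (width * height * fps)


def energy_per_cycle_pj(power_mw, clock_mhz):
    """Average energy per clock cycle in picojoules."""
    return (power_mw * 1e-3) / (clock_mhz * 1e6) * 1e12


ASSUMED_FPS = 30  # assumption, not taken from the text
for name, (w, h) in FORMATS.items():
    budget = cycles_per_pixel(256e6, w, h, ASSUMED_FPS)
    print(f"{name}: {budget:.1f} cycles per pixel at 256 MHz")

print(f"3.3 V: {energy_per_cycle_pj(29, 290):.1f} pJ per cycle")
print(f"1.8 V: {energy_per_cycle_pj(7, 256):.1f} pJ per cycle")
```

Under these assumptions the largest format, SVGA, still leaves more than 17 cycles per pixel at 256 MHz.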


Table 8. Timing analysis of the proposed ROF processor.

| Path | Description | 1.8 V supply | 3.3 V supply |
|---|---|---|---|
| 1 | From the output of RR to the input of the shift register. | 1.2 ns | 0.78 ns |
| 2 | From the output of RMR, thru DCRAM to the input of the PS. | 1.8 ns | 1.1 ns |
| 3 | From the output of RMR, thru DCRAM and the Level-Quantizer, to the input of the shift register. | 3.9 ns | 3.44 ns |
| 4 | From the shift register, thru the inverter connected to "c\_in", to the SRAM cell of the computing field. | 3.02 ns | 1.96 ns |
| 5 | From "**d\_in**" to the SRAM cell of the data field. | 3.05 ns | 1.85 ns |
| 6 | From the SRAM cell of the data field to the SRAM cell of the computing field. | 1.24 ns | 1.09 ns |
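The quoted clock rates follow directly from the critical-path delays in Table 8, since the maximum clock frequency is the reciprocal of the longest path delay. The short check below is a sketch with hypothetical names that finds the critical path for each supply voltage and converts its delay to a frequency.

```python
# Table 8 delays in ns, keyed by path number: (1.8 V supply, 3.3 V supply).
PATH_DELAYS_NS = {
    1: (1.2, 0.78),
    2: (1.8, 1.1),
    3: (3.9, 3.44),
    4: (3.02, 1.96),
    5: (3.05, 1.85),
    6: (1.24, 1.09),
}
SUPPLY_COLUMN = {"1.8V": 0, "3.3V": 1}


def critical_path(supply):
    """Return the number of the path with the largest delay for this supply."""
    col = SUPPLY_COLUMN[supply]
    return max(PATH_DELAYS_NS, key=lambda path: PATH_DELAYS_NS[path][col])


def max_clock_mhz(supply):
    """Maximum clock rate in MHz: 1 / (critical delay), with the delay in ns."""
    col = SUPPLY_COLUMN[supply]
    return 1e3 / PATH_DELAYS_NS[critical_path(supply)][col]


for supply in ("1.8V", "3.3V"):
    print(supply, "path", critical_path(supply), f"{max_clock_mhz(supply):.1f} MHz")
```

Path 3 is the slowest at both supply voltages, and its delays of 3.9 ns and 3.44 ns correspond to roughly 256 MHz and 290 MHz.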

Fig. 43. The result of chip design using TSMC 0.18 *μm* 1P6M technology. (a) The chip layout of the proposed rank-order filter. (b) The core of the proposed ROF processor. (c) The floorplan and placement of (b). (1: Instruction decoder; 2: Reset circuit; 3: WMR; 4: RMR; 5: RR; 6: DCRAM; 7: PS; 8: Level Quantizer; 9: Shift Register; 10: OUTR.)

Furthermore, we built a prototype composed of an FPGA board and DCRAM chips to validate the proposed architecture before fabricating the custom-designed chip. The FPGA board is made by Altera, and the FPGA is an APEX EP20K. The FPGA board can operate at up to 60 MHz. The DCRAM chip was designed in full-custom CMOS technology. Fig. 44(a) shows the micrograph of the DCRAM chip, which implements a subword part of the DCRAM, and Fig. 44(b) shows the chip layout. The fabricated DCRAM chip was laid out with a full-custom design flow using TSMC 0.35 *μm* 2P4M technology. As shown in Fig. 45, with a supply voltage of 3.3 V, the DCRAM chip can operate at 25 MHz. Finally, we integrated the FPGA board and the DCRAM chips into the prototype shown in Fig. 46. The prototype was validated with the ROF algorithms mentioned above.

Fig. 44. (a) The micrograph of DCRAM chip. (b) The layout of the DCRAM chip.

Fig. 45. Measured waveform of the DCRAM chip.

Fig. 46. The system prototype of rank-order filtering processor.

#### **4. Conclusion**

To further extend the battery life of capsule endoscopes, this paper applies a series of statistical analyses to systematically study the color sensitivity of GI images, from the RGB color space to the 2-D DCT spatial-frequency domain. Based on the analysis results, an improved ultra-low-power subsample-based GICam image compression processor is proposed for capsule endoscopes and swallowable imaging capsules. We use the subsample technique to reduce the memory requirements of the G1, G2, and B components according to the analysis of the DC/AC coefficients in the 2-D DCT domain. As shown in the simulation results, the proposed image compressor saves 38.5% more power than the previous GICam compressor [11] and reduces the image size by at least 75% for each sampled gastrointestinal image; video sequences are therefore also reduced in size by at least 75%. Furthermore, according to the comparison results, the proposed image compressor has a smaller area and a lower operating frequency, and it can fit into existing designs.

Furthermore, we have proposed an architecture based on a maskable memory for rank-order filtering. This paper is the first in the literature to use a maskable memory to realize ROF. Driven by the generic rank-order filtering algorithm, the memory-based architecture offers a high degree of flexibility and regularity at low cost and high performance. With the LIW instruction set, the architecture can be applied to arbitrary ranks and a variety of ROF applications, including recursive and non-recursive algorithms. As shown in the implementation results, the core of the processor has high performance and low cost. The post-layout simulation shows that the power consumption can be as low as 7 mW at 256 MHz. The processing speed meets the real-time requirements of image applications in the QCIF, CIF, VGA, and SVGA formats.

#### **5. References**

[1] G. Iddan, G. Meron, A. Glukhovsky, and P. Swain, "Wireless capsule endoscopy," Nature, vol. 405, pp. 417-418, May 25, 2000.

[2] Shinya Itoh, Shoji Kawahito, Tomoyuki Akahori, and Susumu Terakawa, "Design and implementation of a one-chip wireless camera device for a capsule endoscope," SPIE, vol. 5677, pp. 108-118, 2005.

[3] F. Gong, P. Swain, and T. Mills, "Wireless endoscopy," Gastrointestinal Endoscopy, vol. 51, no. 6, pp. 725-729, June 2000.

[4] H.J. Park, H.W. Nam, B.S. Song, J.L. Choi, H.C. Choi, J.C. Park, M.N. Kim, J.T. Lee, and J.H. Cho, "Design of bi-directional and multi-channel miniaturized telemetry module for wireless endoscopy," in Proc. of the 2nd Annual Intl IEEE-EMBS Special Topic Conference on Microtechnologies in Medicine and Biology, May 2-4, 2002, Madison, USA, pp. 273-276.

[5] http://www.givenimaging.com/Cultures/en-US/given/english

[6] http://www.rfsystemlab.com/

[7] M. Sendoh, K. Ishiyama, and K.-I. Arai, "Fabrication of Magnetic Actuator for Use in a Capsule Endoscope," IEEE Trans. on Magnetics, vol. 39, no. 5, pp. 3232-3234, September 2003.

[8] Louis Phee, Dino Accoto, Arianna Menciassi, Cesare Stefanini, Maria Chiara Carrozza, and Paolo Dario, "Analysis and Development of Locomotion Devices for the Gastrointestinal Tract," IEEE Trans. on Biomedical Engineering, vol. 49, no. 6, June 2002.

[9] Shaou-Gang Miaou, Shih-Tse Chen, and Fu-Sheng Ke, "Capsule Endoscopy Image Coding Using Wavelet-Based Adaptive Vector Quantization without Codebook Training," International Conference on Information Technology and Applications (ICITA), vol. 2, pp. 634-637, July 2005.

[10] Shaou-Gang Miaou, Shih-Tse Chen, and Chih-Hong Hsiao, "A wavelet-based compression method with fast quality controlling capability for long sequence of capsule endoscopy images," IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing (NSIP), pp. 34-34, 2005.

[11] M. Lin, L. Dung, and P. Weng, "An Ultra Low Power Image Compressor for capsule Endoscope," BioMedical Engineering Online 2006, vol. 5:14.

[12] X. Xie, G. Li, X. Chen, X. Li, and Z. Wang, "A Low Power Digital IC Design Inside the Wireless Endoscopic Capsule," IEEE Journal of Solid-State Circuits, vol. 41, no. 11, pp. 2390-2400, November 2006.

[13] K. Wahid, S-B. Ko, and D. Teng, "Efficient Hardware Implementation of an Image Compressor for Wireless Capsule Endoscopy Applications," Proceedings of the IEEE International Joint Conference on Neural Networks, pp. 2762-2766, 2008.

[14] Xinkai Chen, Xiaoyu Zhang, Linwei Zhang, Xiaowen Li, Nan Qi, Hanjun Jiang, and Zhihua Wang, "A Wireless Capsule Endoscope System With Low-Power Controlling and Processing ASIC," IEEE Trans. on Biomedical Circuits and Systems, vol. 3, no. 1, pp. 11-22, Feb. 2009.

[15] H.A. Peterson, H. Peng, J.H. Morgan, and W.B. Pennebaker, "Quantization of color image components in the DCT domain," SPIE, Human Vision, Visual Processing, and Digital Display II, vol. 1453, 1991.


[16] Dr. R.W.G. Hunt, "Measuring Colour," Fountain Press, 1998.

[17] Henry R. Kang, "Color Technology For Electronic Imaging Devices," SPIE Optical Engineering Press, 1997.

[18] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. on Inform. Theory, vol. 23, pp. 337-343, May 1977.

[19] Vasudev Bhaskaran and Konstantinos Konstantinides, "Image and Video Compression Standards: Algorithms and Architectures, Second edition," Kluwer Academic Publishers.

[20] Meng-Chun Lin, Lan-Rong Dung, and Ping-Kuo Weng, "An Improved Ultra-Low-Power Subsample based Image Compressor for Capsule Endoscope," Medical Informatics Symposium in Taiwan (MIST), 2006.

[21] Gi-Shih Lien, Chih-Wen Liu, Ming-Tsung Teng, and Yan-Min Huang, "Integration of Two Optical Image Modules and Locomotion Functions in Capsule Endoscope Applications," The 13th IEEE International Symposium on Consumer Electronics, pp. 828-829, 2009.

[22] Mao Li, Chao Hu, Shuang Song, Houde Dai, and Max Q.-H. Meng, "Detection of Weak Magnetic Signal for Magnetic Localization and Orientation in Capsule Endoscope," Proceedings of the IEEE International Conference on Automation and Logistics, Shenyang, China, pp. 900-905, August 2009.

[23] Chao Hu, Max Q.-H. Meng, Li Liu, Yinzi Pan, and Zhiyong Liu, "Image Representation and Compression for Capsule Endoscope Robot," Proceedings of the 2009 IEEE International Conference on Information and Automation, pp. 506-511, June 2009.

[24] Jing Wu and Ye Li, "Low-complexity Video Compression for Capsule Endoscope Based on Compressed Sensing Theory," 31st Annual International Conference of the IEEE EMBS, Minneapolis, pp. 3727-3730, Sep. 2009.

[25] Jinlong Hou, Yongxin Zhu, Le Zhang, Yuzhuo Fu, Feng Zhao, Li Yang, and Guoguang Rong, "Design and Implementation of a High Resolution Localization System for In-vivo Capsule Endoscopy," 2009 Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, pp. 209-214, 2009.

[26] Chang Cheng, Zhiyong Liu, Chao Hu, and Max Q.-H. Meng, "A Novel Wireless Capsule Endoscope With JPEG Compression Engine," Proceedings of the 2010 IEEE International Conference on Automation and Logistics, pp. 553-558, Aug. 2010.

[27] D.H. Kang, J.H. Choi, Y.H. Lee, and C. Lee, "Applications of a DPCM system with median predictors for image coding," IEEE Trans. Consumer Electronics, vol. 38, no. 3, pp. 429-435, Aug. 1992.

[28] H. Rantanen, M. Karlsson, P. Pohjala, and S. Kalli, "Color video signal processing with median filters," IEEE Trans. Consumer Electron., vol. 38, no. 3, pp. 157-161, Apr. 1992.

[29] T. Viero, K. Oistamo, and Y. Neuvo, "Three-dimensional median-related filters for color image sequence filtering," IEEE Trans. Circuits Syst. Video Technol., vol. 4, no. 2, pp. 129-142, Apr. 1994.

[30] X. Song, L. Yin, and Y. Neuvo, "Image sequence coding using adaptive weighted median prediction," Signal Processing VI, EUSIPCO-92, Brussels, pp. 1307-1310, Aug. 1992.

[31] K. Oistamo and Y. Neuvo, "A motion insensitive method for scan rate conversion and cross error cancellation," IEEE Trans. Consumer Electron., vol. 37, pp. 296-302, Aug. 1991.

[32] P. Zamperoni, "Variation on the rank-order filtering theme for grey-tone and binary image enhancement," IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1401-1404, 1989.

[33] C.T. Chen and L.G. Chen, "A self-adjusting weighted median filter for removing impulse noise in images," Int. Conf. Image Processing, pp. 16-19, Sept. 1996.

[34] D. Yang and C. Chen, "Data dependence analysis and bit-level systolic arrays of the median filter," IEEE Trans. Circuits and Systems for Video Technology, vol. 8, no. 8, pp. 1015-1024, Dec. 1998.

[35] T. Ikenaga and T. Ogura, "CAM2: A highly-parallel two-dimensional cellular automaton architecture," IEEE Trans. Computers, vol. 47, no. 7, pp. 788-801, July 1998.

[36] L. Breveglieri and V. Piuri, "Digital median filter," Journal of VLSI Signal Processing, vol. 31, pp. 191-206, 2002.

[37] C. Chakrabarti, "Sorting network based architectures for median filters," IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, pp. 723-727, Nov. 1993.

[38] C. Chakrabarti, "High sample rate architectures for median filters," IEEE Trans. Signal Processing, vol. 42, no. 3, pp. 707-712, March 1994.

[39] L. Chang and J. Lin, "Bit-level systolic array for median filter," IEEE Trans. Signal Processing, vol. 40, no. 8, pp. 2079-2083, Aug. 1992.

[40] C. Chen, L. Chen, and J. Hsiao, "VLSI implementation of a selective median filter," IEEE Trans. Consumer Electronics, vol. 42, no. 1, pp. 33-42, Feb. 1996.

[41] M.R. Hakami, P.J. Warter, and C.C. Boncelet, Jr., "A new VLSI architecture suitable for multidimensional order statistic filtering," IEEE Trans. Signal Processing, vol. 42, pp. 991-993, April 1994.

[42] Hatirnaz, F.K. Gurkaynak, and Y. Leblebici, "A compact modular architecture for the realization of high-speed binary sorting engines based on rank ordering," IEEE Inter. Symp. Circuits and Syst., Geneva, Switzerland, pp. 685-688, May 2000.

[43] A.A. Hiasat, M.M. Al-Ibrahim, and K.M. Gharaibeh, "Design and implementation of a new efficient median filtering algorithm," IEE Proc. Image Signal Processing, vol. 146, no. 5, pp. 273-278, Oct. 1999.

[44] R.T. Hoctor and S.A. Kassam, "An algorithm and a pipelined architecture for order-statistic determination and L-filtering," IEEE Trans. Circuits and Systems, vol. 36, no. 3, pp. 344-352, March 1989.

[45] M. Karaman, L. Onural, and A. Atalar, "Design and implementation of a general-purpose median filter unit in CMOS VLSI," IEEE Journal of Solid State Circuits, vol. 25, no. 2, pp. 505-513, April 1990.

[46] C. Lee, P. Hsieh, and J. Tsai, "High-speed median filter designs using shiftable content-addressable memory," IEEE Trans. Circuits and Systems for Video Technology, vol. 4, pp. 544-549, Dec. 1994.

[47] C.L. Lee and C. Jen, "Bit-sliced median filter design based on majority gate," IEE Proc.-G Circuits, Devices and Systems, vol. 139, no. 1, pp. 63-71, Feb. 1992.

[48] L.E. Lucke and K.K. Parhi, "Parallel processing architecture for rank order and stack filter," IEEE Trans. Signal Processing, vol. 42, no. 5, pp. 1178-1189, May 1994.

[49] K. Oflazer, "Design and implementation of a single-chip 1-D median filter," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, no. 4, pp. 1164-1168, Oct. 1983.




[50] D.S. Richards, "VLSI median filters," IEEE Trans. Acoust., Speech, and Signal Processing, vol. 38, pp. 145-153, January 1990.

[51] G.G. Boncelet, Jr., "Recursive algorithm and VLSI implementation for median filtering," IEEE Int. Sym. on Circuits and Systems, pp. 1745-1747, June 1988.

[52] C. Henning and T.G. Noll, "Architecture and implementation of a bitserial sorter for weighted median filtering," IEEE Custom Integrated Circuits Conference, pp. 189-192, May 1998.

[53] C.C. Lin and C.J. Kuo, "Fast response 2-D rank order filter by using max-min sorting network," Int. Conf. Image Processing, pp. 403-406, Sept. 1996.

[54] M. Karaman, L. Onural, and A. Atalar, "Design and implementation of a general purpose VLSI median filter unit and its application," IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 2548-2551, May 1989.

[55] J. Hwang and J. Jong, "Systolic architecture for 2-D rank order filtering," Int. Conf. Application-Specific Array Processors, pp. 90-99, Sept. 1990.

[56] I. Pitas, "Fast algorithms for running ordering and max/min calculation," IEEE Trans. Circuits and Systems, vol. 36, no. 6, pp. 795-804, June 1989.

[57] O. Vainio, Y. Neuvo, and S.E. Butner, "A signal processor for median-based algorithm," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 37, no. 9, pp. 1406-1414, Sept. 1989.

[58] H. Yu, J. Lee, and J. Cho, "A fast VLSI implementation of sorting algorithm for standard median filters," IEEE Int. ASIC/SOC Conference, pp. 387-390, September 1999.

[59] J.P. Fitch, "Software and VLSI algorithm for generalized ranked order filtering," IEEE Trans. Circuits and Systems, vol. CAS-34, no. 5, pp. 553-559, May 1987.

[60] M. Karaman and L. Onural, "New radix-2-based algorithm for fast median filtering," Electron. Lett., vol. 25, pp. 723-724, May 1989.

[61] J.P. Fitch, "Software and VLSI Algorithms for Generalized Ranked Order Filtering," IEEE Trans. Circuits and Syst., vol. CAS-34, no. 5, pp. 553-559, May 1987.

[62] B.K. Kar and D.K. Pradhan, "A new algorithm for order statistic and sorting," IEEE Trans. Signal Processing, vol. 41, no. 8, pp. 2688-2694, Aug. 1993.

[63] V.A. Pedroni, "Compact hamming-comparator-based rank order filter for digital VLSI and FPGA implementations," IEEE Int. Sym. on Circuits and Systems, vol. 2, pp. 585-588, May 2004.

[64] Shobha Singh, Shamsi Azmi, Nutan Agrawal, Penaka Phani, and Ansuman Rout, "Architecture and design of a high performance SRAM for SOC design," IEEE Int. Sym. on VLSI Design, pp. 447-451, Jan 2002.
