Based on the Chien search and Forney algorithm described above, the architecture of the CSEE block for the example RS (255, 239) code is illustrated in Fig. 2. It consists of several unit cells (shown in Fig. 2(a)). Both of the sub-blocks that carry out the Chien search and the Forney algorithm are built from these basic cells. At the beginning, *λi* and *ωi* (represented by *Ui* in the figure), the coefficients of **Λ**(*x*) and **Ω**(*x*), are loaded into the basic cells in parallel (enable=1). Then, during the next 255 cycles, the basic cells multiply iteratively. Fig. 2(b) shows the overall architecture of the CSEE block. Once a zero is detected in the Chien search, the corresponding error magnitude is computed by executing the Forney algorithm described above.

**Figure 2.** (a) The diagram of the CSEE cell. (b) The block diagram of the CSEE block.

The overall architecture of RS decoding is summarized in Fig. 3.

**Figure 3.** The overall architecture of the RS decoder.

As mentioned in the previous paragraph, since KES is the dominant step in the whole RS decoding process, Sections 3 and 4 will focus on algorithm and architecture optimization of the KES block.

**3. Low-complexity high-speed RS decoders for short-distance networks**

For short-distance optical transmission, such as 10GBase-LR, the noise introduced by the transmission distance is quite limited, so the requirement on coding gain is not as strict as for long-distance networks.

**3.1. rDCME-based RS decoder**

In the traditional ME algorithm, the inherent degree computation and the systolic architecture lead to large area and power consumption (see [6] and [7]), which is not suitable for the application discussed here. To remove the unnecessary degree computation, the DCME algorithm was introduced in [9]. By generating internal switch and shift signals, the DCME algorithm achieves the same function as the ME algorithm without any degree computation.

#### **DCME algorithm**


```
 8: if (sw = 1) then switch R_i and Q_i;
    ...
18: Output: L_i as the error locator polynomial and R_i as the error value polynomial
```

It should be pointed out that the initialization of DCME and ME differs, owing to the design of the FSM introduced below.

Fig. 4 shows the FSM for generating control signals. In each iteration there are two possible states: S0 and S1. S0 represents the case in which both *ai* and *bi* are nonzero; otherwise the state of the FSM is S1. The combination of the current and previous states determines the control signals in the current iteration.

When the leading coefficients of *Ri* and *Qi* are both nonzero, polynomial computation can be carried out (*pc*=1); otherwise a shift operation is performed (*pc*=0) to remove the leading coefficient (which in this case must be zero; the details are given in the next paragraph). In each iteration, at most one shift operation is executed. Shifting continues until both *ai* and *bi* are nonzero, which means the degrees of *Ri* and *Qi* are equal again; at that point the KES block resumes polynomial computation. Hence, in each iteration the algorithm performs either a shift operation or a polynomial computation, but never both.
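In software terms, one iteration of this shift-or-compute decision can be sketched as follows (an illustrative sketch, not the paper's RTL; it assumes GF(2^8) with the common primitive polynomial 0x11d, and coefficient lists ordered from the leading term down):

```python
def gf_mul(x, y, poly=0x11d):
    """Multiply two elements of GF(2^8) by shift-and-reduce."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        if x & 0x100:
            x ^= poly
        y >>= 1
    return r

def kes_step(R, Q):
    """One rDCME-style iteration on coefficient lists (leading term first).

    Returns (R', Q', pc): pc=1 means polynomial computation was performed,
    pc=0 means only a shift of R (Q is never shifted in rDCME).
    """
    a, b = R[0], Q[0]
    pc = int(a != 0 and b != 0)
    if not pc:
        # shift: drop R's (real) leading zero so the degrees line up again
        return R[1:] + [0], Q, pc
    # polynomial computation R' = b*R + a*Q; the new leading term
    # b*a + a*b cancels in GF(2^8), producing the "false" leading zero
    R_new = [gf_mul(b, r) ^ gf_mul(a, q) for r, q in zip(R, Q)]
    return R_new, Q, pc
```

Note that after a *pc*=1 step the first entry of R' is always zero; eliminating that "false" zero is discussed in the next paragraph.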

It should be pointed out that after every polynomial computation, the original leading coefficient of *Ri*+1 must be zero, owing to the arithmetic of *Ri*+1 = *biRi* + *aiQi*. Unlike the leading coefficients discussed in the previous paragraph, this leading zero is a "false" leading coefficient that would cause logic errors in the next iteration. (For example, if after polynomial computation *Ri*+1 is represented by 0, 0, 0, α², α³, the "false" leading coefficient is the first zero, and the true representation of *Ri*+1 should be 0, 0, α², α³.) Therefore, in every polynomial computation the designed rDCME KES block automatically eliminates this leading zero with the aid of the "*start*" signal in the hardware design (Fig. 5): the coefficients that arrive simultaneously with the "*start*" signal are selected as the leading coefficients. Once a polynomial computation finishes, delaying *Qi*+1, *Ui*+1 and the *start* signal by one more clock cycle eliminates the "false" leading zero, and deg *Ri*+1 is one less than deg *Ri* or equal to it (the latter occurs when the previous iteration's actual input is *xRi*, brought by the initial input *R*0=*xS*(*x*)). Then *ai* and *bi* represent the true leading coefficients.
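The "false" leading-zero elimination amounts to dropping exactly one coefficient (a minimal illustrative sketch; the function name is ours, and in hardware the same effect is achieved by delaying the *start* signal):

```python
def drop_false_leading_zero(r_next):
    """After R_{i+1} = b_i*R_i + a_i*Q_i the first coefficient is zero by
    construction; drop exactly that one, keeping any real leading zeros."""
    assert r_next[0] == 0, "polynomial computation always cancels the leading term"
    return r_next[1:]
```

For the example in the text, coefficients 0, 0, 0, α², α³ become 0, 0, α², α³: only the first zero is removed, and any real leading zeros are left for the shift logic to handle.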

**Figure 4.** The FSM for generating control signals.

If the previous state and the current state are both S0, the polynomial computation is executed in two successive iterations, so *pc*=1. Because the previous state was S0, after the previous polynomial computation and the degree reduction, deg *Ri* is either one smaller than deg *Ri*-1 or equal to it. Hence deg *Ri* is either equal to deg *Qi* or one smaller. These two conditions occur alternately, and the switch signal alternates accordingly (*sw*=~*sw*).

If the previous state is S0 and the current state is S1, the KES block performs a shift operation to eliminate the leading zero (*pc*=0) in the current iteration, because S1 indicates that a leading coefficient is zero. *sw* is also 0, because the switch operation is always carried out together with a polynomial computation.

If the previous state and the current state are both S1, the two successive iterations are both shift operations. As in the previous case, *sw* and *pc* are both set to 0.

If the previous state is S1 and the current state is S0, the polynomial computation is executed (*pc*=1) in the current iteration. Since *Ri* was shifted in the previous iteration (*Qi* is never shifted, owing to its role in the polynomial computation and to the initial conditions of rDCME), the actual degree of *Ri* must be smaller than that of *Qi*, so *sw* is set to 1.
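The four cases can be summarized as a small decision function (an illustrative restatement of Fig. 4; the string encoding of states and the function name are ours, and *sw_prev* is the switch signal of the previous iteration):

```python
def control_signals(prev, cur, sw_prev):
    """Map (previous state, current state) to (pc, sw) per the FSM of Fig. 4."""
    if cur == "S1":             # a leading coefficient is zero: shift only
        return 0, 0             # covers both S0 -> S1 and S1 -> S1
    if prev == "S0":            # S0 -> S0: compute, and alternate the switch signal
        return 1, sw_prev ^ 1
    return 1, 1                 # S1 -> S0: deg R fell below deg Q, so switch
```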

After 2*t*=16 iterations, the rDCME KES block stops and outputs the error value polynomial *R*(*x*)=Ω(*x*) and the error locator polynomial *L*(*x*)=Λ(*x*).

**Figure 5.** The block diagram of KES.


Fig. 5 shows the detailed architecture of the rDCME algorithm. The KES block is designed with a single PE. It is well known that a recursive architecture usually cannot be pipelined because of data dependency, and that recursion is often the bottleneck for high speed. In the rDCME architecture, however, these disadvantages are avoided. An 11-stage (2*t*-5=11) shift register stores the results of the last iteration and feeds them back to the next one, avoiding dependency between successive iterations: at the end of each iteration, the leading coefficients of the five updated inputs (R, Q, L, U and *start*) are stored back into the leftmost registers of the shift register, ready to be updated in the next iteration. Because the whole iteration process of the KES block forms a closed loop, the in-time arrival of the leading coefficients avoids dependency between iterations and guarantees logical validity. Furthermore, because the preceding SC block takes *n* clock cycles to output one codeword, the PEs in the conventional systolic DCME architectures of [10] and [11] are idle for most of the processing time while occupying a large amount of chip area. A multi-stage pipeline can therefore be employed in the area-efficient recursive KES block with valid logic and only a small degradation of the data processing rate. Note that in Fig. 5 the multipliers are pipelined.


**Table 1.** Implementation results and comparisons

| Architect. | rDCME | pDCME in [10] | DCME in [11] | PrME in [7] |
|---|---|---|---|---|
| Tech. (*μ*m) | 0.18 | 0.13 | 0.25 | 0.13 |
| PE | 1 | 2t | 3t+2 | 1 |
| SC | 2900 | 2900 | 2900 | 2900 |
| KES | 11400 | 46200 | 21760 | 17000 |
| CSEE | 4100 | 4100 | 4100 | 4100 |
| Total gates | 18400 | 53200 | 28760 | 24000 |
| fmax (MHz) | 640 | 660 | 200 | 625 |
| Throughput (Gb/s) | 5.1 | 5.3 | 1.6 | 5 |

Table 1 presents performance comparisons between the rDCME RS decoder and other existing RS decoders. It can be observed that the rDCME decoder has very low hardware complexity and high throughput. Compared with the existing ME architectures, the total gate count of the rDCME architecture is reduced by at least 30.4%. Consequently, the hardware efficiency is improved by at least 1.84 times, which means that under the same technology conditions our design is much more area-efficient than other existing RS decoder designs for multi-Gb/s optical communication systems.

## **3.2. PI-iBM-based RS decoder**

Besides the ME algorithm, the BM algorithm is the other main decoding approach for RS codes. An important and inevitable disadvantage of the traditional iBM/RiBM algorithms is the high cost in area or iteration time for computing the error value polynomial **Ω**(*x*). In the iBM architecture stated in [12], one third of the total iteration time or half of the hardware complexity is devoted to computing **Ω**(*x*); in the RiBM architecture stated in [5], one third of the processing elements (PEs) are utilized to calculate and store **Ω**(*x*). Therefore, the calculation of **Ω**(*x*) impedes further performance improvement of current BM architectures.

The PI-iBM algorithm employs the simplified Forney algorithm to compute the error values. The simplified Forney algorithm, presented in [13] and [14], replaces Ω(*x*) with the scratch polynomial B(*x*) as follows:

$$Y_i = \left.\frac{\lambda_0\,\delta}{x\,B(x)\,\Lambda'(x)}\right|_{x=\alpha^{-i}}$$

In each iteration, the scratch polynomial **B**(*x*), the discrepancy *δ*, the error locator polynomial **Λ**(*x*) and its coefficient *λ*0 are updated simultaneously. After the iterations are completed, the KES block outputs them to the CSEE block for calculating the error values *Yi*. The computation of **Ω**(*x*) is thus completely eliminated, which allows the KES block to save a large amount of extra computation circuitry and iteration time.
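The Λ′(*x*) in the error-value formula is the formal derivative, which in characteristic 2 reduces to keeping only the odd-degree terms; a minimal sketch over coefficient lists (lowest degree first; the function name is ours):

```python
def gf_formal_derivative(coeffs):
    """Formal derivative over GF(2^m), coefficients listed lowest degree first.

    Coefficient j moves to degree j-1, and terms with even j vanish because
    j mod 2 = 0 in characteristic 2.
    """
    d = list(coeffs[1:])
    for j in range(1, len(d), 2):  # entries that came from even exponents
        d[j] = 0
    return d
```

For Λ(x) = λ0 + λ1·x + λ2·x² + λ3·x³ this yields λ1 + λ3·x², as expected.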

Furthermore, in order to reduce hardware complexity significantly without sacrificing throughput per unit area, the pipeline interleaving technique of [15] is employed in the PI-iBM algorithm and architecture proposed in [16].

As shown in the PI-iBM algorithm below, the interleaving factor *g* is crucial to the overall architecture. In practical RS (*n*, *k*, *t*) codes, such as the (255, 239, 8) code, *t*=8 is a common value, so *t*+1 = *pg* = 9. In this paper we therefore set both *p* and *g* to 3 for demonstrating the PI-iBM architecture.

The PI-iBM architecture consists of two blocks: the pipeline-interleaving error locator update (PI-ELU) block and the pipeline-interleaving discrepancy computation (PI-DC) block. As illustrated in Fig. 6, the PI-ELU block executes Step 3 to update the polynomials. Fig. 6(a) shows the internal architecture of the *i*-th PE. The initial values of the upper and leftmost registers are shown in the figure; all other registers are initialized to zero. For the *i*-th PE, each iteration requires 10 cycles to update the stored coefficients of **Λ**(*x*) and **B**(*x*), during which the "ctrl" signal is set to "1 0 0 0 0 0 0 0 0 0". At the beginning of the *r*-th iteration, *b*3*i*(*r*), *b*3*i*+1(*r*), *b*3*i*+2(*r*) are stored in the leftmost three registers, with *λ*3*i*(*r*-1)*γ*(*r*-1), *λ*3*i*+1(*r*-1)*γ*(*r*-1), *λ*3*i*+2(*r*-1)*γ*(*r*-1) in the upper three registers; they are then shifted around the upper and lower loops to be updated. During the first 3 cycles, *λ*3*i*(*r*), *λ*3*i*+1(*r*), *λ*3*i*+2(*r*) are successively computed and output to the PI-DC block for calculating the discrepancy *δ*(*r*) (Step 1). After the current iteration is completed, *b*3*i*(*r*+1), *b*3*i*+1(*r*+1), *b*3*i*+2(*r*+1) and *λ*3*i*(*r*)*γ*(*r*), *λ*3*i*+1(*r*)*γ*(*r*), *λ*3*i*+2(*r*)*γ*(*r*) are fed back to the registers that held them at the beginning. The two dashed rectangles indicate that the critical path between the lower multiplier and adder has been fine-grain pipelined into 3 stages; the path between the upper multiplier and adder is treated in the same way.

In addition, the PI-DC block mainly implements the updating of the discrepancy *δ*(*r*) (Step 1). A low-complexity, high-speed architecture for the PI-DC block is shown in Fig. 7. As shown there, the 2*t*-1 syndromes are sent serially to the PI-DC block and shifted through the upper *t*+1 registers **every 10 cycles**. The leftmost register is initialized to *S*0 while the other registers are initialized to zero. In each iteration the "ctrl 1" signal is set to "0 0 0 0 0 0 1 0 0 0". In the first 5 cycles of each iteration, the inputs *λj*, *λ*3+*j*, *λ*6+*j* and the corresponding syndromes selected by the multiplexers are multiplied by three 3-stage pipelined multipliers (shown by dashed lines). At the end of the 6th cycle, the accumulator circuit computes *δ*(*r*) and outputs it to the control block for updating *γ*(*r*) and SEL(*r*). After passing through another register, which cuts the path between PI-ELU and PI-DC in the control block, the three signals are fed back to the PI-ELU block. In the overall PI-iBM architecture (Fig. 8), it takes 7 cycles to calculate and output *δ*(*r*) (PI-DC block) and another 3 cycles to calculate the new coefficients (PI-ELU block), so the total time for one iteration is 10 cycles.
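From the timing above, the key-equation latency follows directly (a back-of-the-envelope check under the assumption that successive iterations do not overlap; the function name is ours):

```python
def piibm_kes_cycles(t, dc_cycles=7, elu_cycles=3):
    """2t iterations, each costing 7 cycles (PI-DC) plus 3 cycles (PI-ELU)."""
    return 2 * t * (dc_cycles + elu_cycles)
```

For the RS (255, 239, 8) code this gives 16 iterations of 10 cycles each, i.e. 160 cycles per codeword in the KES block.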

```
The PI-iBM Algorithm

Initialization and Input:
    Let t + 1 = p*g, where g is the factor of pipeline interleaving;
    lambda_0(0) = b_0(0) = 1;  lambda_l(0) = b_l(0) = 0 for l = 1, 2, ..., t;
    k(0) = 0;  gamma(0) = 1;
    S(x) = S_0 + S_1*x + S_2*x^2 + ... + S_{2t-1}*x^{2t-1}.

Iteration Process:
for r = 0 step 1 until 2t - 1 do
Begin
    Step1:  delta(r) = S_r*lambda_0(r) + S_{r-1}*lambda_1(r) + ... + S_{r-t}*lambda_t(r);
    Step2:  if (delta(r) != 0 and k(r) >= 0) then SEL(r) = 1; else SEL(r) = 0;
    Step3:  for j = 0 step 1 until g - 1 do
    Begin
        Step3.1:  lambda_{gi+j}(r+1) = gamma(r)*lambda_{gi+j}(r) + delta(r)*b_{gi+j-1}(r);
        Step3.2:  b_{gi+j}(r+1) = lambda_{gi+j}(r) if SEL(r) = 1, else b_{gi+j-1}(r),
                  for i = 0, 1, ..., p - 1;
    End
    Step4:  if SEL(r) = 1 then { gamma(r+1) = delta(r);  k(r+1) = -k(r) - 1; }
            else              { gamma(r+1) = gamma(r);  k(r+1) = k(r) + 1; }
End

Output:
    Lambda(x) = lambda_0(2t) + lambda_1(2t)*x + lambda_2(2t)*x^2 + ... + lambda_t(2t)*x^t;
    B(x)      = b_0(2t) + b_1(2t)*x + b_2(2t)*x^2 + ... + b_t(2t)*x^t.
```
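As a software cross-check of the recursion above, the following is a plain (non-interleaved) sketch of the inversionless iBM iteration; it assumes GF(2^8) with primitive polynomial 0x11d and α = 2, and the helper names are ours:

```python
def gf_mul(x, y, poly=0x11d):
    """Multiply two elements of GF(2^8) by shift-and-reduce."""
    r = 0
    while y:
        if y & 1:
            r ^= x
        x <<= 1
        if x & 0x100:
            x ^= poly
        y >>= 1
    return r

def gf_pow(x, n):
    """x**n in GF(2^8); the multiplicative group has order 255."""
    r = 1
    for _ in range(n % 255):
        r = gf_mul(r, x)
    return r

def ibm_kes(S, t):
    """Run 2t inversionless iBM iterations; return (lambda, b) coefficient lists."""
    lam = [1] + [0] * t          # lambda(0, x) = 1
    b = [1] + [0] * t            # b(0, x) = 1
    k, gamma = 0, 1
    for r in range(2 * t):
        # Step 1: discrepancy delta(r) = sum_i lambda_i(r) * S_{r-i}
        delta = 0
        for i in range(t + 1):
            if 0 <= r - i < len(S):
                delta ^= gf_mul(lam[i], S[r - i])
        # Step 3: lambda(r+1) = gamma(r)*lambda(r) + delta(r) * x * b(r)
        lam_new = [gf_mul(gamma, lam[i]) ^ (gf_mul(delta, b[i - 1]) if i else 0)
                   for i in range(t + 1)]
        # Steps 2 and 4: update the scratch polynomial and control variables
        if delta != 0 and k >= 0:
            b, gamma, k = lam[:], delta, -k - 1
        else:
            b, k = [0] + b[:-1], k + 1
        lam = lam_new
    return lam, b
```

For a single error of value *v* at position *p*, the syndromes are S_j = v·α^(pj), and the returned λ(x) vanishes at x = α^(-p) (up to the scalar γ factors accumulated by the inversionless update).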

Table 2 gives the implementation results of the PI-iBM decoder and also lists some other designs. From the table we find that the PI-iBM architecture delivers very high throughput with relatively low hardware complexity: the total throughput rate and the throughput per unit area of the PI-iBM design are at least 200% higher than those of the existing works. To achieve data rates from 10 Gb/s to 100 Gb/s, the PI-iBM decoder has the lowest hardware complexity. If 65 nm CMOS technology were used in the implementation, the throughput of our design could be increased significantly. Thus the current designs fit well with 10 Gb/s-40 Gb/s optical communication systems. For 100 Gb/s applications, two to three independent copies of the design may be needed; even so, the PI-iBM architecture retains the lowest hardware complexity compared with existing designs. In short, the PI-iBM decoder is very area-efficient for very high-speed optical applications.


**Figure 6.** The diagram of PI-ELU block. (a) The internal architecture of the *i*-th PE. (b) The overall architecture of PI-ELU block.


**Figure 7.** The diagram of PI-DC block.

**Figure 8.** The diagram of overall PI-iBM architecture.


**Table 2.** Implementation results and comparisons
