**4. Efficient digital hardware architecture**

**Figure 15.** Uncoded BER of the WFQ vs. the conventional WF and ZF receivers over 10 · log10 (*σ*<sup>2</sup>*x*/*σ*<sup>2</sup>*η*), QPSK modulation with *M* = 10, *N* = 10, 4-bit (*ρ<sup>q</sup>* = 0.01154) and 5-bit (*ρ<sup>q</sup>* = 0.00349) uniform quantizer. Shown are WFQ, WF, and ZF with 4- and 5-bit quantization, and WF and ZF without quantization.

that are multiplexed in time, the problem can be reduced from the MIMO to the SIMO (single-input-multiple-output) case, because each line of the multiconductor interconnect is excited separately for time-multiplexed pilots. Finally, the problem can be reduced to the SISO (single-input-single-output) case when the channel estimation is performed separately, in parallel, at each receiving end of the multiconductor interconnect. For this case, a closed-form solution for the maximum likelihood channel estimation problem can be found in [28], which makes performance analysis possible in an analytical fashion.

In [50], a more general setting for parameter estimation based on quantized observations was studied, which covers many processing tasks, e.g. channel estimation, synchronization, delay estimation, Direction-Of-Arrival (DOA) estimation, etc. An Expectation Maximization (EM) based algorithm is proposed to solve the Maximum A-Posteriori Probability (MAP) estimation problem. In addition, the Cramér-Rao Bound (CRB) is derived to analyze the estimation performance and its behavior with respect to the signal-to-noise ratio (SNR). The presented results treat both cases: pilot-aided and non-pilot-aided estimation. The paper deals extensively with the extreme case of single-bit quantization (a comparator), which simplifies the sampling hardware considerably. It also focuses on MIMO channel estimation and delay estimation as application areas of the presented approach. Among others, a 2×2 channel estimation using 1-bit ADCs is considered, which shows that reliable estimation to any desired accuracy may still be possible even when the quantization is very coarse, provided the pilot sequence is long enough. Since in on-chip and chip-to-chip communications the channel is almost time-invariant, it is possible to use very long pilot sequences and to run the channel estimation only once, or once in a while.

Sole optimization of the transmit power in the standardization and conception phase of communication channels results in highly complex and energy-intensive receivers, with a complex channel decoder as one of their key components. Neglecting the energy dissipation of the integrated decoder in this early phase results in suboptimal and, thus, costly communication systems in terms of manufacturing and usage costs. In the previous part of this chapter, approaches to reduce the ADC complexity and, thus, the complexity of the subsequent digital components by using single-bit or low-to-medium-resolution quantization have been discussed. A quantitative comparison of these new approaches to standard receivers requires accurate cost models of the digital components. Quite accurate cost models are available for most communication system components except for channel decoders. While such cost models can be easily derived for Viterbi, Reed-Solomon, and Turbo decoders, an estimation of the silicon area and the energy dissipation of LDPC decoders is challenging due to the high internal communication effort between the basic components.

Although LDPC codes were already introduced by Gallager in 1962 [55], to this day they are among the best-performing codes known [56] and are adopted in various communication standards (e.g. [57],[58],[59]) and other applications such as hard-disk drives [60]. They belong to the class of block codes and, thus, can be defined by a parity-check matrix *H* with *m* rows and *n* columns, or by the corresponding Tanner graph. Both are shown in Figure 16 for a very simple LDPC code. Each row of the parity-check matrix represents one parity check, wherein a '1'-entry in column *i* and row *j* indicates that the received symbol *i* takes part in parity check number *j*. In the Tanner graph, such a parity check is represented by one so-called check node and each column by one bit node. Furthermore, the number of '1'-entries per row, *dC* (per column, *dV*), defines the number of connected bit (check) nodes per check (bit) node.
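To make the matrix-to-graph correspondence concrete, the following sketch derives the Tanner-graph edges and the node degrees *dC* and *dV* from a small parity-check matrix. The matrix is a made-up (*dV* = 2, *dC* = 3)-regular toy example, not the one shown in Figure 16.

```python
# Toy (d_V = 2, d_C = 3)-regular parity-check matrix; illustrative only,
# not the matrix shown in Figure 16.
H = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
m, n = len(H), len(H[0])

# One Tanner-graph edge per '1'-entry: bit node i (column) <-> check node j (row).
edges = [(i, j) for j in range(m) for i in range(n) if H[j][i] == 1]

d_C = [sum(row) for row in H]                             # '1'-entries per row
d_V = [sum(H[j][i] for j in range(m)) for i in range(n)]  # '1'-entries per column

print(len(edges), d_C, d_V)  # 12 [3, 3, 3, 3] [2, 2, 2, 2, 2, 2]
```

The edge count equals both *n* · *dV* and *m* · *dC*, since rows and columns count the same set of '1'-entries.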

**Figure 16.** Parity-check matrix, Tanner-Graph and non-linear recursive decoder loop

*Ultra-Wideband Radio Technologies for Communications, Localization and Sensor Applications*: Chip-to-Chip and On-Chip Communications

Figure 16 also illustrates one of the *n* · *dV* non-linear recursive loops of the resulting decoder. In each iteration, the extrinsic information *L*(*qi*,*j*) on the received symbol *i* is sent to check node *j*, where new a-posteriori information *L*(*ri*,*j*) is derived. The sign of *L*(*ri*,*j*) is chosen such that the confidence in the received symbol *i*, indicated by the magnitude of *L*(*qi*,*j*), increases. For the sake of clarity, only the magnitude calculation is illustrated in Figure 16. In the original Sum-Product decoding algorithm [55], the check node consists of transcendental functions and a multi-operand adder with subsequent subtractor stages. The basic idea is that whenever all participating symbols in that parity check feature a high confidence in their current estimation, the magnitude of the a-posteriori information is high. The a-posteriori information *L*(*ri*,*j*) is then sent back to the bit node. Here, all information on symbol *i*, namely the *dV* a-posteriori values and the received information *L*(*ci*), is combined using a multi-operand adder, resulting in a new estimate *L*(*Qi*) of symbol *i*. To avoid decoding-performance-degrading cycles, in the next decoding iteration only the extrinsic information *L*(*qi*,*j*) = *L*(*Qi*) − *L*(*ri*,*j*) is used instead of *L*(*Qi*). For more information on the decoding algorithm and possible fixed-point realizations, refer to [61].
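The node operations above can be transcribed directly. The floating-point sketch below illustrates the Sum-Product update equations (it is not the fixed-point hardware data path, and the function names are chosen freely):

```python
import math

def check_node(Lq):
    """Sum-Product check-node update: for each incoming message L(q_i,j),
    derive L(r_i,j) from all *other* incoming messages.  The magnitude is
    computed via the log(tanh(x/2)) transform and a multi-operand sum, the
    sign from the product of the other messages' signs."""
    Lr = []
    for i in range(len(Lq)):
        others = [v for k, v in enumerate(Lq) if k != i]
        sign, acc = 1.0, 0.0
        for v in others:
            sign *= 1.0 if v >= 0 else -1.0
            acc += math.log(math.tanh(abs(v) / 2.0))     # log(tanh(x/2)) block
        Lr.append(sign * 2.0 * math.atanh(math.exp(acc)))  # 2*atanh(e^x) block
    return Lr

def bit_node(Lc, Lr):
    """Bit-node update: combine the received information L(c_i) with all d_V
    a-posteriori values, then feed back only the extrinsic part."""
    LQ = Lc + sum(Lr)          # multi-operand adder
    Lq = [LQ - r for r in Lr]  # L(q_i,j) = L(Q_i) - L(r_i,j)
    return LQ, Lq
```

Note that the magnitude of each check-node output never exceeds the smallest input magnitude among the other edges, which is what the Min-Sum approximation mentioned later exploits.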


A metric for the code's and, thus, the decoder's complexity is the number of '1'-entries in the matrix, *n* · *dV* = *m* · *dC*. Each '1'-entry can be assigned to a part of the bit- and check-node logic, as highlighted in gray in Figure 16. Thereby, each '1'-entry leads to four two-operand adders/subtractors, a block for the calculation of log (tanh (*x*/2)), a block for the calculation of 2 · atanh (*e<sup>x</sup>*), and a register stage at the output of the bit node. Additionally, *n* · *dV* is a measure for the communication between the nodes, as 2 · *w* · *n* · *dV* bits are exchanged between the bit and the check nodes in each decoding iteration, with *w* being the word length of the exchanged messages.
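As a quick illustration of these two metrics, assuming code no. 11 from Table 2 (*n* = 2048, *m* = 384, *dV* = 6, *dC* = 32) and a word length of *w* = 6:

```python
# Code complexity and per-iteration communication volume for one example code
# (code no. 11 of Table 2); w = 6 is an assumed message word length.
n, m, d_V, d_C, w = 2048, 384, 6, 32, 6

ones = n * d_V            # '1'-entries in H
assert ones == m * d_C    # rows and columns count the same entries

bits_per_iteration = 2 * w * n * d_V  # messages in both directions
print(ones, bits_per_iteration)       # 12288 147456
```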

In high-throughput applications with a time-invariant parity-check matrix, all bit and check nodes are typically instantiated in parallel, as in the first integrated LDPC decoder [66]. Here, typically the *m* check nodes are realized in the center of the decoder floorplan, surrounded by the *n* bit-node instances. The communication between the nodes is then realized by 2 · *w* · *n* · *dV* dedicated interconnect lines. In [66], the logic area, which is the accumulated silicon area of all logic gates, is approximately 25 *mm*<sup>2</sup>. However, the total of 26,624 interconnect lines cannot be realized on this area. The silicon area needs to be artificially expanded until a successful routing of all interconnect lines can be established. The resulting global interconnect has a length of 80 *m* on a macro size of 52.5 *mm*<sup>2</sup>. Thus, only 50% of the active silicon area is utilized in the final decoder. The impact of the complex global interconnect complicates the derivation of accurate area, timing, and energy cost models, which might be the reason why no such cost models are available in the literature so far. However, such models are necessary to avoid costly wrong decisions in early design phases, for example when choosing a certain LDPC code in the system-conception phase. Also in later design phases those models are indispensable, for example for a quantitative exploration of the architecture design space.

### **4.1. Accurate area, timing, and energy cost models**

In general, the silicon area of a high-throughput LDPC decoder can be estimated using

$$A_{DEC\_P} = \max(A_L, A_R), \tag{37}$$

with *AL* being the logic area and *AR* the area required to realize the global interconnect. To reduce the logic area, typically the approximate Min-Sum algorithm [70] is used, which estimates the magnitude of *L*(*ri*,*j*) using the minimal and second-minimal magnitude of *L*(*qi*,*j*) (e.g. [62],[63],[64],[65]). The derivation of *AL* for this decoding algorithm as the accumulated silicon area of all logic gates has been presented in [67]. The resulting total logic area can be estimated using
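A behavioral sketch of this approximation (illustrative, not a hardware description): since the magnitude of each output is the minimum over all *other* inputs, only the overall minimum and the second minimum have to be found.

```python
def min_sum_check_node(Lq):
    """Min-Sum check-node update: |L(r_i,j)| is the minimal magnitude of all
    other inputs, so the minimum and the second minimum suffice."""
    mags = [abs(v) for v in Lq]
    m1 = min(mags)                       # minimal magnitude
    i1 = mags.index(m1)
    m2 = min(mags[:i1] + mags[i1 + 1:])  # second-minimal magnitude
    sign_all = 1.0
    for v in Lq:
        sign_all *= 1.0 if v >= 0 else -1.0
    out = []
    for i, v in enumerate(Lq):
        mag = m2 if i == i1 else m1      # exclude the edge's own magnitude
        sign = sign_all * (1.0 if v >= 0 else -1.0)  # remove the edge's own sign
        out.append(sign * mag)
    return out

print(min_sum_check_node([1.0, 2.0, -3.0]))  # [-2.0, -1.0, 1.0]
```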


$$A\_L = l\_L^2 = 1000 \cdot n \cdot d\_V \cdot (11.5 \cdot w + 2 \cdot ld \,(d\_V)) \cdot \lambda^2. \tag{38}$$

This equation reveals a linear dependency between the code complexity *n* · *dV* and the accumulated gate area.

The major challenge in deriving an accurate routing-area model is the adaptability to different LDPC codes. It is possible to divide the problem into two parts: an estimation of the available and of the required Manhattan length. Considering a certain logic area, the available Manhattan length is a measure for the routing resources above the decoder's node logic. Considering that the node layouts require *ML* of the total *M* metal layers in the CMOS stack for the local interconnect, *MR* = *M* − *ML* metal layers are available for the realization of the global bit- and check-node communication. The required routing area *AR* can then be determined by equating the available and the required Manhattan lengths. This means that the available Manhattan length allows the realization of the required Manhattan length. Thereby, the available Manhattan length can be derived as

$$l_{AVAIL} = \frac{u}{p} \cdot \left( l_L^2 \cdot M_R + \max\left( l_{DEC}^2 - l_L^2,\, 0 \right) \cdot M \right), \tag{39}$$

with the routing pitch *p*, a utilization factor *u* for each metal layer, and a decoder macro side length of *lDEC*. Considering that no artificial increase of the decoder is required (*lDEC* ≤ *lL*), the second term is zero and the available routing resources are on top of the node logic. If the decoder needs to be expanded, the whole metal stack is available between the node instances for the realization of the global interconnect. Therefore, this part is weighted with *M*.
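Equations (38) and (39) can be bundled into a small model; the parameter values in the example below are illustrative, not taken from a concrete technology. Note that the second term of (39) only contributes once the decoder is expanded beyond the logic side length *lL*.

```python
import math

def logic_side_length(n, d_V, w, lam):
    """Eq. (38): side length of the accumulated logic area A_L = l_L^2,
    with ld() the binary logarithm and lambda the technology parameter."""
    A_L = 1000.0 * n * d_V * (11.5 * w + 2.0 * math.log2(d_V)) * lam ** 2
    return math.sqrt(A_L)

def available_manhattan_length(l_L, l_DEC, M, M_R, u, p):
    """Eq. (39): M_R layers over the node logic plus, if the decoder is
    expanded beyond l_L, the full metal stack (M layers) in between."""
    return (u / p) * (l_L ** 2 * M_R + max(l_DEC ** 2 - l_L ** 2, 0.0) * M)
```

For an unexpanded decoder (*lDEC* = *lL*) the function reduces to (*u*/*p*) · *lL*² · *MR*, matching the text above.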

The estimation of the required Manhattan length is more challenging, as it depends on code characteristics such as the number of interconnect lines and the average length of one interconnect line. An upper-bound estimate of the required Manhattan length can be derived by using the maximum possible length *lMAX* of one bit- and check-node connection, which is shown in Figure 17(a). In a typical placement with the bit nodes surrounding the check-node array, the longest possible connection runs from one corner of the decoder macro to the opposing corner of the check-node array. An analysis of the logic model [67] shows that the check nodes occupy about 60% of the complete decoder macro, leading to a maximum Manhattan length of

$$l_{MAX} = 2 \cdot \left( l_{CNA} + \frac{l_{DEC} - l_{CNA}}{2} \right) = l_{DEC} + l_{CNA} = 1.77 \cdot l_{DEC}. \tag{40}$$

**Figure 17.** Bit- and check-node architecture: (a) placement; (b) wire-length histogram

When looking at the wire-length histogram for an exemplary code (see Figure 17(b)), the average Manhattan length is significantly smaller than the maximum length, leading to an overestimation of the required Manhattan length and, thus, of the required routing area. An analysis of various LDPC codes showed that the shape of the wire-length histogram is always similar. In particular, the ratio between the average and the maximum Manhattan length was found to be almost constant, as can be seen from Table 2. For the derivation of the average Manhattan lengths, all placements have been optimized using a custom simulated annealing process [62]. While code no. 11 is the code adopted in [57], the other codes are taken from [68]. For a wide range of LDPC codes with code complexities *n* · *dV* between 300 and 24,000, the ratio varies only between 0.30 and 0.37. Approximating the ratio of the average to the maximum Manhattan length with 0.35 and using (40), the required Manhattan length can be estimated based on the decoder side length as

$$l_{REQ} = 1.2 \cdot n \cdot d_V \cdot w \cdot l_{DEC}. \tag{41}$$

**Figure 18.** Switching activity: (a) SNR-dependent switching activity (BER and *σLr* over the SNR); (b) table of switching activities (*BER* = 10<sup>−5</sup>):

| Code nr. | *SNR* [dB] | *σLq* | *σLr* |
|---|---|---|---|
| 1 | 2.6 | 0.14 | 0.30 |
| 2 | 1.7 | 0.17 | 0.34 |
| 3 | 1.7 | 0.17 | 0.33 |
| 4 | 2.9 | 0.14 | 0.30 |
| 5 | 1.5 | 0.18 | 0.34 |
| 6 | 0.5 | 0.16 | 0.33 |
| 7 | 0.5 | 0.16 | 0.33 |
| 8 | 0.5 | 0.16 | 0.33 |
| 9 | 9.8 | 0.41 | 0.29 |
| 10 | 1.4 | 0.19 | 0.34 |
| 11 | 3.0 | 0.20 | 0.33 |
| 12 | 1.2 | 0.19 | 0.34 |
| 13 | 0.1 | 0.20 | 0.35 |
| 14 | 1.2 | 0.19 | 0.34 |

Additionally, an estimation of the achievable utilization is possible based on the comparison of the average routing density *ρAVG* and the maximum routing density *ρMAX*. The ratios of these values for vertical and horizontal interconnect lines are also given in Table 2. Although there are exceptions (e.g. code no. 9), the utilization *u* = *ρAVG*/*ρMAX* is almost constant and will be set to *u* = 0.5 in the following.

Considering that the decoder area needs to be expanded, and assuming a uniform stretch, (39) and (41) still hold. Then, the minimal required decoder area *AR* to realize the global interconnect can be calculated by equating (39) and (41) and solving for *lR* as

$$A\_R = l\_R^2 = \left(1.2 \cdot \frac{n \cdot d\_V \cdot w}{M} \cdot p + \sqrt{\left(1.2 \cdot \frac{n \cdot d\_V \cdot w}{M} \cdot p\right)^2 - l\_L^2 \cdot \frac{M\_R - M}{M}}\right)^2. \tag{42}$$
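Solving the quadratic balance between (39) and (41) can be sketched as follows (with *u* = 0.5 already absorbed into the constants, as in (42)); the numbers in the usage example are illustrative only.

```python
import math

def routing_side_length(n, d_V, w, l_L, M, M_R, p):
    """Eq. (42): decoder side length l_R at which the available Manhattan
    length (39) equals the required one (41); u = 0.5 is absorbed."""
    a = 1.2 * n * d_V * w * p / M
    # M_R < M, so the term under the square root is always positive
    return a + math.sqrt(a ** 2 - l_L ** 2 * (M_R - M) / M)

def decoder_area(l_L, l_R):
    """Eq. (37): the decoder is logic- or routing-limited, whichever is larger."""
    return max(l_L ** 2, l_R ** 2)
```

Substituting the result back into the underlying quadratic, *M* · *lR*² − 2.4 · *n* · *dV* · *w* · *p* · *lR* + *lL*² · (*MR* − *M*) = 0, confirms the closed form.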



**Table 2.** Interconnect properties of various LDPC codes

| Code nr. | *n* | *m* | *dV* | *dC* | *n* · *dV* | *lAVG*/*lMAX* | *ρAVG*/*ρMAX* (vert.) | *ρAVG*/*ρMAX* (horiz.) |
|---|---|---|---|---|---|---|---|---|
| 1 | 96 | 48 | 3 | 6 | 288 | 0.33 | 0.58 | 0.59 |
| 2 | 408 | 204 | 3 | 6 | 1224 | 0.31 | 0.56 | 0.55 |
| 3 | 408 | 204 | 3 | 6 | 1224 | 0.30 | 0.55 | 0.56 |
| 4 | 408 | 204 | 3 | 6 | 1224 | 0.31 | 0.54 | 0.50 |
| 5 | 816 | 408 | 3 | 6 | 2448 | 0.31 | 0.52 | 0.57 |
| 6 | 816 | 408 | 5 | 10 | 4080 | 0.34 | 0.52 | 0.58 |
| 7 | 816 | 408 | 5 | 10 | 4080 | 0.34 | 0.52 | 0.57 |
| 8 | 816 | 408 | 5 | 10 | 4080 | 0.34 | 0.50 | 0.55 |
| 9 | 999 | 111 | 3 | 27 | 2997 | 0.37 | 0.36 | 0.33 |
| 10 | 1008 | 504 | 3 | 6 | 3024 | 0.32 | 0.54 | 0.46 |
| 11 | 2048 | 384 | 6 | 32 | 12288 | 0.37 | 0.42 | 0.40 |
| 12 | 4000 | 2000 | 3 | 6 | 12000 | 0.34 | 0.57 | 0.54 |
| 13 | 4000 | 2000 | 4 | 8 | 16000 | 0.35 | 0.55 | 0.55 |
| 14 | 8000 | 4000 | 3 | 6 | 24000 | 0.35 | 0.55 | 0.45 |

In contrast to the logic area, the routing area shows a quadratic dependence on the code complexity. By comparing (38) and (42), it can be shown that the bit-parallel decoder is routing-dominated as soon as

$$n \cdot d_V \ge 500 \cdot \frac{M_R^2}{w}. \tag{43}$$

The required artificial increase of the silicon area also impacts the other two decoder features: the energy per iteration *EIT* and the iteration period, which is the required time for one decoding iteration and the inverse of the block throughput [69]. Here, only the interconnect fraction of the decoder energy will be discussed in detail. For more information on the derivation of the iteration period and the total decoder energy refer to [67]. The dynamic energy dissipation of the global interconnect can be estimated using

$$E_{INT} = \frac{1}{2} \cdot \left(\sigma_{Lq}\left(BER\right) + \sigma_{Lr}\left(BER\right)\right) \cdot \alpha \cdot C' \cdot \frac{l_{REQ}}{2} \cdot V_{DD}^2, \tag{44}$$

with *VDD* being the supply voltage, *C*′ the capacitive load per unit length of a minimum-spaced interconnect line, and *α* a fitting factor covering the fact that, on average, the global interconnect lines are not minimum-spaced [67]. Furthermore, the switching activities on the interconnect lines from bit to check nodes (*σLq*) and vice versa (*σLr*) need to be considered. In Figure 18(a), the BER and the switching activity *σLr* are illustrated for two codes from Table 2 and different signal-to-noise ratios. The switching activity highly depends on the considered SNR and is especially high in the so-called waterfall region, when the BER starts to get significantly smaller. Furthermore, the two codes strongly differ when comparing the switching activities for a given SNR (e.g. 1 *dB*). But considering a specific BER (indicated by the dashed lines), an almost equal switching activity for the two codes can be observed (approx. 0.33 for a BER of 10<sup>−5</sup>). The comparison of the switching activities *σLq* and *σLr* for all codes listed in Figure 18(b) shows that this behavior is common for almost all other codes as well. Therefore, a quite accurate estimation of the decoder energy based on the code parameters *n* and *dV* is possible without knowledge of the actual LDPC code.
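A sketch of (44) with the roughly 0.33 switching activities read off at *BER* = 10<sup>−5</sup>. The capacitance per unit length, fitting factor, and supply voltage below are assumed example values, while the 80 m interconnect length is the one reported for the decoder of [66].

```python
def interconnect_energy(sigma_Lq, sigma_Lr, alpha, C_prime, l_REQ, V_DD):
    """Eq. (44): dynamic energy of the global interconnect per iteration.
    C_prime is the capacitance per unit length of a minimum-spaced wire."""
    return 0.5 * (sigma_Lq + sigma_Lr) * alpha * C_prime * (l_REQ / 2.0) * V_DD ** 2

# Assumed values: 0.2 fF/um wiring capacitance, alpha = 1.5, V_DD = 1 V,
# and the 80 m of global interconnect reported for the decoder in [66].
E_INT = interconnect_energy(0.33, 0.33, 1.5, 0.2e-15 / 1e-6, 80.0, 1.0)
print(E_INT)  # on the order of a few nJ per iteration
```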


*routing density*

(a) Bit- and check-node architecture

check node interconnect bit node

interconnect

**Figure 21.** Bit- and check-node architecture design space.

*4.1.2. Hardware-efficient partially bit-serial decoder architecture*

clock cycles /

**Figure 20.** Routing density

*routing density*

P S

P S P S P S P S P S P S P S

P S P S P S

iteration 1 9 9 9 9 14 9 12 9 9 99 14 14 14 14

Another promising approach to reduce the decoder's silicon area is the introduction of a bit-serial interconnect as proposed in [65]. The number of interconnect lines can be reduced by a factor of *w* resulting in a significant reduction in decoder area because of the quadratic dependency in (42). While the realized minimum search in the check node requires a most-significant-bit-first data flow in the check node the multi-operand adder in the bit node has to be realized using a least-significant-bit-first data flow. Therefore, the order of the bits needs to be flipped twice per iteration resulting in a high number of clock cycles. Although the clock frequency of the decoder is higher due to the bit-serial node logic, the high number of clock cycles per iteration limit the achievable decoder throughput and block latency. However, it is possible to introduce a bit-serial data flow in a more fine-grained way. A systematic architecture analysis is possible by breaking the decoder loop into four parts as shown in Figure 21, namely the bit and check node and the communication between the nodes in both directions. Now, possible architectures can be distinguished by assuming either a bit-serial or a bit-parallel approach in each of the four parts. Obviously, also a digit-serial approach is possible as discussed in [69]. Considering only a bit-serial or bit-parallel data flow, in total 16 different architectures are possible. As a first order metric of the decoder throughput, the number of clock cycles per iteration considering a message word length of *w* = 6 is given. To avoid extensive routing-induced extensions of the silicon area, especially the highlighted architectures with a bit-serial communication in both directions should be taken into account. When comparing the number of clock cycles per iteration for these four architectures, the

P S

P S

P S

P S

(b) Hybrid-cell architecture

Chip-to-Chip and On-Chip Communications 101

**Figure 19.** Hybrid-cell decoder architecture

### *4.1.1. Hybrid-cell decoder architecture*

The main routing problem of the bit- and check-node architecture arises from the high routing density at the border of the check-node array as it can be seen in the interconnect-density chart in Figure 20(a) for an exemplary code [57]. To overcome this drawback it is possible to break up the bit- and check-node clustering of the logic and rearrange it. The new idea is based on the observation, that each '1'-entry in the parity-check matrix can be assigned to certain parts of the decoder loop. Then, the decoder consists of *n* · *dV* small, equal basic components. A combination of the logic for one '1'-entry (see grey blocks in Figure 16) leads to the block diagram of one hybrid cell, as it is shown in Figure 19(a). This hybrid cell gets the accumulated information *LTEMP*\_*i*−1(*Qi*) of the received A-priori information *<sup>L</sup>*(*ci*) and of all A-posteriori information of the previous hybrid cells and adds the A-posteriori information *L*(*ri*,*j*) of check node *j*. The resulting information *LTEMP*\_*i*(*Qi*) is forwarded to the next hybrid cell. The last hybrid cell in that column calculates *L*(*Qi*) and sends this value back to all participating hybrid cells. A similar structure is used in the check-node part of the hybrid-cell where the calculation of *L*(*Ri*) is distributed over *dC* hybrid cells. Although, here, the hybrid-cell approach considers a Sum-Product algorithm, it is also applicable to a Min-Sum based decoder. Therefore, the Φ function and the multi-operand adder have to be replaced with basic compare-and-swap cells.

In contrast to the bit- and check-node architecture, in which the (*dV* + 1)-operand adder in the bit node and the *dC*-operand adder in the check node would be realized using a tree topology, the hybrid-cell architecture is based on an adder-chain topology. However, it is possible to introduce tree stages for the bit-node operation, as illustrated in Figure 19(b). The *L*(*ri*,*j*) values are accumulated in two branches, and the intermediate results are added to the channel information *L*(*ci*) in an additional IO cell. A similar topology is possible for the check-node operation. The global interconnect of the hybrid-cell architecture has been realized in a 90-nm CMOS technology using five metal layers for the same code as used for the bit- and check-node architecture in Figure 20(a). In a first step, the placement of the nodes was optimized using a custom simulated-annealing process, assuming a placement scheme as depicted in Figure 20(b) with the hybrid cells surrounded by the IO cells. The advantage of the hybrid-cell architecture becomes obvious when comparing the two interconnect densities: the routing density of the hybrid-cell architecture is distributed much more uniformly, in particular without the high density peaks at the border of the bit- and check-node array. As a consequence, the average routing density of the hybrid-cell interconnect can be higher than that of the bit- and check-node architecture, promising a smaller silicon area.
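The chain-versus-tree trade-off can be made concrete with a toy first-order model of the critical path. This is an illustrative sketch under stated assumptions, not the chapter's cost model: we count two-operand adder stages when summing the *dV* messages and the channel LLR, assume each branch in the Figure 19(b) scheme accumulates about half the messages, and let the IO cell combine the two branch results with *L*(*ci*) in two further additions.

```python
import math

def depth_chain(d_v):
    """Pure adder chain: every message is added in sequence to the channel LLR."""
    return d_v

def depth_two_branches(d_v):
    """Two parallel chains of ~dV/2 messages each (assumed split), then the IO
    cell adds the two branch sums and L(ci): two more sequential additions."""
    return math.ceil(d_v / 2) - 1 + 2

def depth_tree(d_v):
    """Fully balanced adder tree over the dV + 1 operands."""
    return math.ceil(math.log2(d_v + 1))

for d_v in (4, 6, 8):
    print(d_v, depth_chain(d_v), depth_two_branches(d_v), depth_tree(d_v))
```

Even this crude count shows why the two-branch variant of Figure 19(b) is attractive: it recovers much of the tree's speed while keeping the cell-to-cell chain structure that gives the architecture its uniform routing density.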


**Figure 21.** Bit- and check-node architecture design space.

### *4.1.2. Hardware-efficient partially bit-serial decoder architecture*

Another promising approach to reduce the decoder's silicon area is the introduction of a bit-serial interconnect, as proposed in [65]. The number of interconnect lines can be reduced by a factor of *w*, resulting in a significant reduction of decoder area because of the quadratic dependency in (42). While the minimum search realized in the check node requires a most-significant-bit-first (MSB-first) data flow, the multi-operand adder in the bit node has to be realized using a least-significant-bit-first (LSB-first) data flow. Therefore, the order of the bits needs to be flipped twice per iteration, resulting in a high number of clock cycles. Although the clock frequency of the decoder is higher due to the simple bit-serial node logic, the high number of clock cycles per iteration limits the achievable decoder throughput and increases the block latency. However, it is possible to introduce a bit-serial data flow in a more fine-grained way. A systematic architecture analysis becomes possible by breaking the decoder loop into four parts, as shown in Figure 21: the bit node, the check node, and the communication between the nodes in both directions. Possible architectures can then be distinguished by assuming either a bit-serial or a bit-parallel approach in each of the four parts; a digit-serial approach is also possible, as discussed in [69]. Considering only bit-serial and bit-parallel data flows, in total 2<sup>4</sup> = 16 different architectures are possible. As a first-order metric of the decoder throughput, the number of clock cycles per iteration is given for a message word length of *w* = 6. To avoid extensive routing-induced growth of the silicon area, especially the highlighted architectures with a bit-serial communication in both directions should be taken into account. When comparing the number of clock cycles per iteration for these four architectures, the

architecture with a bit-parallel bit node allows for the smallest number of clock cycles per iteration and thus promises the highest decoder throughput. As a bit-parallel realization of the bit node would result in a large silicon area and a long critical path, further optimizations on the arithmetic level are necessary. Here, it is possible to exploit the bit-serial input data stream by realizing the multi-operand adder in the bit node bit-serially using an MSB-first data flow. Within each clock cycle, a partial sum *L<sup>k</sup>*(*c*, *q*) is generated for the received bit weight and subsequently accumulated to derive the new estimate *L*(*Qi*), as shown in the decoder loop in Figure 22. The long ripple path in the accumulator unit, which runs over the complete word length, can be shortened using a carry-select principle. For further details of the realization on the arithmetic and circuit level, refer to [62].

**Figure 22.** Partially bit-serial architecture

### *4.1.3. Quantitative architecture comparison*

The cost models have been adapted to the new architecture concepts to allow for a quantitative evaluation of the architecture design space. Figure 23(a) illustrates the resulting silicon area *A* and iteration period *TIT* of the fully bit-parallel, fully bit-serial, hybrid-cell, and partially bit-serial decoder architectures for three different code complexities *n* · *dV* = 5,000, 10,000 and 15,000. For all code complexities, the new architecture concepts are Pareto optimal, as they allow for a trade-off between silicon area and iteration period in comparison to the bit-parallel and bit-serial architectures. For small code complexities, the decoder architectures with a bit-parallel interconnect show the smallest area-time (AT) product and are therefore most AT-efficient. For a specified decoder throughput, the hybrid-cell architecture is promising whenever the timing constraints cannot be met by bit-serial approaches, as it reduces the silicon area significantly in comparison to the bit-parallel bit- and check-node architecture. The new partially bit-serial architecture features the smallest area-time product for all code complexities larger than 9,000. In comparison to the bit-serial architecture, a significantly smaller iteration period is achieved at only slightly increased area. The architectures with a bit-parallel interconnect are located further and further away from the curve representing the smallest achievable area-time product (*A* · *TIT* = const.); their timing advantage vanishes for large code complexities. Figure 23(b) depicts the energy per decoding iteration *EIT* of the four decoder architectures for different code complexities. Here, the advantage with respect to energy of the decoder architectures with a bit-serial interconnect becomes apparent. For code complexities larger than 10,000, the energy per iteration of the bit-parallel decoder becomes more than twice as high as that of the partially bit-serial architecture. The latter allows for the smallest energy per iteration over the complete code-complexity range. This emphasizes the efficiency of the new partially bit-serial architecture, which simultaneously allows for the smallest area-time product in a wide range of code complexities and the smallest decoding energy.

**Figure 23.** Comparison of area, timing and energy features (*dV* = 6, *dC* = 32, *MR* = 4, *w* = 6, *λ* = 40 nm): (a) area and timing features; (b) energy features

This work has been supported by the German Research Foundation (DFG) under the priority program UKoLoS (SPP1202).

**5. Conclusion**

This chapter presented results accomplished within the frame of the DFG priority program »Ultrabreitband Funktechniken für Kommunikation, Lokalisierung und Sensorik« (ultra-wideband radio techniques for communication, localization and sensing). The focus was put primarily on the analysis and optimization of on-chip and chip-to-chip multi-conductor/multi-antenna interconnects. While we could show that special techniques of physical optimization, coding, and signal processing can improve interconnect performance to a remarkable degree, even higher performance is expected to be achievable in chip-to-chip communication when multi-conductor interconnects are replaced by wireless ultra-wideband multi-antenna interconnects. In that case, the signal pulses do not necessarily disperse more and more as they travel toward the receiving end of the interconnect; the propagating nature of the wireless interconnect can make it a much more attractive channel for chip-to-chip communications. The primary goal has been the development of both theoretical and empirical foundations for the application of ultra-wideband multi-antenna wireless interconnects to chip-to-chip communication. Suitable structures for integrated ultra-wideband antennas have been developed, their properties theoretically analyzed and verified against measurements performed on manufactured prototypes. Coding and signal processing techniques that aim at efficient use of the available resources of bandwidth, power, and chip area have been proposed. In addition, attention was given to the implementation of iterative decoding structures for LDPC codes. Detailed cost models, which are based on signal flow charts and VLSI implementations of dedicated functional blocks
