words and 512 words, this kind of configuration results in the usage of 6 BRAM blocks, with only 72/512 utilization for each BRAM. For the second type of flooded architecture, for the same LDPC code, each memory block required to store the variable node messages has a 4-bit memory word and 96 words; in this case as well, the BRAM is poorly used. Several approaches have been proposed to address this issue. One is to use multiple codewords. The solution in [23] targets an increase in the number of memory words within the BRAM: the codewords are processed serially. This solution increases BRAM utilization for the same logic usage and throughput. The solution in [28] targets an increase in the size of the memory word stored in the BRAM and addresses serial-circulant, parallel row/column processing architectures: messages from multiple codewords are stored in the same memory word, and the number of processing units is increased in order to process the codewords in parallel. This solution increases both CLB logic usage and throughput. Also for serial-circulant, parallel row/column processing architectures, folding is presented in [26], which aims at storing messages associated with different columns/rows within the base matrix in the same BRAM.

It can be observed that FPGA implementations of flooded architectures present a wide range of architectural variations, with different parallelism degrees at different levels, which aim at different throughput/cost/error correction capability trade-offs. The fully parallel solution provides increased throughput, but at a high routing cost and with low flexibility. Partially parallel solutions use memories for message storage; for these architectures, BRAM-based memory units are targeted in the FPGA implementation. However, employing BRAM blocks leads to several challenges, related especially to their low usage.

5. Layered LDPC decoders

Layered architectures were first proposed in [8], with the main goal of reducing the number of required memory bits. In a layered decoder, two types of messages require memory storage: the AP-LLR messages and the check node messages. A typical layered LDPC decoder [29–33], depicted in Figure 5, contains the following components:

Figure 5. Layered decoding architecture.

1. Processing units: The processing in layered scheduling consists of the computation of the variable node messages, the computation of the check node messages, and the AP-LLR update. The variable node message is computed from the AP-LLR and the check node message. The check node message is computed in the same way as in flooded scheduling, while the AP-LLR is updated from the new values of the variable node and check node messages. Because messages require routing only between memories and processing units, and not between processing nodes as in the flooded case, a combined variable-check unit is employed for processing. The number of processing units in a typical layered decoder is equal to the number of rows which constitute one layer, which is usually given by the circulant size. A combined unit contains an adder to perform the variable message computation, a FIFO buffer, used for routing the updated variable node



message to the AP-LLR update, a comparator for updating the check node message, and the addition unit for the AP-LLR update [29–32]. Specific FPGA optimizations can be implemented within the combined processing unit, including the use of the 6-input LUTs within the CLB for the comparator implementation (the comparator is implemented as ROM memories [30]), as well as the use of the dedicated shift register chains for the implementation of the FIFOs. The processing unit has as inputs dc AP-LLR messages and dc check node messages, and outputs dc updated AP-LLR messages and dc updated check node messages. An important parameter for the entire decoding architecture is the parallelism degree at the variable-check unit level, which represents the number of AP-LLR messages processed each clock cycle (the maximum parallelism degree is equal to dc). A higher degree of parallelism requires more simultaneous AP-LLR reads and writes, as well as more routing, which leads to an increased number of memory ports or memory banks, and to more barrel shifters for routing [33].
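The sequence of operations performed by a combined variable-check unit can be sketched in software as follows. This is a minimal floating-point model of one layer update under the min-sum rule; the function and variable names are hypothetical, and quantization, saturation, and the barrel-shifter routing are omitted.

```python
# Illustrative software model of one layered min-sum layer update.
# All names are hypothetical; quantization and saturation are omitted.

def process_layer(ap_llr, check_msgs, layer_cols):
    """Update AP-LLRs and check node messages for one check row of a layer.

    ap_llr     : AP-LLR values, indexed by column (variable node)
    check_msgs : old check node messages for this row, one per edge
    layer_cols : columns (variable nodes) connected to this check row
    """
    # 1. Variable node messages: old check message subtracted from the AP-LLR.
    var_msgs = [ap_llr[c] - check_msgs[i] for i, c in enumerate(layer_cols)]

    # 2. Min-sum check node update: first and second minimum of the
    #    magnitudes, the index of the first minimum, and the sign product.
    mags = [abs(v) for v in var_msgs]
    idx_min1 = min(range(len(mags)), key=lambda i: mags[i])
    min1 = mags[idx_min1]
    min2 = min(m for i, m in enumerate(mags) if i != idx_min1)
    sign_prod = 1
    for v in var_msgs:
        if v < 0:
            sign_prod = -sign_prod

    new_check = []
    for i, v in enumerate(var_msgs):
        mag = min2 if i == idx_min1 else min1
        sign = sign_prod * (-1 if v < 0 else 1)
        new_check.append(sign * mag)

    # 3. AP-LLR update: new variable message plus new check node message.
    for i, c in enumerate(layer_cols):
        ap_llr[c] = var_msgs[i] + new_check[i]
    return new_check
```

In hardware, step 1 corresponds to the adder, step 2 to the comparator tree, and step 3 to the addition unit, with the FIFO holding the variable messages between steps 1 and 3.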

2. Memory blocks: Layered decoders require the storage of two types of messages: the AP-LLRs and the check node messages. The AP-LLRs are messages which are routed between different processing units between the processing of successive layers. The check node messages are specific to each processing unit: they do not require routing from one processing unit to another between different layers. Therefore, the AP-LLR memory is a shared, global memory, while the check node message memories are local to each processing unit. Regarding the AP-LLR memory, the memory word for each bank is of quant(γ̃) bits, while the maximum depth of this memory is equal to the number of columns in the base matrix. Regarding the BRAM implementation of the AP-LLR memory, a drawback is the low usage of the embedded block memory. Regarding the check node messages, two variants for their storage are used: (i) the uncompressed form, in which the β messages are stored in their conventional two's complement format, and (ii) the compressed form [34]. The compressed form is based on the fact that dc − 1 of the β messages within a row of the parity check matrix have the same absolute value, equal to the minimum of the absolute values of the α messages connected to the corresponding check node unit, while the absolute value of the dc-th check node message is equal to the second minimum. Therefore, a compressed check node message can be used, consisting of the signs, the first minimum, the second minimum, and the index of the first minimum. Regarding the FPGA implementation, the compressed form is suitable for shift register-based implementation in conventional CLB logic, while the uncompressed form is suitable for BRAM implementation. However, a BRAM-based implementation of the check node message memory in compressed form has been proposed for layered decoders with serial processing at the processing node level; routing from the BRAM blocks containing the check node messages to the processing units is achieved using large shift registers.
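The compressed format can be illustrated as follows. This is a sketch with hypothetical names; in hardware these fields are fixed-width bit vectors rather than Python lists, and the minima are computed by the comparator tree of the check node unit.

```python
# Sketch of the compressed check node message: instead of d_c two's
# complement β messages, only the signs, the first minimum, the second
# minimum, and the index of the first minimum are stored per row.
# Names and representation are illustrative.

def compress(alpha_msgs):
    """Build the compressed check node message from the incoming α messages."""
    mags = [abs(a) for a in alpha_msgs]
    idx = min(range(len(mags)), key=lambda i: mags[i])
    min1 = mags[idx]
    min2 = min(m for i, m in enumerate(mags) if i != idx)
    sign_prod = 1
    for a in alpha_msgs:
        if a < 0:
            sign_prod = -sign_prod
    # One sign bit per edge: sign of the corresponding β message.
    signs = [(sign_prod * (-1 if a < 0 else 1)) < 0 for a in alpha_msgs]
    return signs, min1, min2, idx

def decompress(signs, min1, min2, idx):
    """Expand to the d_c β messages: min2 on the minimum edge, min1 elsewhere."""
    return [(-1 if s else 1) * (min2 if i == idx else min1)
            for i, s in enumerate(signs)]
```

As an illustrative count, assuming dc = 20 and β messages quantized on 5 bits (sign plus 4-bit magnitude), the uncompressed form requires 20 × 5 = 100 bits per row, while the compressed form requires about 20 sign bits + 4 + 4 bits for the two minima + 5 bits for the index, i.e., roughly 33 bits.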


A major issue in layered architectures is represented by data hazards. Depending on the LDPC code, read-after-write (RAW) data hazards may affect the AP-LLR update: the updated value of the AP-LLR has not yet been written into the memory before it is read for the processing of a new layer [35]. The problem of data hazards is aggravated by the use of pipeline stages, both in the barrel shifters and in the processing units.
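One way to see where such hazards arise is to scan the layered schedule offline, as sketched below. The names are hypothetical, and the pipeline latency is expressed in layers for simplicity; real decoders resolve the reported conflicts by reordering layers, stalling, or forwarding.

```python
# Illustrative RAW-hazard check for a layered schedule: a hazard occurs
# when a column updated by one layer is read again by one of the next
# `latency` layers, before its new AP-LLR value has been written back.
# All names and the latency model are hypothetical.

def raw_hazards(layer_columns, latency):
    """layer_columns: per-layer list of column indices used by that layer.
    latency: number of layers started before a write completes
             (pipeline depth of barrel shifters plus processing units).
    Returns a list of (reader_layer, writer_layer, column) conflicts."""
    hazards = []
    n = len(layer_columns)
    for l in range(n):
        for k in range(1, latency + 1):
            nxt = (l + k) % n  # the schedule wraps across iterations
            for c in layer_columns[nxt]:
                if c in layer_columns[l]:
                    hazards.append((nxt, l, c))
    return hazards
```

For instance, with layers using columns [0, 1, 2], [2, 3, 4], and [5, 6, 7] and a latency of one layer, only column 2 creates a conflict, between the first two layers.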
