HIGH THROUGHPUT MEMORY-EFFICIENT VLSI DESIGNS

FOR STRUCTURED LDPC DECODING

Hrishikesh Sharma, Subhasis Das, Rewati Raman Raut and Sachin Patkar

Dept. of Electrical Engineering, IIT Bombay, Bombay, India

Keywords:

Error-correction codes, LDPC decoder, VLSI design.

Abstract:

Low-density Parity Check(LDPC) codes have been in focus of intense research in Error-correction Coding

in recent years. High throughput decoder design for them has been a big challenge for these codes. In this

paper, we report the ﬁrst scalable VLSI decoder design based on projective geometry (PG) structure of LDPC

codes. The design is based on memory-efﬁcient communication primitives known as perfect access sequences.

A high-throughput variation of above design achieves a throughput of 620 Mbps, much higher than what

communication standards require. The corresponding fully-parallel VLSI architecture was implemented on

Xilinx LX110T FPGA, as well as on 90-nm SAED EDK90 CORE Cell Library from Synposys. We ﬁnd that

PG-based graphs indeed offer an exciting way of parallelizing this computation, and many others in future.

1 INTRODUCTION

LDPC codes are an emerging class of codes which ex-

hibit superior bit error rate(BER) performance. Rela-

tive ease of decoder design, coupled with better per-

formance, has made LDPC codes started being used

in recent digital transmission and storage systems. All

LDPC decoding algorithms need large number of par-

allel working memories. Hence designing for mem-

ory efﬁciency is one of the signiﬁcant problems in its

decoder design (Tarable et al., 2004). In this work,

we ﬁrst report an LDPC decoder design that simply

uses a particular hardware scheduling to avoid mem-

ory bottlenecks arising from on-chip access conﬂicts.

In general, different code structures result in

different architectures, and hence different memory

management schemes. Our choice of structures is de-

rived out of geometry of projective planes (Kou et al.,

2001). This choice of structure can avoid memory

conﬂicts. We report prototype implementation results

on FPGA, to demonstrate simplicity of design, and

high efﬁciency of the hardware, for PG-based LDPC

codes, apart from throughput improvement. To over-

come one of the limitations of the ﬁrst design, which

inhibited its throughput, a second design has also been

presented. The second design has been implemented

both for FPGA and ASIC targets.

2 OVERVIEW OF LDPC CODES

LDPC codes are generally decoded using probabilis-

tic soft decision decoding process. The knowledge

of channel noise statistics is used to generate prob-

abilistic information for received bits, and given to

the decoder. The reliability of this bit information

is then successively improved over iterations, and

hard decisions on bit values made. Such decoders

make use of graphs known as Tanner graphs to repre-

sent codes, passing probabilistic messages along the

graph’s edges iteratively. The adjacency matrix of

this graph is called parity-check matrix H. A Tan-

ner graph has two sets of nodes: n bit nodes, and m

parity-check nodes, for a m×n sized parity-check ma-

trix. Each parity-check node is connected by an edge

to bit nodes corresponding to the code bits included

in that parity-check equation. The sequence of steps

involved in iterative decoding using log-sum-product

algorithm are as follows(Johnson and Weller, 2003).

1. The initial message is ﬁrst sent by bit nodes to

check nodes. This message is based on the calcu-

lated log-likelihood ratio(LLR) of received signal.

2. Next, check nodes calculate and send updated

LLRs to the bit nodes using the received mes-

sages. Computation is performed separately on

the sign and magnitude parts of these messages.

A XOR on the sign bits of the incoming bit mes-

sages forms the parity check result for the matrix

518

Sharma H., Das S., Raman Raut R. and Patkar S. (2011).

HIGH THROUGHPUT MEMORY-EFFICIENT VLSI DESIGNS FOR STRUCTURED LDPC DECODING.

In Proceedings of the 1st International Conference on Pervasive and Embedded Computing and Communication Systems, pages 518-521

DOI: 10.5220/0003366305180521

 SciTePress

row. The sign of each outgoing check message for

each edge of the graph is formed by further XOR-

ing the sign of the incoming bit message of that

edge with the row parity check result. The mag-

nitude of outgoing check message(extrinsic relia-

bility) is computed in the logarithmic domain as

= 2 tanh

−1



exp







− ln



tanh









∑

∀q:H



tanh





where λ

is the (output) reliability of the check

message, λ

the input message, and ϕ(x) is de-

ﬁned as ϕ(x) = − lntanh (|x|/2).

3. Next, syndrome test is done using combined LLR

∑

∀q:H

+ R

of intrinsic(R

) and extrinsic messages. A hard

decision of bit being 0 or 1 is then made using

the sign of L

. From this set of decided bits z,

if in some iteration H · z

= 0, then the decoded

codeword is supposed to be z

4. Else, updated messages, called residues, are sent

back to check nodes

∑

∈

+ R

and the iterations continue until convergence.

In a modiﬁed algorithm used in the second design,

min-sum algorithm, calculation of λ

in step (2) is

simpliﬁed to

= min

j,h

=1,q̸=r





− 0.693

3 PARALLEL SCHEDULING

MODEL

The scheduling model used in ﬁrst design is based

on Karmarkar’s template(Karmarkar, 1991). By def-

inition, a projective plane of order s contains n =

+ s + 1 points and as many lines. Each line is con-

nected to s + 1 points, and vice-versa. Given n pro-

cessing units and a memory system partitioned into

n memory blocks M

, M

, ·· · M

, Karmarkar’s tem-

plate can be applied by mapping memory blocks to

points and processing units to the lines. A memory

block and a processing unit are connected if the cor-

responding point belongs to the corresponding line.

Then any operation can be scheduled on each pro-

cessing unit such that load/store of operands is based

on perfect access patterns(PAP) and sequences(PAS)

as deﬁned in (Karmarkar, 1991). A schedule for such

a collection of operations leads to several important

advantages such as no memory conﬂict, full utiliza-

tion of processing units and memory bandwidth etc.

Check Nodes

Bit Nodes

Figure 1: A 2-dimensional PG as LDPC Tanner Graph.

One can argue that the behavior of regular PG-

based LDPC decoder is made up of perfect access

patterns and sequences(Sharma, 2007). Once that is

established, applicability and utility of Karmarkar’s

template is obvious. The two types of computation,

done alternately by bit and check nodes, remain same

over iterations. For each computation, the input mes-

sages have different values and hence can be stored

in (two) different sets of memory blocks. Hence we

ﬁrst split the LDPC decoding iteration into two com-

putations having topology similar to that Karmarkar’s

template envisages.Thus the problem of scheduling

for PAP/PAS in LDPC decoding is decomposed into

two isomorphic scheduling problems. In each bit

node processing, majority of computation involves

adding the extrinsic information collected over all the

edges incident on the particular bit node. The num-

ber and size of input coefﬁcients per bit node is a

constant. By taking two inputs at a time for addi-

tion, we can schedule a binary operation on each bit-

node processing unit, in every machine cycle. The

set of concurrent operations in each cycle then form

a PAP, while the set of all operations within complete

bit node processing forms the PAS. The processing is

similar in check nodes, though in log(tanh()) domain,

hence one can again ﬁnd perfect access patterns in

check-node processing as well. Hence the PG-based

LDPC code decoding algorithm does exhibit behavior

which has a decomposition based on perfect access

patterns and sequences. All that remains, then, is to

deal with its reﬁnement for detail design purposes.

4 DETAILS OF HARDWARE

MODEL

Most of the computational logic of the design is con-

tained in the node processing units. The bit nodes

HIGH THROUGHPUT MEMORY-EFFICIENT VLSI DESIGNS FOR STRUCTURED LDPC DECODING

519

read input messages from the check memories, write

back to bit memories, and vice-versa. The process-

ing units and memories are connected according to

the line/point incidences of order-8 projective plane.

For resource and speed efﬁciency, we chose to im-

plement the data path using 9-bit ﬁxed-point arith-

metic, which has 1 sign, 3 integer and 5 fraction bits.

To avoid overﬂow during accumulation, internal data

path was made 13-bit wide in bit nodes and 12-bits

wide in check nodes. The detailed micro-architecture,

including bit-node architecture, check node architec-

ture, memory block architecture, interconnect archi-

tecture, overall data and control path design can be

found in (Sharma, 2007).

5 FPGA IMPLEMENTATION

RESULTS

The length-73 decoder implementation was targeted

to Virtex 5 LX330T FPGA. The functionality of each

bit node is mapped to CLBs, requiring 28 CLBs. Sim-

ilarly, the functionality of each check node is mapped

to 42 CLBs and 2 DSP slices, to realize the multiply-

add operations in ϕ(x) function. The bit and check

memory blocks are mapped onto BRAMs. To avoid

routing congestion on global routes, we have imple-

mented a 50% reduction in global wires by time-

multiplexing the 2 instances of PG interconnect dis-

cussed earlier. The pre-routing maximum frequency

was found to be approximately 130 MHz, which af-

ter critical path optimization, improved to 155 MHz.

The number of cycles per iteration is 42, thus the

maximum system throughput achieved at practical

SNRs(≥ 2) is ≈ 90 Mbps. This throughput matches

the requirements of WiMAX(75 Mbps) and DVB-

S2(90 Mbps). Further analysis revealed that %-wise

utilization of size of a BRAM block of LX330T was

limited. We then realized that in each cycle, the FPGA

architecture restricted the parallel memory operations

to a maximum of 2 operations per BRAM block. If

it was possible to read more datum per cycle from

each memory block, then more number of computa-

tional nodes could have been simultaneously served,

and also better utilization of each memory block size.

We then tried to do a speciﬁc high-throughput decoder

design based on distributed memory elements, rather

than block memory elements, presented next.

6 A HIGH-THROUGHPUT

MIN-SUM DECODER

Based on min-sum algorithm, we implemented a

length-73 LDPC decoder having the same parity-

check matrix, H. The internal ﬁxed point datapath

bit-width was shrunk to 5 bits, consisting of 1 sign

bit, 3 integer bits and 1 fraction bit. The microar-

chitecture of bit and check processing units was re-

designed to account for algorithm variation. The con-

trol path was modeled as a simple 3-state simple cycle

FSM. The design was also pipelined, such that two

blocks of data are taken in at once by decoder. While

one block of data gets processed in the bit processing

units, the other block gets processed/updated in check

processing units, every iteration, simultaneously. The

processing unit’s interface was changed to allow all

the 9 inputs arrive simultaneously, since BRAMs and

two-port bottlenecks were eliminated. Also, the input

system was changed from a parallel input of all the

73 × 5 bits at one time, to a system where one bit of

each input is taken in every cycle, thus making 73 in-

put bits per cycle. This reduced the number of input

pins to a great extent and thus contributed to mini-

mizing the delays due to input buffers. The overall

microarchitecture of this design is described in (Das,

2010).

7 RESULTS AND ANALYSIS

7.1 FPGA Synthesis Results

The decoder was put on board for Xilinx Virtex-5

LX110T FPGA, and tested in similar conditions as

previous design. The maximum clock frequency was

found to be 180 MHz. The 73×2 5-bit intrinsic inputs

to initialize the decoder are taken from 73 pins, se-

quentially over 5×2 cycles. The maximum through-

put achieved at practical SNRs(≥ 2) for this imple-

mentation is ≈ 620 Mbps. Re-analysis shows that

for a similar design scaled for block length 1057,

this implementation would have had a throughput of

2.7 Gbps. In terms of FPGA implementations, this

throughput is next only to the best reported so far for

comparable block lengths (Zarubica et al., 2007), on

Virtex-4 FPGA.

7.2 ASIC Synthesis Results

We have also done an ASIC implementa-

tion of the length-73 decoder design, using

SAED EDK90 CORE Digital Standard Cell Li-

brary of 90nm technology from Synposys. The

PECCS 2011 - International Conference on Pervasive and Embedded Computing and Communication Systems

520

post-synthesis outcome suggested a maximum clock

frequency of 400 MHz, with enough timing and

power margins. Again, re-analysis shows that for

a similar design scaled for block length 1057, this

implementation would have had a throughput of 6

Gbps. This is very close to the highest throughput

reported for an ASIC design reported so far, 7 Gbps

in (Mohsenin et al., 2009) using 65 nm technology.

The area of our implementation was estimated as

1.99 mm

, and average power dissipation as 23.2

mW, at V

= 1.2 V.

7.3 Simulation Results

Detailed simulations show the good BER perfor-

mance of this implementation, assuming an AWGN

channel and BPSK modulation scheme. Our calcula-

tions show that the length-1057 code’s transmission

rate is within 0.03 bits/sec of Shannon capacity limit

over Binary Symmetric Channel.

7.4 Comparative Analysis

Our designs use PG structure of LDPC codes, that

have not been reported for decoder design before. In

general, PG codes converge very fast under SPA de-

coding(Kou et al., 2001), as well as for log-SPA de-

coding. This is because given the medium code rates

of PG codes, there are more parity checks updating

each bit probability, leading to faster convergence,

and hence higher throughput. Especially in 2

de-

sign, a novel micro-architecture of bit and check pro-

cessing units for higher degree nodes was evolved,

which has also not been reported until now. This

micro-architecture was able to meet more aggressive

timing constraints, and hence higher throughput.

8 CONCLUSIONS

We have reported two novel LDPC decoder designs

that are based on projective geometry structure of

LDPC codes. The throughputs of both designs ex-

ceed the requirements of various standards, with sec-

ond design’s throughput being many times greater

than required. BER and convergence performances

of both the decoders have also been found satisfac-

tory. The 1

design is currently undergoing further

system-level optimizations such as circuit retiming,

and elimination of multipliers. Based on the learn-

ing that wires are a limiting resources on a FPGA for

this decoder, a completely new superscalar pipelined

architecture is also currently being designed.

FPGAs are heavily resource limited, and hence we

could not ﬁt code of length more than 73, even though

the design is capable of handling any-length decod-

ing. A more pragmatic approach is to fold the geom-

etry, and map the folded geometry on the decoder’s

interconnect. A novel design for such semi-parallel

decoder architecture, based on symmetry and regular-

ity of projective geometry, was patented in (Sharma,

2007). In fact, we have found more applications of

PG-based interconnect in CD-ROM/DVD-R decod-

ing (Adiga et al., 2010) as well as matrix computa-

tions, and hence are convinced of its potential.

ACKNOWLEDGEMENTS

The authors are grateful to Tata Consultancy Services

for funding the research project under project code no.

1009298.

REFERENCES

Adiga, B., Chowdhary, S., Sharma, H., and Patkar, S.

(2010). System for Error Control Coding using

Expander-like codes constructed from higher dimen-

sional Projective Spaces, and their Applications. In-

dian Patent Requested. 2455/MUM/2010.

Das, S. (2010). A Min-Sum based 1.4 Gbps LDPC Decoder

Design. Technical report, Indian Institute of Technol-

ogy.

Johnson, S. J. and Weller, S. R. (2003). Low-density parity-

check codes: Design and decoding.

Karmarkar, N. (1991). A New parallel architecture for

sparse matrix computation based on ﬁnite projective

geometries. Proc. Supercomputing.

Kou, Y., Lin, S., and Fossorier, M. (2001). Low-density

parity-check codes based on ﬁnite geometries: a re-

discovery and new results. IEEE Transactions on In-

formation Theory, 47(7):2711–2736.

Mohsenin, T., Truong, D., and Baas, B. (2009). Multi-Split-

Row Threshold decoding implementations for LDPC

codes. IEEE International Symposium on Circuits and

Systems, pages 2449–2452.

Sharma, H. (2007). A Decoder for Regular LDPC codes

with folded Architecture. Indian Patent Requested.

205/MUM/2007.

Tarable, A., Benedetto, S., and Montorsi, G. (2004). Map-

ping interleaving laws to parallel Turbo and LDPC de-

coder architectures. IEEE Transactions on Informa-

tion Theory, 50(9):2002–2009.

Zarubica, R., Wilson, S., and Hall, E. (2007). Multi-Gbps

FPGA-Based Low Density Parity Check (LDPC) De-

coder Design. IEEE Global Telecommunications Con-

ference, pages 548–552.

HIGH THROUGHPUT MEMORY-EFFICIENT VLSI DESIGNS FOR STRUCTURED LDPC DECODING

521