SlideShare a Scribd company logo
1 of 6
Download to read offline
IEEE SiPS, Oct. 2013, Taipei, Taiwan 1 / 6
HIGH SPEED DECODING OF NON-BINARY IRREGULAR LDPC CODES USING GPUS
Moritz Beermann, Enrique Monzó⋆
, Laurent Schmalen†
, Peter Vary
Institute of Communication Systems and Data Processing ( ), RWTH Aachen University, Germany
⋆
now with Leineweber GmbH, Aachen, Germany
†
now with Bell Laboratories, Alcatel-Lucent, Stuttgart, Germany
{beermann|vary}@ind.rwth-aachen.de
ABSTRACT
Low-Density Parity-Check (LDPC) codes are very powerful chan-
nel coding schemes with a broad range of applications. The exis-
tence of low complexity (i.e., linear time) iterative message passing
decoders with close to optimum error correction performance is one
of the main strengths of LDPC codes. It has been shown that the
performance of these decoders can be further enhanced if the LDPC
codes are extended to higher order Galois fields, yielding so called
non-binary LDPC codes. However, this performance gain comes at
the cost of rapidly increasing decoding complexity. To deal with
this increased complexity, we present an efficient implementation of
a signed-log domain FFT decoder for non-binary irregular LDPC
codes that exploits the inherent massive parallelization capabilities
of message passing decoders. We employ Nvidia’s Compute Unified
Device Architecture (CUDA) to incorporate the available processing
power of state-of-the-art Graphics Processing Units (GPUs).
Index Terms— non-binary LDPC codes, iterative decoding,
GPU implementation
1. INTRODUCTION
Low-Density Parity-Check (LDPC) codes were originally presented
in [1] and later rediscovered in [2]. The most prominent decod-
ing algorithm for LDPC codes is the belief propagation (BP) algo-
rithm [3], which has linear time complexity. Even though BP de-
coding is only an approximation of maximum likelihood for practi-
cal codes, a performance close to the Shannon limit is observed for
long LDPC codes. It has been shown that the performance of short-
to medium-length LDPC codes can be improved by employing so
called non-binary LDPC codes over higher order Galois fields Fq
with q = 2p
(p ∈ N) elements [4].
While the decoding complexity of binary LDPC codes of di-
mension Nbin
is in the order of O(Nbin
), this is not the case for non-
binary LDPC codes of length N = Nbin
/p (this relation of N and
Nbin
serves for a fair comparison, as will be explained later). With
a straightforward implementation [4] , the decoding complexity in-
creases to O(Nq2
) = O(N22p
). With an intelligent, signed-log
domain Fast Fourier transform (FFT) based approach [5, 6, 7], this
complexity can be reduced to O(Nq log2 q), which will still be con-
siderably higher than the complexity of binary belief propagation, if
large Galois field dimensions are used.
Especially for applications where decoding complexity is not a
main concern (e.g., deep space applications), employing non-binary
LDPC codes is a good choice. They exhibit excellent error correc-
tion performance while the complexity stays linearithmic (i.e., better
than any polynomial time algorithm with exponent greater than 1).
Approaches to further decrease the decoding complexity exist
(e.g., [8]), but they suffer from an additional performance penalty
compared to BP. However, for comparison, it is often desirable to
assess the performance of the BP decoder without any approxima-
tions. Therefore, we propose to build a fast reference decoder using
the parallel computing capabilities of today’s graphic cards.
The massive computational power of state-of-the-art graphic
cards has led to the utilization of Graphics Processing Units (GPUs)
for signal processing applications, e.g., [9], [10]. Graphic cards
with GPUs by Nvidia Corporation can be elegantly programmed
in a language similar to C using Nvidia’s Compute Unified Device
Architecture (CUDA) [11]. The main advantage of GPUs is the
large amount of processing cores available on the processor, even
in low-end, low-cost products. These can be used for a parallel
execution of many problems.
In this paper, we present a massively parallelized implementa-
tion of the signed-log domain FFT based decoder that was presented
in [7]. The implementation is generic, i.e., it works for arbitrary ir-
regular codes and does not require any special code structure. The
CUDA framework is employed to adapt a single-core reference CPU
implementation to be executed on one of Nvidia’s general purpose
GPUs. Decoder throughput results are presented for the CPU as well
as the GPU implementations for different Galois field dimensions.
2. NON-BINARY LDPC CODES
2.1. Description
An (N, K) non-binary LDPC code over Fq := {0, 1, α, . . . αq−2
}
(α denoting a root of the primitive polynomial defining Fq) with
q = 2p
(N coded symbols and K information symbols of p ∈ N
bit each) is defined as the null space of a sparse parity check ma-
trix H with entries from Fq and of dimension M × N (with M =
N − K being the number of parity check symbols). Hm,n ∈ Fq
denotes the entry of H at row m and column n. The set N(m) =
{n : Hm,n = 0} denotes the symbols that participate in parity check
equation m. Similarly, the set M(n) = {m : Hm,n = 0} contains
the check equations in which symbol n participates. Exclusion is de-
noted by the operator “”,e.g., M(n){m} describes the set M(n)
with check equation m excluded. An irregular LDPC code has the
property that |M(n)| = const. and |N(m)| = const. Similarly to
the binary case, non-binary LDPC codes are commonly described by
their (weighted) Tanner graph [12], which is a bipartite graph with
N variable nodes and M check nodes connected by edges of weight
Hm,n according to the entries of H. Most decoding algorithms are
so called message passing algorithms that iteratively pass messages
between the variable and check nodes.
IEEE SiPS, Oct. 2013, Taipei, Taiwan 2 / 6
Channel
LLRs
Output
Bits Yes No
Initialize
Qn, Qn,m,
and Rm,n
Evaluate
Check
Equations
Abort?
Hard Decision
of Qn
Start
Stop
Permute
Qn,m
Log-FFT
F(·)
Check Node
Update (1a,b)
Inv. Log-FFT
F−1(·)
Inv. Permute
Rm,n
Variable Node
Update (2)
Update
posteriors Qn
Fig. 1. Flowchart of the non-binary signed-log domain FFT LDPC decoder. Each gray shaded block represents a CUDA kernel (see Sec. 3.2).
2.2. Decoding Algorithm: Non-Binary Belief Propagation
A decoding algorithm for non-binary LDPC codes over Fq has al-
ready been described in [4]. This algorithm uses the Fast Fourier
Transform (FFT) over Fq for efficiently computing the check node
operations (see also [5]). Its drawback is, however, that the imple-
mentation needs to be carried out in the probability domain and is,
thus, not well suited for an implementation due to eventual numer-
ical instabilities. An implementation in the log domain has been
given in [13], which, however, does not make use of the Fourier
transform for a low-complexity computation of the check node op-
eration (which considerably reduces the complexity for large field
sizes q). In this work, we follow the approach of [6, 7], which uses
an FFT and a representation of messages in the signed-log domain
to carry out the check node update. Other approaches to reduce the
complexity aim at computing only the most relevant terms of the
update equation [8], but the resulting decoder shows performance
losses. In the following, we briefly describe the signed-log domain
algorithm according to [7]. A flowchart is given in Fig. 1.
The binary input additive white Gaussian noise (BIAWGN)
channel output vector of (binary) log likelihood ratios (LLRs) is
denoted L(zbin
) = L(zbin
1 ), . . . L(zbin
N·p) with zbin
i denoting the re-
ceived, noisy binary phase shift keying (BPSK) symbol. To describe
the q probabilities of a symbol v being equal to one of the q elements
of Fq in the logarithmic domain, the notion of a q-dimensional LLR
vector is introduced
L(v) = L[0]
(v), L[1]
(v), L[2]
(v), . . . L[q−1]
(v)
= ln
P(v = 0)
P(v = 0)
, ln
P(v = 1)
P(v = 0)
, . . . ln
P(v = αq−2
)
P(v = 0)
.
Note that even though the first entry is redundant, we still include
it for implementation purpose since q is a power of two. The chan-
nel output vector L(z) = (L(z1), . . . L(zN )) on symbol level is a
vector of LLR vectors L(zn) = L[0]
(zn), . . . L[q−1]
(zn) whose
entries are computed from the binary LLRs according to
L[b]
(zn) = −
1≤i≤p: bi=1
L zbin
(n−1)·p+i 0 ≤ b ≤ q − 1
where bi is the i-th bit of the binary representation of the b-th element
of Fq. Representing all messages as LLR vectors and performing
all summations and multiplications of LLR vectors element-wise,
the signed-log domain FFT decoding algorithm can be described as
follows.
The message vectors Qn,m (message sent from variable node n
to check node m) and the symbol posteriors Qn are initialized with
the channel output L(zn). The messages Rm,n which are passed
from check node m to variable node n are initialized with zeros. The
check node update consists of multiple steps. First, the elements of
each variable to check message vector are permuted according to the
matrix entry a = Hm,n ∈ Fq{0}
Qn,m = Pa Qn,m , ∀ n, m.
The q − 1 permute functions Pa(·), with a ∈ Fq{0}, move the ith
element of Qn,m to position i · a in Qn,m. The permuted messages
are then transformed into the signed-log domain according to
ϕn,m = ϕs
n,m, ϕm
n,m
with ϕs
n,m = 1, ϕm
n,m = Qn,m, ∀ n, m
where the superscript (·)s
denotes the sign and the superscript (·)m
denotes the magnitude of a signed-log domain value. To transform
these values into the Fourier domain, the Fast Walsh-Hadamard
Transform F(·) is applied [7]:
Φn,m = F ϕn,m , ∀ n, m.
Finally, the check node update equations can be written as
Θ
s
m,n =
n′∈N (m){n}
Φ
s
n′,m, ∀ n, m (1a)
and Θ
m
m,n =
n′∈N (m){n}
Φ
m
n′,m, ∀ n, m. (1b)
Before the resulting message is sent back to a variable node, the
inverse Fourier transform
θm,n = F−1
Θm,n , ∀ n, m
is computed, the magnitude is extracted from the signed-log domain
Rm,n = θ
m
m,n, ∀ n, m,
and the inverse permutation (with a = Hm,n ∈ Fq{0})
Rm,n = P−1
a Rm,n , ∀ n, m
IEEE SiPS, Oct. 2013, Taipei, Taiwan 3 / 6
is applied to the message vector. The variable node update is exe-
cuted according to
Qn,m = L(zn)+


m′∈M(n){m}
Rm′,n

−δn,m, ∀ n, m. (2)
The normalization constant δn,m = maxb∈Fq Q
[b]
n,m helps avoiding
numerical problems. Furthermore, the posterior LLR vector
Qn = L(zn) +
m∈M(n)
Rm,n, ∀ n
is computed for each variable node and a hard decision is performed
according to ˆxn = arg maxb∈Fq Q
[b]
n . Based on this hard decision,
the parity check equations are evaluated at the end of each iteration.
The check and variable node updates are executed until all check
equations are satisfied or a maximum iteration counter is reached.
3. GPU IMPLEMENTATION
For an effective implementation, two main points have to be consid-
ered: the arrangement of the data in the device memory and the use
of the capabilities of the hardware for accelerating the data process-
ing. For an introduction to the CUDA terminology and specifics, we
refer the reader to [14].
3.1. Description of the Memory Arrangement
The most important structures are the messages Qn,m and Rm,n
that are passed between variable nodes and check nodes and the
posteriors Qn. All of these are accessed very frequently in most
kernels of the main loop. LDPC codes of different sizes and de-
gree profiles as well as different Galois field dimensions should be
supported, such that the potentially required amount of memory for
these message vectors can become too large for on-chip memory. In
addition, the need for quasi-random access to the message structures
only leaves the global device memory as suitable solution.
Although the data is stored as a one-dimensional array in the
device, it is interpreted as 2D (Qn) or 3D (Qn,m and Rm,n) struc-
tures, depending on the indexing as shown exemplarily for the 3D
case in Fig. 2. For the 3D arrays, the first index is pertinent to the
node, the second to the node connection and the third to the Galois
field element that corresponds to one entry of an LLR vector. The
2D arrays are indexed first by the node and second by the Galois
field element. This indexing points to the exact position of one el-
ement in the structure, and helps to set the CUDA grid and block
configuration [14] of the kernels.
Max. Node
Connections
Nodes
G
alois
Field
D
im
ension
Linear Device Memory
Fig. 2. Three-dimensional structures used for message vectors and
corresponding linear data allocation in the global device memory.
By arranging the data in this order, it is possible to access the
memory in a coalesced way, such that multiple float values are han-
dled in only one read/write memory access. If each of the q adjacent
LLR vector entries are accessed by q adjacent threads, this access
will be coalesced, which is a critical requirement for achieving high
throughput performance. This arrangement boosts the performance
with increasing q.
Note that CUDA requires fixed sizes for data structures, and
since irregular LDPC codes do not have a constant number of Tan-
ner graph edges per node, it is obligatory to set the node connection
dimension to the maximum value of all node degrees, taking care of
not accessing the invalid positions during the processing.
Temporary data that is needed within the kernels is either stored
in fast on-chip registers or in the on-chip shared memory if the data
is required by multiple threads within one block. A more extensive
description of the employed data structures can be found in [15].
Another GPU-specific implementation detail is the employment
of the GPU’s texture memory for the Galois field arithmetic. Two-
dimensional lookup tables of size q × q for the addition and multi-
plication of Galois field elements as well as one-dimensional lookup
tables for the inversion (size q−1) and conversion between exponen-
tial and decimal representation (size q) are initialized in the texture
memory and are, thus, available to all kernels.
3.2. Description of the CUDA Kernels
The implementation closely follows the algorithm described in
Sec. 2.2 and depicted in Fig. 1. Each of the gray shaded blocks
corresponds to one CUDA kernel (in CUDA terminology, a kernel
denotes an enclosed subroutine).
Initially, the number of iterations is zero. After initialization,
the input data is transferred to the CUDA device for an initial check
of the parity equations. If they are not fulfilled, the main decoding
loop is executed until they are, or until a maximum number of itera-
tions is reached. When the iterative decoding process finishes, a hard
decision is conducted to obtain the final binary decoding result. In
the following, the most important aspects of the main loop’s CUDA
kernel implementations are explained. An extensive description in-
cluding source code for the main routines can be found in [15].
3.2.1. Permute Message Vectors
Before and after the check node update, the elements of the message
LLR-vectors have to be permuted according to the Galois field ele-
ment at the corresponding check matrix position. A shared memory
block is used as temporary memory, allowing a parallel implemen-
tation of the permute operation. For permuting Rm,n, the shared
memory stores the input data using the permuted indices according
to the matrix entry Hm,n (using the texture tables for multiplication),
and writes the output data directly. For the permutation of Qn,m the
shared memory reads the input values directly, but the output is writ-
ten using the permuted indices as shown in Fig. 3. This way, one
thread (i.e., one of the many processing units of the GPU) per node
and per Galois field element can employed, without accessing the
same shared memory banks. Thus, no bank conflicts are generated,
resulting in full speed memory access.
3.2.2. Signed-Log Domain FFT
The implementations of the signed-log-FFT and the inverse signed-
log-FFT are identical and follow a butterfly diagram as shown in
Fig. 4. As explained in [7], additions in the log domain consist of two
IEEE SiPS, Oct. 2013, Taipei, Taiwan 4 / 6
...
...
Max. Node
Connections
Nodes
G
alois
Field
D
im
ension
Shared Mem.
Texture Mem.
Qn,m
Qn,m
Fig. 3. Permutation of message vectors Qn,m.
separate operations for the magnitude and the sign, which is why two
shared memory blocks are required for this purpose. Both operations
are executed simultaneously in the CUDA kernel implementation.
The signed-log-FFT is applied independently to each LLR-vector of
size q. Thus, the CUDA block width is fixed to q. For each node, a
loop processes all node connections to compute the signed-log-FFT
of the q LLR-vector elements. The execution of the butterfly is su-
pervised by a small loop of log2 q iterations. Since each continuous
thread accesses a different shared memory bank during the butterfly,
no bank conflicts are produced.
The exponential and logarithm functions that are needed for the
signed-log domain operations [7] are realized by the fast hardware
implemented functions which are provided by the GPU. The transfer
of data between the global memory and the shared memory is done
in a coalesced way, which is ensured by proper indexing.
3.2.3. Check Node Update
The implementation is visualized in Fig. 5 and uses one thread per
node and Galois field element to compute the outgoing message
LLR-vectors for each edge of the corresponding check node. Since
the computations are performed in the signed-log domain, magni-
tude and sign calculus are performed separately. In each thread, a
first loop over the node’s edges calculates the total sum of the mag-
nitude values and the total product of the sign values. In a second
loop, the current magnitude value is subtracted from the total magni-
tude sum to acquire the extrinsic sum, and the total sign is multiplied
by the current sign value for each edge. Shared memory is not re-
quired since all memory accesses are executed in a coalesced fashion
and, thus, provide high throughput performance.
NODES
NODES
...
...
Max. Node
Connections
Nodes
G
alois
Field
D
im
ension
Shared Memory
Butterfly NetworkLoopϕn,m
Φn,m
Fig. 4. Butterfly network for FFT implementation.
Max. Variable
Node Connections
Max. Check
Node Connections
VariableNodes
CheckNodes
G
alois
Field
D
im
ension
G
alois
Field
D
im
ension
Loop
Φn,m
Θm,n
Fig. 5. Check node update of message vectors Θm,n.
3.2.4. Variable Node Update
The kernel processes one thread per variable node and Galois field
dimension (see Fig. 6). An internal loop processes all the node con-
nections within one thread. Each thread first reads the input LLR zn
into a shared memory block and then adds all extrinsic information
from Rm,n for the current edge. In a second shared memory block,
the maximum resulting element δn,m of the LLR-vector is found
and subtracted from each entry from the first shared memory block
before the result is written into Qn,m. Since the shared memory op-
erations are inside the loop over the node connections, continuous
thread numbers access different shared memory banks in each itera-
tion. This avoids bank conflicts and provides maximum performance
for the shared memory.
3.2.5. Update Posteriors
The implementation of this kernel follows the one for the variable
node update, except that the search and subtraction of the maximum
element δn,m is omitted and the information of all edges rather than
all but one is added to the input LLR to yield the posterior LLR-
vectors Qn.
NODES
VariableNodes
G
alois
Field
D
im
.
LoopShared Memoryzn
δn,m
Rm,n Qn,m
Fig. 6. Variable node update of message vectors Qn,m.
IEEE SiPS, Oct. 2013, Taipei, Taiwan 5 / 6
0 1 2 3 4 5
0
0.5
1
1.5
2
2.5
0 1 2 3 4 5
0
2
4
6
8
10
12
14
16
dec.throughput[Mbit/s]
Eb/N0Eb/N0
GPU
CPU
F2
F4
F8
F16
F32
F64
F128
F256
dec.throughputperiteration[Mbit/s]Fig. 7. Average total decoder throughput (left) and average throughput per iteration (right) of coded bits over channel quality in Eb/N0 at a
maximum of 100 iterations. The zoomed region shows the CPU implementation.
3.2.6. Evaluate Check Equations
First, one thread per variable node is employed to determine a hard
decision of the node’s posterior LLR-vector by looping over the Ga-
lois field dimension to find the maximum element. After this, one
thread per check node loops over its edges and utilizes the texture
tables for Galois field addition and multiplication to add up the hard
decision values. The result is a binary vector of length M. To de-
cide if decoding can be stopped, a parallelized reduction algorithm
is applied to sum up the elements of this vector in log2 M steps. The
parity equations are fully satisfied if the result of the sum is zero.
4. PERFORMANCE EVALUATION
The CUDA algorithm implementation is executed under the same
settings as the single-core CPU reference implementation, and the
decoder throughput is measured at several channel SNRs (Eb/N0
values) and for codes over different Galois field sizes q between 21
and 28
in a binary input AWGN channel environment with BPSK
modulation. It should be noted that the obtained results are related
to the hardware architecture used in the simulations and will differ
if different CPUs and/or GPUs are used. For the experiments of this
paper, the following standard hardware and graphic card were used:
• CPU: Intel Core i7-920, 2.67GHz, 6GB RAM
• GPU: GeForce GTX 580, 512 CUDA cores, 1536 MB RAM,
CUDA Compute Capability 2.0.
Table 1 shows the different non-binary rate 1
2
LDPC codes, used
for obtaining the results in this section. Since one Galois field sym-
bol is composed of p bits, the code size N is always chosen accord-
ing to N = Nbin
/p (with p = log2 q), such that the equivalent num-
ber Nbin
of bits per code word is constant. Concentrated check node
degree distributions have been used for all codes. The non-binary
check matrices have been constructed with the PEG algorithm [16],
by first constructing a binary matrix of the given dimensions and
subsequently replacing all ones by non-zero Galois field elements
chosen uniformly at random. A maximum number of 100 decoding
iterations have been conducted. For all employed codes, the error
rate results of the GPU and the CPU implementation are identical
within the boundaries given by the different number representations.
The left hand side of Fig. 7 shows the average total throughput of
(coded) bits that is processed by one core of the CPU (dashed lines)
as well as the GPU implementation (solid lines) over the channel
quality in Eb/N0. The right hand side shows the according average
throughput per decoding iteration. In both plots, a zoom in shows
the curves for the CPU implementation in more detail.
The very low total throughput at SNR values below 1 dB is due
to the high maximum number of 100 decoding iterations that are al-
most always executed, before the waterfall region of the code’s error
rate curve is reached. For higher Eb/N0 values, the total through-
put steadily increases, since fewer decoding iterations have to be
executed. The curves in the right hand side plot, however, steadily
decrease with increasing SNR, which is due to the fact that a roughly
constant amount of execution time for initialization and all process-
ing outside the main decoding loop is needed, independent of the
number of actually conducted iterations.
Table 1. Dimensions and degree distributions Lixi
in node per-
spective of employed non-binary rate 1
2
LDPC codes. Degree distri-
butions for 2p
∈ {8, 16, 32, 64} were taken from [16]
2p
N = Nbin
p
M L2 L3 L4 L5
2 2304 1152 0.549 0.144 0.001 0.306
4 1152 576 0.537 0.302 0.042 0.119
8 768 384 0.643 0.150 0.194 0.013
16 576 288 0.773 0.102 0.115 0.010
32 462 231 0.848 0.143 0.009 0
64 384 192 0.940 0.050 0.010 0
128 330 165 0.852 0 0.148 0
256 288 144 0.990 0.010 0 0
IEEE SiPS, Oct. 2013, Taipei, Taiwan 6 / 6
Table 2. Achieved Speedup throughput GPU
throughput CPU
for different Fq and Eb/N0
Eb/N0 F2 F4 F8 F16 F32 F64 F128 F256
0 dB 17 18 30 44 61 84 112 128
1 dB 18 20 30 47 66 92 124 148
2 dB 15 17 25 39 56 76 109 113
3 dB 13 15 22 34 48 65 100 100
The speedup achieved by the GPU implementation with respect
to the CPU implementation is shown in Tab. 2 for several SNR val-
ues. While the speedup hugely increases from below 20 to almost
150 with increasing Galois field dimension, it is not affected too
much by the varying channel quality. The increasing speedup is
mainly due to the fact that groups of q Galois field elements are
stored in continuous memory positions, which leads to increased ef-
ficiency of the coalesced memory access. This increasing efficiency
of the GPU implementation leads to the interesting effect that the
average throughput is not monotonically decreasing with the Galois
field dimension, even though the decoding complexity increases as
O(Nbin
2p
) where p = log2 q is the number of bits per Galois field
symbol.
In Fig. 8 the achieved average total throughput is plotted against
the number of bits per symbol for different fixed Eb/N0 values.
Again, a zoomed region is shown for the CPU implementation. As
expected, the throughput of the CPU implementation quickly de-
creases with the number of bits per symbol. The GPU implementa-
tion, on the other hand, shows an increasing throughput up to around
4 bits per symbol and then also starts to decrease. This behavior can
be explained by the fact that for moderate Galois field dimensions,
the increasing computational complexity is overcompensated by the
more efficient use of the GPU’s capabilities due to the coalesced
memory access of q addresses at once. However, for larger Galois
field dimensions q ≥ 25
a saturation occurs and the throughput starts
to decrease.
5. CONCLUSION
One way of improving the performance of LDPC codes is to use
codes over higher order Galois Fields Fq and according non-binary
decoders. While the signed-log domain FFT decoding algorithm
provides increasing error correction performance with increasing
Galois field dimension, the computational complexity increases
significantly at the same time. In this paper, we have presented
an efficient implementation of the signed-log domain FFT decoder
exploiting the massive parallel compute capabilities of of today’s
GPUs using Nvidia’s CUDA framework. The key strengths of our
implementation are the increasing efficiency for increasing Galois
field dimensions and the applicability to arbitrary non-binary LDPC
codes. While higher speedup can be expected if the codes are con-
strained to be, e.g., regular and/or quasi-cyclic, the universality of
the presented GPU implementation enables accelerated performance
estimation without the need for any specific code properties.
6. REFERENCES
[1] R. G. Gallager, Low-Density Parity-Check Codes, M.I.T. Press, Cam-
bridge, MA, USA, 1963.
[2] D. J. C. MacKay and R. M. Neal, “Good Codes based on Very
2 4 6 81 3 5 7
0.5
1
1.5
0
2
0.12
0.08
0.04
0
dec.throughput[Mbit/s]
Bits per symbol p = log2 q
GPU CPU
0 dB
1 dB
2 dB
3 dB
Fig. 8. Throughput over Galois field dimension for Eb/N0 ∈
{0 dB, 1 dB, 2 dB, 3 dB}. Zoom shows CPU implementation.
Sparse Matrices,” in Cryptography and Coding, 5th IMA Conference,
Springer, Berlin, Germany, 1995, pp. 100–111.
[3] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference, Morgan Kaufmann, San Mateo, CA, USA, 1988.
[4] M. C. Davey and D. J. C. MacKay, “Low-Density Parity Check Codes
over GF(q),” IEEE Comm. Lett., vol. 2, no. 6, pp. 165–167, June 1998.
[5] L. Barnault and D. Declercq, “Fast Decoding Algorithm for LDPC over
GF(2q),” in Proc. IEEE Inform. Theory Workshop (ITW), Apr. 2003,
pp. 70–73.
[6] H. Song and J.R. Cruz, “Reduced-Complexity Decoding of Q-ary
LDPC Codes for Magnetic Recording,” IEEE Trans. Magn., vol. 39,
no. 2, pp. 1081–1087, Mar. 2003.
[7] G. J. Byers and F. Takawira, “Fourier Transform Decoding of Non-
Binary LDPC Codes,” in Proc. Southern African Telecommunication
Networks and Applications Conference (SATNAC), Sept. 2004.
[8] A. Voicila, D. Declercq, F. Verdier, M. Fossorier, and P. Urard, “Low-
Complexity Decoding for Non-Binary LDPC Codes in High Order
Fields,” IEEE Trans. Commun., vol. 58, no. 5, pp. 1365–1375, May
2010.
[9] T. P. Chen and Y.-K. Chen, “Challenges and Opportunities of Obtaining
Performance from Multi-Core CPUs and Many-Core GPUs,” in Proc.
IEEE ICASSP, Taipei, Taiwan, Apr. 2009, pp. 609–613.
[10] C.-I. Colombo Nilsen and I. Hafizovic, “Digital Beamforming Using a
GPU,” in Proc. IEEE ICASSP, Taipei, Taiwan, Apr. 2009, pp. 609–613.
[11] T. R. Halfhill, “Parallel Processing with CUDA,” Microprocessor Re-
port, Jan. 2008.
[12] R. M. Tanner, “A Recursive Approach to Low Complexity Codes,”
IEEE Trans. Inform. Theory, vol. 27, no. 5, pp. 533–547, Sept. 1981.
[13] H. Wymeersch, H. Steendam, and M. Moeneclaey, “Log-Domain De-
coding of LDPC Codes over GF(q),” in Proc. IEEE International Con-
ference on Communications, June 2004.
[14] Shane Cook, CUDA Programming: A Developer’s Guide to Parallel
Computing With GPUs, Morgan Kaufmann, 2012.
[15] E. Monzó, “Massive Parallel Decoding of Low-Density Parity-Check
Codes Using Graphic Cards,” M.S. thesis, RWTH Aachen University
& Universidad Politecnica de Valencia, 2010.
[16] X. Y. Hu, E. Eleftheriou, and D. M. Arnold, “Regular and Irregular
Progressive Edge-Growth Tanner Graphs,” IEEE Trans. Inform. The-
ory, vol. 51, no. 1, pp. 386–398, Jan. 2005.

More Related Content

What's hot

Iaetsd low power flip flops for vlsi applications
Iaetsd low power flip flops for vlsi applicationsIaetsd low power flip flops for vlsi applications
Iaetsd low power flip flops for vlsi applicationsIaetsd Iaetsd
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Fault Tolerant Parallel Filters Based On Bch Codes
Fault Tolerant Parallel Filters Based On Bch CodesFault Tolerant Parallel Filters Based On Bch Codes
Fault Tolerant Parallel Filters Based On Bch CodesIJERA Editor
 
A new efficient fpga design of residue to-binary converter
A new efficient fpga design of residue to-binary converterA new efficient fpga design of residue to-binary converter
A new efficient fpga design of residue to-binary converterVLSICS Design
 
PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM...
PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM...PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM...
PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM...ijcseit
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
Dynamic time warping and PIC 16F676 for control of devices
Dynamic time warping and PIC 16F676 for control of devicesDynamic time warping and PIC 16F676 for control of devices
Dynamic time warping and PIC 16F676 for control of devicesRoger Gomes
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Design of Reversible Sequential Circuit Using Reversible Logic Synthesis
Design of Reversible Sequential Circuit Using Reversible Logic SynthesisDesign of Reversible Sequential Circuit Using Reversible Logic Synthesis
Design of Reversible Sequential Circuit Using Reversible Logic SynthesisVLSICS Design
 
haffman coding DCT transform
haffman coding DCT transformhaffman coding DCT transform
haffman coding DCT transformaniruddh Tyagi
 
SOLUTION MANUAL OF COMPUTER ORGANIZATION BY CARL HAMACHER, ZVONKO VRANESIC & ...
SOLUTION MANUAL OF COMPUTER ORGANIZATION BY CARL HAMACHER, ZVONKO VRANESIC & ...SOLUTION MANUAL OF COMPUTER ORGANIZATION BY CARL HAMACHER, ZVONKO VRANESIC & ...
SOLUTION MANUAL OF COMPUTER ORGANIZATION BY CARL HAMACHER, ZVONKO VRANESIC & ...vtunotesbysree
 
An Energy-Efficient Lut-Log-Bcjr Architecture Using Constant Log Bcjr Algorithm
An Energy-Efficient Lut-Log-Bcjr Architecture Using Constant Log Bcjr AlgorithmAn Energy-Efficient Lut-Log-Bcjr Architecture Using Constant Log Bcjr Algorithm
An Energy-Efficient Lut-Log-Bcjr Architecture Using Constant Log Bcjr AlgorithmIJERA Editor
 
A Low Power VITERBI Decoder Design With Minimum Transition Hybrid Register Ex...
A Low Power VITERBI Decoder Design With Minimum Transition Hybrid Register Ex...A Low Power VITERBI Decoder Design With Minimum Transition Hybrid Register Ex...
A Low Power VITERBI Decoder Design With Minimum Transition Hybrid Register Ex...VLSICS Design
 

What's hot (20)

Iaetsd low power flip flops for vlsi applications
Iaetsd low power flip flops for vlsi applicationsIaetsd low power flip flops for vlsi applications
Iaetsd low power flip flops for vlsi applications
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
Fault Tolerant Parallel Filters Based On Bch Codes
Fault Tolerant Parallel Filters Based On Bch CodesFault Tolerant Parallel Filters Based On Bch Codes
Fault Tolerant Parallel Filters Based On Bch Codes
 
Gf3511031106
Gf3511031106Gf3511031106
Gf3511031106
 
A new efficient fpga design of residue to-binary converter
A new efficient fpga design of residue to-binary converterA new efficient fpga design of residue to-binary converter
A new efficient fpga design of residue to-binary converter
 
PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM...
PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM...PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM...
PERFORMANCE OF ITERATIVE LDPC-BASED SPACE-TIME TRELLIS CODED MIMO-OFDM SYSTEM...
 
D04561722
D04561722D04561722
D04561722
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
Data Communication Unit 2
Data Communication Unit 2 Data Communication Unit 2
Data Communication Unit 2
 
Dynamic time warping and PIC 16F676 for control of devices
Dynamic time warping and PIC 16F676 for control of devicesDynamic time warping and PIC 16F676 for control of devices
Dynamic time warping and PIC 16F676 for control of devices
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Dg34662666
Dg34662666Dg34662666
Dg34662666
 
Design of Reversible Sequential Circuit Using Reversible Logic Synthesis
Design of Reversible Sequential Circuit Using Reversible Logic SynthesisDesign of Reversible Sequential Circuit Using Reversible Logic Synthesis
Design of Reversible Sequential Circuit Using Reversible Logic Synthesis
 
haffman coding DCT transform
haffman coding DCT transformhaffman coding DCT transform
haffman coding DCT transform
 
wcnc05
wcnc05wcnc05
wcnc05
 
SOLUTION MANUAL OF COMPUTER ORGANIZATION BY CARL HAMACHER, ZVONKO VRANESIC & ...
SOLUTION MANUAL OF COMPUTER ORGANIZATION BY CARL HAMACHER, ZVONKO VRANESIC & ...SOLUTION MANUAL OF COMPUTER ORGANIZATION BY CARL HAMACHER, ZVONKO VRANESIC & ...
SOLUTION MANUAL OF COMPUTER ORGANIZATION BY CARL HAMACHER, ZVONKO VRANESIC & ...
 
An Energy-Efficient Lut-Log-Bcjr Architecture Using Constant Log Bcjr Algorithm
An Energy-Efficient Lut-Log-Bcjr Architecture Using Constant Log Bcjr AlgorithmAn Energy-Efficient Lut-Log-Bcjr Architecture Using Constant Log Bcjr Algorithm
An Energy-Efficient Lut-Log-Bcjr Architecture Using Constant Log Bcjr Algorithm
 
ECE 565 FInal Project
ECE 565 FInal ProjectECE 565 FInal Project
ECE 565 FInal Project
 
A Low Power VITERBI Decoder Design With Minimum Transition Hybrid Register Ex...
A Low Power VITERBI Decoder Design With Minimum Transition Hybrid Register Ex...A Low Power VITERBI Decoder Design With Minimum Transition Hybrid Register Ex...
A Low Power VITERBI Decoder Design With Minimum Transition Hybrid Register Ex...
 
Chap12 slides
Chap12 slidesChap12 slides
Chap12 slides
 

Viewers also liked

386.проектирование палубных конструкций и оборудования
386.проектирование палубных конструкций и оборудования386.проектирование палубных конструкций и оборудования
386.проектирование палубных конструкций и оборудованияivanov1566359955
 
Arado main certifecete
Arado main certifeceteArado main certifecete
Arado main certifeceteselwan mehgoub
 
Fork-Lift Operator Certificate
Fork-Lift Operator CertificateFork-Lift Operator Certificate
Fork-Lift Operator CertificateEsquire Makgale
 
Food Safety Certified
Food Safety CertifiedFood Safety Certified
Food Safety CertifiedKurt Thomas
 
Presentation to OT Study Day Lancaster 2016
Presentation to OT Study Day Lancaster 2016Presentation to OT Study Day Lancaster 2016
Presentation to OT Study Day Lancaster 2016Rachel Jones
 
PPP- las regiones de luisiana
PPP- las regiones de luisianaPPP- las regiones de luisiana
PPP- las regiones de luisianaNelly De la Cruz
 
บทคัดย่อ
บทคัดย่อบทคัดย่อ
บทคัดย่อPa'rig Prig
 
Curso completo de desenho vol 01 - natureza morta
Curso completo de desenho   vol 01 - natureza mortaCurso completo de desenho   vol 01 - natureza morta
Curso completo de desenho vol 01 - natureza mortapaulojadacosta
 
GarumとMeteoriteと私 3nd Edition ver0.04
GarumとMeteoriteと私 3nd Edition ver0.04GarumとMeteoriteと私 3nd Edition ver0.04
GarumとMeteoriteと私 3nd Edition ver0.04Shinobu Okano
 
Proyecto Transportes Karlita
Proyecto Transportes KarlitaProyecto Transportes Karlita
Proyecto Transportes Karlitatutor03770
 

Viewers also liked (15)

バギオ留学 TALK Yangcoセンターのシラバス(英語コース) 2016
バギオ留学 TALK Yangcoセンターのシラバス(英語コース) 2016バギオ留学 TALK Yangcoセンターのシラバス(英語コース) 2016
バギオ留学 TALK Yangcoセンターのシラバス(英語コース) 2016
 
386.проектирование палубных конструкций и оборудования
386.проектирование палубных конструкций и оборудования386.проектирование палубных конструкций и оборудования
386.проектирование палубных конструкций и оборудования
 
Arado main certifecete
Arado main certifeceteArado main certifecete
Arado main certifecete
 
Fork-Lift Operator Certificate
Fork-Lift Operator CertificateFork-Lift Operator Certificate
Fork-Lift Operator Certificate
 
Food Safety Certified
Food Safety CertifiedFood Safety Certified
Food Safety Certified
 
Presentation to OT Study Day Lancaster 2016
Presentation to OT Study Day Lancaster 2016Presentation to OT Study Day Lancaster 2016
Presentation to OT Study Day Lancaster 2016
 
PPP- las regiones de luisiana
PPP- las regiones de luisianaPPP- las regiones de luisiana
PPP- las regiones de luisiana
 
บทคัดย่อ
บทคัดย่อบทคัดย่อ
บทคัดย่อ
 
Curso completo de desenho vol 01 - natureza morta
Curso completo de desenho   vol 01 - natureza mortaCurso completo de desenho   vol 01 - natureza morta
Curso completo de desenho vol 01 - natureza morta
 
GarumとMeteoriteと私 3nd Edition ver0.04
GarumとMeteoriteと私 3nd Edition ver0.04GarumとMeteoriteと私 3nd Edition ver0.04
GarumとMeteoriteと私 3nd Edition ver0.04
 
Lect1.1
Lect1.1Lect1.1
Lect1.1
 
Q2
Q2Q2
Q2
 
LDPC Codes
LDPC CodesLDPC Codes
LDPC Codes
 
Proyecto Transportes Karlita
Proyecto Transportes KarlitaProyecto Transportes Karlita
Proyecto Transportes Karlita
 
Naperville Montessori
Naperville MontessoriNaperville Montessori
Naperville Montessori
 

Similar to High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)

GF(q) LDPC encoder and decoder FPGA implementation using group shuffled beli...
GF(q) LDPC encoder and decoder FPGA implementation using  group shuffled beli...GF(q) LDPC encoder and decoder FPGA implementation using  group shuffled beli...
GF(q) LDPC encoder and decoder FPGA implementation using group shuffled beli...IJECEIAES
 
Design and implementation of log domain decoder
Design and implementation of log domain decoder Design and implementation of log domain decoder
Design and implementation of log domain decoder IJECEIAES
 
Hossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on GraphsHossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on Graphsknowdiff
 
Ebc7fc8ba9801f03982acec158fa751744ca copie
Ebc7fc8ba9801f03982acec158fa751744ca   copieEbc7fc8ba9801f03982acec158fa751744ca   copie
Ebc7fc8ba9801f03982acec158fa751744ca copieSourour Kanzari
 
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Sourour Kanzari
 
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Sourour Kanzari
 
Performance comparison of eg ldpc codes
Performance comparison of eg ldpc codesPerformance comparison of eg ldpc codes
Performance comparison of eg ldpc codesijcsity
 
REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N
REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N
REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N VLSICS Design
 
Multi-carrier Equalization by Restoration of RedundancY (MERRY) for Adaptive ...
Multi-carrier Equalization by Restoration of RedundancY (MERRY) for Adaptive ...Multi-carrier Equalization by Restoration of RedundancY (MERRY) for Adaptive ...
Multi-carrier Equalization by Restoration of RedundancY (MERRY) for Adaptive ...IJNSA Journal
 
Energy-Efficient LDPC Decoder using DVFS for binary sources
Energy-Efficient LDPC Decoder using DVFS for binary sourcesEnergy-Efficient LDPC Decoder using DVFS for binary sources
Energy-Efficient LDPC Decoder using DVFS for binary sourcesIDES Editor
 
FPGA Implementation of LDPC Encoder for Terrestrial Television
FPGA Implementation of LDPC Encoder for Terrestrial TelevisionFPGA Implementation of LDPC Encoder for Terrestrial Television
FPGA Implementation of LDPC Encoder for Terrestrial TelevisionAI Publications
 
Belief Propagation Decoder for LDPC Codes Based on VLSI Implementation
Belief Propagation Decoder for LDPC Codes Based on VLSI ImplementationBelief Propagation Decoder for LDPC Codes Based on VLSI Implementation
Belief Propagation Decoder for LDPC Codes Based on VLSI Implementationinventionjournals
 
Iisrt jona priyaa(1 5)
Iisrt jona priyaa(1 5)Iisrt jona priyaa(1 5)
Iisrt jona priyaa(1 5)IISRT
 
Low complexity design of non binary ldpc decoder using extended min-sum algor...
Low complexity design of non binary ldpc decoder using extended min-sum algor...Low complexity design of non binary ldpc decoder using extended min-sum algor...
Low complexity design of non binary ldpc decoder using extended min-sum algor...eSAT Journals
 
PERFORMANCE ESTIMATION OF LDPC CODE SUING SUM PRODUCT ALGORITHM AND BIT FLIPP...
PERFORMANCE ESTIMATION OF LDPC CODE SUING SUM PRODUCT ALGORITHM AND BIT FLIPP...PERFORMANCE ESTIMATION OF LDPC CODE SUING SUM PRODUCT ALGORITHM AND BIT FLIPP...
PERFORMANCE ESTIMATION OF LDPC CODE SUING SUM PRODUCT ALGORITHM AND BIT FLIPP...Journal For Research
 
Design and Implementation of Encoder for (15, k) Binary BCH Code Using VHDL a...
Design and Implementation of Encoder for (15, k) Binary BCH Code Using VHDL a...Design and Implementation of Encoder for (15, k) Binary BCH Code Using VHDL a...
Design and Implementation of Encoder for (15, k) Binary BCH Code Using VHDL a...IOSR Journals
 
An identification of the tolerable time-interleaved analog-todigital convert...
An identification of the tolerable time-interleaved analog-todigital convert...An identification of the tolerable time-interleaved analog-todigital convert...
An identification of the tolerable time-interleaved analog-todigital convert...IJECEIAES
 

Similar to High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper) (20)

GF(q) LDPC encoder and decoder FPGA implementation using group shuffled beli...
GF(q) LDPC encoder and decoder FPGA implementation using  group shuffled beli...GF(q) LDPC encoder and decoder FPGA implementation using  group shuffled beli...
GF(q) LDPC encoder and decoder FPGA implementation using group shuffled beli...
 
Design and implementation of log domain decoder
Design and implementation of log domain decoder Design and implementation of log domain decoder
Design and implementation of log domain decoder
 
Hossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on GraphsHossein Taghavi : Codes on Graphs
Hossein Taghavi : Codes on Graphs
 
Ebc7fc8ba9801f03982acec158fa751744ca copie
Ebc7fc8ba9801f03982acec158fa751744ca   copieEbc7fc8ba9801f03982acec158fa751744ca   copie
Ebc7fc8ba9801f03982acec158fa751744ca copie
 
LDPC Encoding and Hamming Encoding
LDPC Encoding and Hamming EncodingLDPC Encoding and Hamming Encoding
LDPC Encoding and Hamming Encoding
 
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
 
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
Andrade sep15 fromlowarchitecturalexpertiseuptohighthroughputnonbinaryldpcdec...
 
Performance comparison of eg ldpc codes
Performance comparison of eg ldpc codesPerformance comparison of eg ldpc codes
Performance comparison of eg ldpc codes
 
Presentation
PresentationPresentation
Presentation
 
REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N
REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N
REDUCED COMPLEXITY QUASI-CYCLIC LDPC ENCODER FOR IEEE 802.11N
 
Multi-carrier Equalization by Restoration of RedundancY (MERRY) for Adaptive ...
Multi-carrier Equalization by Restoration of RedundancY (MERRY) for Adaptive ...Multi-carrier Equalization by Restoration of RedundancY (MERRY) for Adaptive ...
Multi-carrier Equalization by Restoration of RedundancY (MERRY) for Adaptive ...
 
Energy-Efficient LDPC Decoder using DVFS for binary sources
Energy-Efficient LDPC Decoder using DVFS for binary sourcesEnergy-Efficient LDPC Decoder using DVFS for binary sources
Energy-Efficient LDPC Decoder using DVFS for binary sources
 
LDPC_CODES.ppt
LDPC_CODES.pptLDPC_CODES.ppt
LDPC_CODES.ppt
 
FPGA Implementation of LDPC Encoder for Terrestrial Television
FPGA Implementation of LDPC Encoder for Terrestrial TelevisionFPGA Implementation of LDPC Encoder for Terrestrial Television
FPGA Implementation of LDPC Encoder for Terrestrial Television
 
Belief Propagation Decoder for LDPC Codes Based on VLSI Implementation
Belief Propagation Decoder for LDPC Codes Based on VLSI ImplementationBelief Propagation Decoder for LDPC Codes Based on VLSI Implementation
Belief Propagation Decoder for LDPC Codes Based on VLSI Implementation
 
Iisrt jona priyaa(1 5)
Iisrt jona priyaa(1 5)Iisrt jona priyaa(1 5)
Iisrt jona priyaa(1 5)
 
Low complexity design of non binary ldpc decoder using extended min-sum algor...
Low complexity design of non binary ldpc decoder using extended min-sum algor...Low complexity design of non binary ldpc decoder using extended min-sum algor...
Low complexity design of non binary ldpc decoder using extended min-sum algor...
 
PERFORMANCE ESTIMATION OF LDPC CODE SUING SUM PRODUCT ALGORITHM AND BIT FLIPP...
PERFORMANCE ESTIMATION OF LDPC CODE SUING SUM PRODUCT ALGORITHM AND BIT FLIPP...PERFORMANCE ESTIMATION OF LDPC CODE SUING SUM PRODUCT ALGORITHM AND BIT FLIPP...
PERFORMANCE ESTIMATION OF LDPC CODE SUING SUM PRODUCT ALGORITHM AND BIT FLIPP...
 
Design and Implementation of Encoder for (15, k) Binary BCH Code Using VHDL a...
Design and Implementation of Encoder for (15, k) Binary BCH Code Using VHDL a...Design and Implementation of Encoder for (15, k) Binary BCH Code Using VHDL a...
Design and Implementation of Encoder for (15, k) Binary BCH Code Using VHDL a...
 
An identification of the tolerable time-interleaved analog-todigital convert...
An identification of the tolerable time-interleaved analog-todigital convert...An identification of the tolerable time-interleaved analog-todigital convert...
An identification of the tolerable time-interleaved analog-todigital convert...
 

Recently uploaded

VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineeringmalavadedarshan25
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learningmisbanausheenparvam
 

Recently uploaded (20)

VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
Internship report on mechanical engineering
Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
 

High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs (Paper)

  • 1. IEEE SiPS, Oct. 2013, Taipei, Taiwan 1 / 6 HIGH SPEED DECODING OF NON-BINARY IRREGULAR LDPC CODES USING GPUS Moritz Beermann, Enrique Monzó⋆ , Laurent Schmalen† , Peter Vary Institute of Communication Systems and Data Processing ( ), RWTH Aachen University, Germany ⋆ now with Leineweber GmbH, Aachen, Germany † now with Bell Laboratories, Alcatel-Lucent, Stuttgart, Germany {beermann|vary}@ind.rwth-aachen.de ABSTRACT Low-Density Parity-Check (LDPC) codes are very powerful chan- nel coding schemes with a broad range of applications. The exis- tence of low complexity (i.e., linear time) iterative message passing decoders with close to optimum error correction performance is one of the main strengths of LDPC codes. It has been shown that the performance of these decoders can be further enhanced if the LDPC codes are extended to higher order Galois fields, yielding so called non-binary LDPC codes. However, this performance gain comes at the cost of rapidly increasing decoding complexity. To deal with this increased complexity, we present an efficient implementation of a signed-log domain FFT decoder for non-binary irregular LDPC codes that exploits the inherent massive parallelization capabilities of message passing decoders. We employ Nvidia’s Compute Unified Device Architecture (CUDA) to incorporate the available processing power of state-of-the-art Graphics Processing Units (GPUs). Index Terms— non-binary LDPC codes, iterative decoding, GPU implementation 1. INTRODUCTION Low-Density Parity-Check (LDPC) codes were originally presented in [1] and later rediscovered in [2]. The most prominent decod- ing algorithm for LDPC codes is the belief propagation (BP) algo- rithm [3], which has linear time complexity. Even though BP de- coding is only an approximation of maximum likelihood for practi- cal codes, a performance close to the Shannon limit is observed for long LDPC codes. It has been shown that the performance of short- to medium-length LDPC codes can be improved by employing so called non-binary LDPC codes over higher order Galois fields Fq with q = 2p (p ∈ N) elements [4]. While the decoding complexity of binary LDPC codes of di- mension Nbin is in the order of O(Nbin ), this is not the case for non- binary LDPC codes of length N = Nbin /p (this relation of N and Nbin serves for a fair comparison, as will be explained later). With a straightforward implementation [4] , the decoding complexity in- creases to O(Nq2 ) = O(N22p ). With an intelligent, signed-log domain Fast Fourier transform (FFT) based approach [5, 6, 7], this complexity can be reduced to O(Nq log2 q), which will still be con- siderably higher than the complexity of binary belief propagation, if large Galois field dimensions are used. Especially for applications where decoding complexity is not a main concern (e.g., deep space applications), employing non-binary LDPC codes is a good choice. They exhibit excellent error correc- tion performance while the complexity stays linearithmic (i.e., better than any polynomial time algorithm with exponent greater than 1). Approaches to further decrease the decoding complexity exist (e.g., [8]), but they suffer from an additional performance penalty compared to BP. However, for comparison, it is often desirable to assess the performance of the BP decoder without any approxima- tions. Therefore, we propose to build a fast reference decoder using the parallel computing capabilities of today’s graphic cards. The massive computational power of state-of-the-art graphic cards has led to the utilization of Graphics Processing Units (GPUs) for signal processing applications, e.g., [9], [10]. Graphic cards with GPUs by Nvidia Corporation can be elegantly programmed in a language similar to C using Nvidia’s Compute Unified Device Architecture (CUDA) [11]. The main advantage of GPUs is the large amount of processing cores available on the processor, even in low-end, low-cost products. These can be used for a parallel execution of many problems. In this paper, we present a massively parallelized implementa- tion of the signed-log domain FFT based decoder that was presented in [7]. The implementation is generic, i.e., it works for arbitrary ir- regular codes and does not require any special code structure. The CUDA framework is employed to adapt a single-core reference CPU implementation to be executed on one of Nvidia’s general purpose GPUs. Decoder throughput results are presented for the CPU as well as the GPU implementations for different Galois field dimensions. 2. NON-BINARY LDPC CODES 2.1. Description An (N, K) non-binary LDPC code over Fq := {0, 1, α, . . . αq−2 } (α denoting a root of the primitive polynomial defining Fq) with q = 2p (N coded symbols and K information symbols of p ∈ N bit each) is defined as the null space of a sparse parity check ma- trix H with entries from Fq and of dimension M × N (with M = N − K being the number of parity check symbols). Hm,n ∈ Fq denotes the entry of H at row m and column n. The set N(m) = {n : Hm,n = 0} denotes the symbols that participate in parity check equation m. Similarly, the set M(n) = {m : Hm,n = 0} contains the check equations in which symbol n participates. Exclusion is de- noted by the operator “”,e.g., M(n){m} describes the set M(n) with check equation m excluded. An irregular LDPC code has the property that |M(n)| = const. and |N(m)| = const. Similarly to the binary case, non-binary LDPC codes are commonly described by their (weighted) Tanner graph [12], which is a bipartite graph with N variable nodes and M check nodes connected by edges of weight Hm,n according to the entries of H. Most decoding algorithms are so called message passing algorithms that iteratively pass messages between the variable and check nodes.
  • 2. IEEE SiPS, Oct. 2013, Taipei, Taiwan 2 / 6 Channel LLRs Output Bits Yes No Initialize Qn, Qn,m, and Rm,n Evaluate Check Equations Abort? Hard Decision of Qn Start Stop Permute Qn,m Log-FFT F(·) Check Node Update (1a,b) Inv. Log-FFT F−1(·) Inv. Permute Rm,n Variable Node Update (2) Update posteriors Qn Fig. 1. Flowchart of the non-binary signed-log domain FFT LDPC decoder. Each gray shaded block represents a CUDA kernel (see Sec. 3.2). 2.2. Decoding Algorithm: Non-Binary Belief Propagation A decoding algorithm for non-binary LDPC codes over Fq has al- ready been described in [4]. This algorithm uses the Fast Fourier Transform (FFT) over Fq for efficiently computing the check node operations (see also [5]). Its drawback is, however, that the imple- mentation needs to be carried out in the probability domain and is, thus, not well suited for an implementation due to eventual numer- ical instabilities. An implementation in the log domain has been given in [13], which, however, does not make use of the Fourier transform for a low-complexity computation of the check node op- eration (which considerably reduces the complexity for large field sizes q). In this work, we follow the approach of [6, 7], which uses an FFT and a representation of messages in the signed-log domain to carry out the check node update. Other approaches to reduce the complexity aim at computing only the most relevant terms of the update equation [8], but the resulting decoder shows performance losses. In the following, we briefly describe the signed-log domain algorithm according to [7]. A flowchart is given in Fig. 1. The binary input additive white Gaussian noise (BIAWGN) channel output vector of (binary) log likelihood ratios (LLRs) is denoted L(zbin ) = L(zbin 1 ), . . . L(zbin N·p) with zbin i denoting the re- ceived, noisy binary phase shift keying (BPSK) symbol. To describe the q probabilities of a symbol v being equal to one of the q elements of Fq in the logarithmic domain, the notion of a q-dimensional LLR vector is introduced L(v) = L[0] (v), L[1] (v), L[2] (v), . . . L[q−1] (v) = ln P(v = 0) P(v = 0) , ln P(v = 1) P(v = 0) , . . . ln P(v = αq−2 ) P(v = 0) . Note that even though the first entry is redundant, we still include it for implementation purpose since q is a power of two. The chan- nel output vector L(z) = (L(z1), . . . L(zN )) on symbol level is a vector of LLR vectors L(zn) = L[0] (zn), . . . L[q−1] (zn) whose entries are computed from the binary LLRs according to L[b] (zn) = − 1≤i≤p: bi=1 L zbin (n−1)·p+i 0 ≤ b ≤ q − 1 where bi is the i-th bit of the binary representation of the b-th element of Fq. Representing all messages as LLR vectors and performing all summations and multiplications of LLR vectors element-wise, the signed-log domain FFT decoding algorithm can be described as follows. The message vectors Qn,m (message sent from variable node n to check node m) and the symbol posteriors Qn are initialized with the channel output L(zn). The messages Rm,n which are passed from check node m to variable node n are initialized with zeros. The check node update consists of multiple steps. First, the elements of each variable to check message vector are permuted according to the matrix entry a = Hm,n ∈ Fq{0} Qn,m = Pa Qn,m , ∀ n, m. The q − 1 permute functions Pa(·), with a ∈ Fq{0}, move the ith element of Qn,m to position i · a in Qn,m. The permuted messages are then transformed into the signed-log domain according to ϕn,m = ϕs n,m, ϕm n,m with ϕs n,m = 1, ϕm n,m = Qn,m, ∀ n, m where the superscript (·)s denotes the sign and the superscript (·)m denotes the magnitude of a signed-log domain value. To transform these values into the Fourier domain, the Fast Walsh-Hadamard Transform F(·) is applied [7]: Φn,m = F ϕn,m , ∀ n, m. Finally, the check node update equations can be written as Θ s m,n = n′∈N (m){n} Φ s n′,m, ∀ n, m (1a) and Θ m m,n = n′∈N (m){n} Φ m n′,m, ∀ n, m. (1b) Before the resulting message is sent back to a variable node, the inverse Fourier transform θm,n = F−1 Θm,n , ∀ n, m is computed, the magnitude is extracted from the signed-log domain Rm,n = θ m m,n, ∀ n, m, and the inverse permutation (with a = Hm,n ∈ Fq{0}) Rm,n = P−1 a Rm,n , ∀ n, m
  • 3. IEEE SiPS, Oct. 2013, Taipei, Taiwan 3 / 6 is applied to the message vector. The variable node update is exe- cuted according to Qn,m = L(zn)+   m′∈M(n){m} Rm′,n  −δn,m, ∀ n, m. (2) The normalization constant δn,m = maxb∈Fq Q [b] n,m helps avoiding numerical problems. Furthermore, the posterior LLR vector Qn = L(zn) + m∈M(n) Rm,n, ∀ n is computed for each variable node and a hard decision is performed according to ˆxn = arg maxb∈Fq Q [b] n . Based on this hard decision, the parity check equations are evaluated at the end of each iteration. The check and variable node updates are executed until all check equations are satisfied or a maximum iteration counter is reached. 3. GPU IMPLEMENTATION For an effective implementation, two main points have to be consid- ered: the arrangement of the data in the device memory and the use of the capabilities of the hardware for accelerating the data process- ing. For an introduction to the CUDA terminology and specifics, we refer the reader to [14]. 3.1. Description of the Memory Arrangement The most important structures are the messages Qn,m and Rm,n that are passed between variable nodes and check nodes and the posteriors Qn. All of these are accessed very frequently in most kernels of the main loop. LDPC codes of different sizes and de- gree profiles as well as different Galois field dimensions should be supported, such that the potentially required amount of memory for these message vectors can become too large for on-chip memory. In addition, the need for quasi-random access to the message structures only leaves the global device memory as suitable solution. Although the data is stored as a one-dimensional array in the device, it is interpreted as 2D (Qn) or 3D (Qn,m and Rm,n) struc- tures, depending on the indexing as shown exemplarily for the 3D case in Fig. 2. For the 3D arrays, the first index is pertinent to the node, the second to the node connection and the third to the Galois field element that corresponds to one entry of an LLR vector. The 2D arrays are indexed first by the node and second by the Galois field element. This indexing points to the exact position of one el- ement in the structure, and helps to set the CUDA grid and block configuration [14] of the kernels. Max. Node Connections Nodes G alois Field D im ension Linear Device Memory Fig. 2. Three-dimensional structures used for message vectors and corresponding linear data allocation in the global device memory. By arranging the data in this order, it is possible to access the memory in a coalesced way, such that multiple float values are han- dled in only one read/write memory access. If each of the q adjacent LLR vector entries are accessed by q adjacent threads, this access will be coalesced, which is a critical requirement for achieving high throughput performance. This arrangement boosts the performance with increasing q. Note that CUDA requires fixed sizes for data structures, and since irregular LDPC codes do not have a constant number of Tan- ner graph edges per node, it is obligatory to set the node connection dimension to the maximum value of all node degrees, taking care of not accessing the invalid positions during the processing. Temporary data that is needed within the kernels is either stored in fast on-chip registers or in the on-chip shared memory if the data is required by multiple threads within one block. A more extensive description of the employed data structures can be found in [15]. Another GPU-specific implementation detail is the employment of the GPU’s texture memory for the Galois field arithmetic. Two- dimensional lookup tables of size q × q for the addition and multi- plication of Galois field elements as well as one-dimensional lookup tables for the inversion (size q−1) and conversion between exponen- tial and decimal representation (size q) are initialized in the texture memory and are, thus, available to all kernels. 3.2. Description of the CUDA Kernels The implementation closely follows the algorithm described in Sec. 2.2 and depicted in Fig. 1. Each of the gray shaded blocks corresponds to one CUDA kernel (in CUDA terminology, a kernel denotes an enclosed subroutine). Initially, the number of iterations is zero. After initialization, the input data is transferred to the CUDA device for an initial check of the parity equations. If they are not fulfilled, the main decoding loop is executed until they are, or until a maximum number of itera- tions is reached. When the iterative decoding process finishes, a hard decision is conducted to obtain the final binary decoding result. In the following, the most important aspects of the main loop’s CUDA kernel implementations are explained. An extensive description in- cluding source code for the main routines can be found in [15]. 3.2.1. Permute Message Vectors Before and after the check node update, the elements of the message LLR-vectors have to be permuted according to the Galois field ele- ment at the corresponding check matrix position. A shared memory block is used as temporary memory, allowing a parallel implemen- tation of the permute operation. For permuting Rm,n, the shared memory stores the input data using the permuted indices according to the matrix entry Hm,n (using the texture tables for multiplication), and writes the output data directly. For the permutation of Qn,m the shared memory reads the input values directly, but the output is writ- ten using the permuted indices as shown in Fig. 3. This way, one thread (i.e., one of the many processing units of the GPU) per node and per Galois field element can employed, without accessing the same shared memory banks. Thus, no bank conflicts are generated, resulting in full speed memory access. 3.2.2. Signed-Log Domain FFT The implementations of the signed-log-FFT and the inverse signed- log-FFT are identical and follow a butterfly diagram as shown in Fig. 4. As explained in [7], additions in the log domain consist of two
  • 4. IEEE SiPS, Oct. 2013, Taipei, Taiwan 4 / 6 ... ... Max. Node Connections Nodes G alois Field D im ension Shared Mem. Texture Mem. Qn,m Qn,m Fig. 3. Permutation of message vectors Qn,m. separate operations for the magnitude and the sign, which is why two shared memory blocks are required for this purpose. Both operations are executed simultaneously in the CUDA kernel implementation. The signed-log-FFT is applied independently to each LLR-vector of size q. Thus, the CUDA block width is fixed to q. For each node, a loop processes all node connections to compute the signed-log-FFT of the q LLR-vector elements. The execution of the butterfly is su- pervised by a small loop of log2 q iterations. Since each continuous thread accesses a different shared memory bank during the butterfly, no bank conflicts are produced. The exponential and logarithm functions that are needed for the signed-log domain operations [7] are realized by the fast hardware implemented functions which are provided by the GPU. The transfer of data between the global memory and the shared memory is done in a coalesced way, which is ensured by proper indexing. 3.2.3. Check Node Update The implementation is visualized in Fig. 5 and uses one thread per node and Galois field element to compute the outgoing message LLR-vectors for each edge of the corresponding check node. Since the computations are performed in the signed-log domain, magni- tude and sign calculus are performed separately. In each thread, a first loop over the node’s edges calculates the total sum of the mag- nitude values and the total product of the sign values. In a second loop, the current magnitude value is subtracted from the total magni- tude sum to acquire the extrinsic sum, and the total sign is multiplied by the current sign value for each edge. Shared memory is not re- quired since all memory accesses are executed in a coalesced fashion and, thus, provide high throughput performance. NODES NODES ... ... Max. Node Connections Nodes G alois Field D im ension Shared Memory Butterfly NetworkLoopϕn,m Φn,m Fig. 4. Butterfly network for FFT implementation. Max. Variable Node Connections Max. Check Node Connections VariableNodes CheckNodes G alois Field D im ension G alois Field D im ension Loop Φn,m Θm,n Fig. 5. Check node update of message vectors Θm,n. 3.2.4. Variable Node Update The kernel processes one thread per variable node and Galois field dimension (see Fig. 6). An internal loop processes all the node con- nections within one thread. Each thread first reads the input LLR zn into a shared memory block and then adds all extrinsic information from Rm,n for the current edge. In a second shared memory block, the maximum resulting element δn,m of the LLR-vector is found and subtracted from each entry from the first shared memory block before the result is written into Qn,m. Since the shared memory op- erations are inside the loop over the node connections, continuous thread numbers access different shared memory banks in each itera- tion. This avoids bank conflicts and provides maximum performance for the shared memory. 3.2.5. Update Posteriors The implementation of this kernel follows the one for the variable node update, except that the search and subtraction of the maximum element δn,m is omitted and the information of all edges rather than all but one is added to the input LLR to yield the posterior LLR- vectors Qn. NODES VariableNodes G alois Field D im . LoopShared Memoryzn δn,m Rm,n Qn,m Fig. 6. Variable node update of message vectors Qn,m.
  • 5. IEEE SiPS, Oct. 2013, Taipei, Taiwan 5 / 6 0 1 2 3 4 5 0 0.5 1 1.5 2 2.5 0 1 2 3 4 5 0 2 4 6 8 10 12 14 16 dec.throughput[Mbit/s] Eb/N0Eb/N0 GPU CPU F2 F4 F8 F16 F32 F64 F128 F256 dec.throughputperiteration[Mbit/s]Fig. 7. Average total decoder throughput (left) and average throughput per iteration (right) of coded bits over channel quality in Eb/N0 at a maximum of 100 iterations. The zoomed region shows the CPU implementation. 3.2.6. Evaluate Check Equations First, one thread per variable node is employed to determine a hard decision of the node’s posterior LLR-vector by looping over the Ga- lois field dimension to find the maximum element. After this, one thread per check node loops over its edges and utilizes the texture tables for Galois field addition and multiplication to add up the hard decision values. The result is a binary vector of length M. To de- cide if decoding can be stopped, a parallelized reduction algorithm is applied to sum up the elements of this vector in log2 M steps. The parity equations are fully satisfied if the result of the sum is zero. 4. PERFORMANCE EVALUATION The CUDA algorithm implementation is executed under the same settings as the single-core CPU reference implementation, and the decoder throughput is measured at several channel SNRs (Eb/N0 values) and for codes over different Galois field sizes q between 21 and 28 in a binary input AWGN channel environment with BPSK modulation. It should be noted that the obtained results are related to the hardware architecture used in the simulations and will differ if different CPUs and/or GPUs are used. For the experiments of this paper, the following standard hardware and graphic card were used: • CPU: Intel Core i7-920, 2.67GHz, 6GB RAM • GPU: GeForce GTX 580, 512 CUDA cores, 1536 MB RAM, CUDA Compute Capability 2.0. Table 1 shows the different non-binary rate 1 2 LDPC codes, used for obtaining the results in this section. Since one Galois field sym- bol is composed of p bits, the code size N is always chosen accord- ing to N = Nbin /p (with p = log2 q), such that the equivalent num- ber Nbin of bits per code word is constant. Concentrated check node degree distributions have been used for all codes. The non-binary check matrices have been constructed with the PEG algorithm [16], by first constructing a binary matrix of the given dimensions and subsequently replacing all ones by non-zero Galois field elements chosen uniformly at random. A maximum number of 100 decoding iterations have been conducted. For all employed codes, the error rate results of the GPU and the CPU implementation are identical within the boundaries given by the different number representations. The left hand side of Fig. 7 shows the average total throughput of (coded) bits that is processed by one core of the CPU (dashed lines) as well as the GPU implementation (solid lines) over the channel quality in Eb/N0. The right hand side shows the according average throughput per decoding iteration. In both plots, a zoom in shows the curves for the CPU implementation in more detail. The very low total throughput at SNR values below 1 dB is due to the high maximum number of 100 decoding iterations that are al- most always executed, before the waterfall region of the code’s error rate curve is reached. For higher Eb/N0 values, the total through- put steadily increases, since fewer decoding iterations have to be executed. The curves in the right hand side plot, however, steadily decrease with increasing SNR, which is due to the fact that a roughly constant amount of execution time for initialization and all process- ing outside the main decoding loop is needed, independent of the number of actually conducted iterations. Table 1. Dimensions and degree distributions Lixi in node per- spective of employed non-binary rate 1 2 LDPC codes. Degree distri- butions for 2p ∈ {8, 16, 32, 64} were taken from [16] 2p N = Nbin p M L2 L3 L4 L5 2 2304 1152 0.549 0.144 0.001 0.306 4 1152 576 0.537 0.302 0.042 0.119 8 768 384 0.643 0.150 0.194 0.013 16 576 288 0.773 0.102 0.115 0.010 32 462 231 0.848 0.143 0.009 0 64 384 192 0.940 0.050 0.010 0 128 330 165 0.852 0 0.148 0 256 288 144 0.990 0.010 0 0
  • 6. IEEE SiPS, Oct. 2013, Taipei, Taiwan 6 / 6 Table 2. Achieved Speedup throughput GPU throughput CPU for different Fq and Eb/N0 Eb/N0 F2 F4 F8 F16 F32 F64 F128 F256 0 dB 17 18 30 44 61 84 112 128 1 dB 18 20 30 47 66 92 124 148 2 dB 15 17 25 39 56 76 109 113 3 dB 13 15 22 34 48 65 100 100 The speedup achieved by the GPU implementation with respect to the CPU implementation is shown in Tab. 2 for several SNR val- ues. While the speedup hugely increases from below 20 to almost 150 with increasing Galois field dimension, it is not affected too much by the varying channel quality. The increasing speedup is mainly due to the fact that groups of q Galois field elements are stored in continuous memory positions, which leads to increased ef- ficiency of the coalesced memory access. This increasing efficiency of the GPU implementation leads to the interesting effect that the average throughput is not monotonically decreasing with the Galois field dimension, even though the decoding complexity increases as O(Nbin 2p ) where p = log2 q is the number of bits per Galois field symbol. In Fig. 8 the achieved average total throughput is plotted against the number of bits per symbol for different fixed Eb/N0 values. Again, a zoomed region is shown for the CPU implementation. As expected, the throughput of the CPU implementation quickly de- creases with the number of bits per symbol. The GPU implementa- tion, on the other hand, shows an increasing throughput up to around 4 bits per symbol and then also starts to decrease. This behavior can be explained by the fact that for moderate Galois field dimensions, the increasing computational complexity is overcompensated by the more efficient use of the GPU’s capabilities due to the coalesced memory access of q addresses at once. However, for larger Galois field dimensions q ≥ 25 a saturation occurs and the throughput starts to decrease. 5. CONCLUSION One way of improving the performance of LDPC codes is to use codes over higher order Galois Fields Fq and according non-binary decoders. While the signed-log domain FFT decoding algorithm provides increasing error correction performance with increasing Galois field dimension, the computational complexity increases significantly at the same time. In this paper, we have presented an efficient implementation of the signed-log domain FFT decoder exploiting the massive parallel compute capabilities of of today’s GPUs using Nvidia’s CUDA framework. The key strengths of our implementation are the increasing efficiency for increasing Galois field dimensions and the applicability to arbitrary non-binary LDPC codes. While higher speedup can be expected if the codes are con- strained to be, e.g., regular and/or quasi-cyclic, the universality of the presented GPU implementation enables accelerated performance estimation without the need for any specific code properties. 6. REFERENCES [1] R. G. Gallager, Low-Density Parity-Check Codes, M.I.T. Press, Cam- bridge, MA, USA, 1963. [2] D. J. C. MacKay and R. M. Neal, “Good Codes based on Very 2 4 6 81 3 5 7 0.5 1 1.5 0 2 0.12 0.08 0.04 0 dec.throughput[Mbit/s] Bits per symbol p = log2 q GPU CPU 0 dB 1 dB 2 dB 3 dB Fig. 8. Throughput over Galois field dimension for Eb/N0 ∈ {0 dB, 1 dB, 2 dB, 3 dB}. Zoom shows CPU implementation. Sparse Matrices,” in Cryptography and Coding, 5th IMA Conference, Springer, Berlin, Germany, 1995, pp. 100–111. [3] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA, USA, 1988. [4] M. C. Davey and D. J. C. MacKay, “Low-Density Parity Check Codes over GF(q),” IEEE Comm. Lett., vol. 2, no. 6, pp. 165–167, June 1998. [5] L. Barnault and D. Declercq, “Fast Decoding Algorithm for LDPC over GF(2q),” in Proc. IEEE Inform. Theory Workshop (ITW), Apr. 2003, pp. 70–73. [6] H. Song and J.R. Cruz, “Reduced-Complexity Decoding of Q-ary LDPC Codes for Magnetic Recording,” IEEE Trans. Magn., vol. 39, no. 2, pp. 1081–1087, Mar. 2003. [7] G. J. Byers and F. Takawira, “Fourier Transform Decoding of Non- Binary LDPC Codes,” in Proc. Southern African Telecommunication Networks and Applications Conference (SATNAC), Sept. 2004. [8] A. Voicila, D. Declercq, F. Verdier, M. Fossorier, and P. Urard, “Low- Complexity Decoding for Non-Binary LDPC Codes in High Order Fields,” IEEE Trans. Commun., vol. 58, no. 5, pp. 1365–1375, May 2010. [9] T. P. Chen and Y.-K. Chen, “Challenges and Opportunities of Obtaining Performance from Multi-Core CPUs and Many-Core GPUs,” in Proc. IEEE ICASSP, Taipei, Taiwan, Apr. 2009, pp. 609–613. [10] C.-I. Colombo Nilsen and I. Hafizovic, “Digital Beamforming Using a GPU,” in Proc. IEEE ICASSP, Taipei, Taiwan, Apr. 2009, pp. 609–613. [11] T. R. Halfhill, “Parallel Processing with CUDA,” Microprocessor Re- port, Jan. 2008. [12] R. M. Tanner, “A Recursive Approach to Low Complexity Codes,” IEEE Trans. Inform. Theory, vol. 27, no. 5, pp. 533–547, Sept. 1981. [13] H. Wymeersch, H. Steendam, and M. Moeneclaey, “Log-Domain De- coding of LDPC Codes over GF(q),” in Proc. IEEE International Con- ference on Communications, June 2004. [14] Shane Cook, CUDA Programming: A Developer’s Guide to Parallel Computing With GPUs, Morgan Kaufmann, 2012. [15] E. Monzó, “Massive Parallel Decoding of Low-Density Parity-Check Codes Using Graphic Cards,” M.S. thesis, RWTH Aachen University & Universidad Politecnica de Valencia, 2010. [16] X. Y. Hu, E. Eleftheriou, and D. M. Arnold, “Regular and Irregular Progressive Edge-Growth Tanner Graphs,” IEEE Trans. Inform. The- ory, vol. 51, no. 1, pp. 386–398, Jan. 2005.