Parallelization of the Berlekamp-Massey Algorithm Using SIMD Instructions and GPU Streams
Presented by: Hamidreza Mohebbi
Advisor: Ming Ouyang
Topics
• BMA Algorithm
• BMA Implementation using SIMD Instructions
• GPU Parallelization using Streams
• Conclusion
BMA Algorithm
• Linear feedback shift register (LFSR); a minimal code sketch follows this list
• Due to its ease of implementation, the LFSR is widely used
  – Cryptography: GSM cell phones, Bluetooth
  – Scrambling: PCIe, SATA, USB, GbE
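For reference, a minimal Fibonacci-LFSR sketch in C++; the 16-bit width and tap positions here are illustrative only and are not taken from the slide's figure.

#include <cstdint>
#include <cstdio>

// Advance a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, 11
// (polynomial x^16 + x^14 + x^13 + x^11 + 1) and return the output bit.
static std::uint16_t lfsr_step(std::uint16_t* state) {
    std::uint16_t out = *state & 1u;                        // bit shifted out
    std::uint16_t fb  = ((*state >> 0) ^ (*state >> 2) ^
                         (*state >> 3) ^ (*state >> 5)) & 1u;
    *state = static_cast<std::uint16_t>((*state >> 1) | (fb << 15));
    return out;
}

int main() {
    std::uint16_t state = 0xACE1u;                          // any nonzero seed
    for (int i = 0; i < 16; ++i)
        std::printf("%u", static_cast<unsigned>(lfsr_step(&state)));
    std::printf("\n");
    return 0;
}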
BMA Algorithm
• Given a finite binary sequence, find a shortest LFSR that generates the sequence (a sketch of the textbook iteration follows this slide)
• Elwyn Berlekamp
  – A paper/presentation at the International Symposium on Information Theory, Italy, 1967 [1]
  – Algebraic Coding Theory, McGraw-Hill, 1968 [2]
• James Massey
  – "Shift-register synthesis and BCH decoding", IEEE Transactions on Information Theory, 1969 [3]
• Also known as the "LFSR synthesis algorithm" or the "Berlekamp iterative algorithm"
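As a reference point for the implementations discussed next, a minimal sketch of the textbook Berlekamp-Massey iteration over GF(2), one input bit per iteration (function and variable names are ours, not from the thesis code):

#include <cstdint>
#include <vector>

// Textbook Berlekamp-Massey over GF(2): given the bit sequence s, write
// the connection polynomial C(x) into c (c[0] == 1) and return L, the
// length of a shortest LFSR that generates s.
static int berlekamp_massey_gf2(const std::vector<std::uint8_t>& s,
                                std::vector<std::uint8_t>& c) {
    const int n = static_cast<int>(s.size());
    std::vector<std::uint8_t> b(n, 0), t;
    c.assign(n, 0);
    c[0] = b[0] = 1;
    int L = 0, m = -1;                        // m: index of the last length change

    for (int i = 0; i < n; ++i) {
        // discrepancy d = s[i] + sum_{j=1..L} c[j] * s[i-j]   (mod 2)
        std::uint8_t d = s[i];
        for (int j = 1; j <= L; ++j) d ^= c[j] & s[i - j];

        if (d) {
            t = c;                             // save the current C(x)
            for (int j = 0; j + i - m < n; ++j)
                c[j + i - m] ^= b[j];          // C(x) += x^(i-m) * B(x)
            if (2 * L <= i) {                  // the LFSR must grow
                L = i + 1 - L;
                m = i;
                b = t;
            }
        }
    }
    return L;
}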
BMA Algorithm
• Previous work by Prof. Ouyang [4]
  – Reverse S
  – Pack 32 bits into one word
  – Compute the inner product (a sketch of this step follows below)
  – Count the number of 1-bits in a 32-bit word
  – Update C(x)
• The GPU is faster than the CPU reverse-bit version for long inputs (length greater than 2^22)
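A minimal sketch of the packed inner-product step, assuming the 32-bit packing described above (helper names are ours, not from [4]): the GF(2) inner product of S and C(x) is the parity of the number of matching 1-bits.

#include <cstdint>

// Portable popcount; compilers map this pattern (or __builtin_popcount)
// to the POPCNT instruction where available.
static inline std::uint32_t popcount32(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

// GF(2) inner product of two bit vectors packed 32 bits per word.
static std::uint32_t gf2_inner_product(const std::uint32_t* s,
                                       const std::uint32_t* c, int nwords) {
    std::uint32_t ones = 0;
    for (int w = 0; w < nwords; ++w)
        ones += popcount32(s[w] & c[w]);   // count matching 1-bits
    return ones & 1u;                      // parity = inner product mod 2
}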
BMA Implementation using SIMD Instructions
• A data-parallel architecture
• Applies the same instruction to many data elements
  – Saves control logic
  – A related architecture is the vector architecture
  – SIMD and vector architectures offer high performance for vector operations
• SSE (vectors of four 32-bit elements)
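A minimal illustration of the 4-wide idea with SSE2 intrinsics (not the presentation's actual code): a single instruction XORs four 32-bit words at once.

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

// One PXOR instruction combines four packed 32-bit words of a with
// four packed 32-bit words of b.
void xor4(const std::uint32_t* a, const std::uint32_t* b, std::uint32_t* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_xor_si128(va, vb));
}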
BMA Implementation using SIMD Instructions
• BMA-SSE
  – Uses SSE instructions for computation and data copying
  – Computes four consecutive elements at a time (see the sketch after this list)
  – The number of iterations is less than the input length (1/4 of it in the best case); it depends on the input string
• BMA-AVX
  – Uses AVX instructions for computation and data copying
  – Computes eight consecutive elements at a time
  – The number of iterations is less than the input length (1/8 of it in the best case); it depends on the input string
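A hedged sketch of how the discrepancy inner product can be computed four packed words at a time with SSE (structure and names are ours; it assumes SSE4.2 POPCNT is available). The AVX variant is the same idea over eight words using 256-bit _mm256_* intrinsics.

#include <emmintrin.h>   // SSE2
#include <nmmintrin.h>   // SSE4.2 _mm_popcnt_u32
#include <cstdint>

// GF(2) inner product of packed S and C(x), four 32-bit words per step:
// one AND covers four words, then the four lanes are popcounted.
static std::uint32_t discrepancy_sse(const std::uint32_t* s,
                                     const std::uint32_t* c, int nwords) {
    std::uint32_t ones = 0;
    int w = 0;
    for (; w + 4 <= nwords; w += 4) {
        __m128i vs = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + w));
        __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c + w));
        __m128i v  = _mm_and_si128(vs, vc);            // one AND for 4 words
        alignas(16) std::uint32_t lane[4];
        _mm_store_si128(reinterpret_cast<__m128i*>(lane), v);
        ones += _mm_popcnt_u32(lane[0]) + _mm_popcnt_u32(lane[1])
              + _mm_popcnt_u32(lane[2]) + _mm_popcnt_u32(lane[3]);
    }
    for (; w < nwords; ++w)                            // scalar tail
        ones += _mm_popcnt_u32(s[w] & c[w]);
    return ones & 1u;                                  // parity mod 2
}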
BMA Implementation using SIMD Instructions
[Chart: execution time (sec) vs. input size (2^10 to 2^24) for cpuBit and BMASSE; random input, compile option -O3]
BMA Implementation using SIMD Instructions
[Chart: execution time (sec) vs. input size (2^10 to 2^24) for cpuBit, BMASSE, and BMAAVX; random input, compile option -O1]
GPU Parallelization using Streams
• Original BMA using GPU kernels:
  – K1, K2, K3, and K4 are kernel functions
  – x_i and k_i are the execution times of the serial parts and the kernels, respectively
[Timeline: Serial (x1) → K1 (k1) → Serial (x2) → K2 (k2) → Serial (x3) → K3 (k3) → Serial (x4) → K4 (k4) → Serial (x5)]
GPU Parallelization using Streams
• Ideal BMA using concurrent kernel execution
[Timeline: n copies of the Serial → K1 → Serial → K2 → Serial → K3 → Serial → K4 → Serial pipeline run concurrently; the serial parts total n·x1, n·x2, n·x3, n·x4, n·x5, while the kernels k1 through k4 overlap across inputs]
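A minimal CUDA sketch of the layout in the diagram, assuming n independent inputs already resident on the device; the kernel names match the diagram, but their bodies and launch dimensions are placeholders.

#include <cstdint>
#include <cuda_runtime.h>

// Placeholder kernels matching the diagram; the real kernel bodies are
// not shown in the slides, so they are omitted here.
__global__ void K1(std::uint32_t* buf, int len) { /* body omitted */ }
__global__ void K2(std::uint32_t* buf, int len) { /* body omitted */ }
__global__ void K3(std::uint32_t* buf, int len) { /* body omitted */ }
__global__ void K4(std::uint32_t* buf, int len) { /* body omitted */ }

// Launch the four-kernel pipeline for n independent inputs, assigning
// inputs to S CUDA streams round-robin so kernels from different inputs
// can execute concurrently.  d_buf[i] is input i, already on the device.
void run_concurrent(std::uint32_t** d_buf, int len, int n, int S) {
    cudaStream_t* streams = new cudaStream_t[S];
    for (int i = 0; i < S; ++i) cudaStreamCreate(&streams[i]);

    const dim3 grid(128), block(256);                  // illustrative sizes
    for (int i = 0; i < n; ++i) {
        cudaStream_t st = streams[i % S];              // round-robin stream
        K1<<<grid, block, 0, st>>>(d_buf[i], len);
        K2<<<grid, block, 0, st>>>(d_buf[i], len);
        K3<<<grid, block, 0, st>>>(d_buf[i], len);
        K4<<<grid, block, 0, st>>>(d_buf[i], len);
    }
    cudaDeviceSynchronize();                           // wait for all streams

    for (int i = 0; i < S; ++i) cudaStreamDestroy(streams[i]);
    delete[] streams;
}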
GPU Parallelization using Streams
• The total time of running n inputs serially is:
• The time of the ideal concurrent program is:
• The ideal speedup is:
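The formulas themselves did not survive extraction; the following is a hedged reconstruction from the timing diagrams above, where x_1..x_5 are the serial-phase times, k_1..k_4 the kernel times, and n the number of independent inputs:

  T_{\mathrm{serial}} = n\left(\sum_{i=1}^{5} x_i + \sum_{i=1}^{4} k_i\right)

  T_{\mathrm{ideal}} = n\sum_{i=1}^{5} x_i + \sum_{i=1}^{4} k_i \quad\text{(kernels of different inputs overlap completely)}

  \mathrm{Speedup}_{\mathrm{ideal}} = \frac{T_{\mathrm{serial}}}{T_{\mathrm{ideal}}} = \frac{n\left(\sum_i x_i + \sum_i k_i\right)}{n\sum_i x_i + \sum_i k_i}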
GPU Parallelization using Streams
• In the practical scenario, the hardware supports only S concurrent streams, so the concurrent execution time is:
• The real speedup is:
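Again the formulas are missing from the extracted text; a hedged reconstruction, with S the number of concurrent streams the hardware supports (the kernels run in roughly ceil(n/S) overlapping batches):

  T_{\mathrm{streams}} = n\sum_{i=1}^{5} x_i + \left\lceil \frac{n}{S} \right\rceil \sum_{i=1}^{4} k_i

  \mathrm{Speedup}_{\mathrm{real}} = \frac{T_{\mathrm{serial}}}{T_{\mathrm{streams}}}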
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2 to 64) for BMASSE-Serial, BMAStream-1, BMAStream-2, BMAStream-4, BMAStream-8, BMAStream-16, BMAStream-32, BMAStream-64; input length 2^16]
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2 to 64) for BMASSE-Serial, BMAStream-1, BMAStream-2, BMAStream-4, BMAStream-8, BMAStream-16, BMAStream-32, BMAStream-64; input length 2^20]
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2 to 64) for BMASSE-Serial, BMAStream-1, BMAStream-2, BMAStream-4, BMAStream-8, BMAStream-16, BMAStream-32, BMAStream-64; input length 2^22]
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2 to 64) for BMASSE-Serial, BMAStream-1, BMAStream-2, BMAStream-4, BMAStream-8, BMAStream-16, BMAStream-32, BMAStream-64; input length 2^23]
Conclusion
• For input lengths below 2^23, the best performance belongs to the SSE implementation on the CPU.
• Using more than two streams reduces the execution time; at input length 2^23 it is faster than the SSE implementation, the bit implementation, and the single-stream version.
• Among the GPU and SIMD implementations, the best performance for input lengths above 2^23 belongs to the GPU version with 32 concurrent streams.
References
• [1] Berlekamp, Elwyn R., "Nonbinary BCH decoding", International Symposium on Information Theory, San Remo, Italy, 1967.
• [2] Berlekamp, Elwyn R., Algebraic Coding Theory, McGraw-Hill, New York, NY, 1968; reprinted by Aegean Park Press, Laguna Hills, CA, ISBN 0-89412-063-8.
• [3] Massey, J. L., "Shift-register synthesis and BCH decoding", IEEE Transactions on Information Theory, IT-15 (1): 122-127, 1969.
• [4] Ali, H., Ouyang, M., Sheta, W., Soliman, A., "Parallelizing the Berlekamp-Massey Algorithm", Proceedings of the Second International Conference on Computing, Measurement, Control and Sensor Network (CMCSN), 2014.
