Parallelization of the Berlekamp-Massey Algorithm Using SIMD Instructions and GPU Streams
Presented by: Hamidreza Mohebbi
Advisor: Ming Ouyang
Topics
• BMA Algorithm
• BMA Implementation using SIMD Instructions
• GPU Parallelization using Streams
• Conclusion
BMA Algorithm
• Linear feedback shift register (LFSR); a minimal code sketch follows this list
• Due to its ease of implementation, the LFSR is widely used
  – Cryptography: GSM cell phones, Bluetooth
  – Scrambling: PCIe, SATA, USB, GbE
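For reference, a minimal Fibonacci-LFSR sketch in C++; the 16-bit width and tap positions here are illustrative only and are not taken from the slide's figure.

#include <cstdint>
#include <cstdio>

// Advance a 16-bit Fibonacci LFSR with taps at bits 16, 14, 13, 11
// (polynomial x^16 + x^14 + x^13 + x^11 + 1) and return the output bit.
static std::uint16_t lfsr_step(std::uint16_t* state) {
    std::uint16_t out = *state & 1u;                        // bit shifted out
    std::uint16_t fb  = ((*state >> 0) ^ (*state >> 2) ^
                         (*state >> 3) ^ (*state >> 5)) & 1u;
    *state = static_cast<std::uint16_t>((*state >> 1) | (fb << 15));
    return out;
}

int main() {
    std::uint16_t state = 0xACE1u;                          // any nonzero seed
    for (int i = 0; i < 16; ++i)
        std::printf("%u", static_cast<unsigned>(lfsr_step(&state)));
    std::printf("\n");
    return 0;
}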
BMA Algorithm
• Given a finite binary sequence, find a shortest LFSR that generates the sequence (a sketch of the textbook iteration follows this slide)
• Elwyn Berlekamp
  – A paper/presentation at the International Symposium on Information Theory, Italy, 1967 [1]
  – Algebraic Coding Theory, McGraw-Hill, 1968 [2]
• James Massey
  – "Shift-register synthesis and BCH decoding", IEEE Transactions on Information Theory, 1969 [3]
• Also known as the "LFSR synthesis algorithm" or the "Berlekamp iterative algorithm"
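As a reference point for the implementations discussed next, a minimal sketch of the textbook Berlekamp-Massey iteration over GF(2), one input bit per iteration (function and variable names are ours, not from the thesis code):

#include <cstdint>
#include <vector>

// Textbook Berlekamp-Massey over GF(2): given the bit sequence s, write
// the connection polynomial C(x) into c (c[0] == 1) and return L, the
// length of a shortest LFSR that generates s.
static int berlekamp_massey_gf2(const std::vector<std::uint8_t>& s,
                                std::vector<std::uint8_t>& c) {
    const int n = static_cast<int>(s.size());
    std::vector<std::uint8_t> b(n, 0), t;
    c.assign(n, 0);
    c[0] = b[0] = 1;
    int L = 0, m = -1;                        // m: index of the last length change

    for (int i = 0; i < n; ++i) {
        // discrepancy d = s[i] + sum_{j=1..L} c[j] * s[i-j]   (mod 2)
        std::uint8_t d = s[i];
        for (int j = 1; j <= L; ++j) d ^= c[j] & s[i - j];

        if (d) {
            t = c;                             // save the current C(x)
            for (int j = 0; j + i - m < n; ++j)
                c[j + i - m] ^= b[j];          // C(x) += x^(i-m) * B(x)
            if (2 * L <= i) {                  // the LFSR must grow
                L = i + 1 - L;
                m = i;
                b = t;
            }
        }
    }
    return L;
}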
BMA Algorithm
• Previous work by Prof. Ouyang [4]
  – Reverse S
  – Pack 32 bits into one word
  – Compute the inner product (a sketch of this step follows below)
  – Count the number of 1-bits in a 32-bit word
  – Update C(x)
• The GPU is faster than the CPU reverse-bit version for long inputs (length greater than 2^22)
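A minimal sketch of the packed inner-product step, assuming the 32-bit packing described above (helper names are ours, not from [4]): the GF(2) inner product of S and C(x) is the parity of the number of matching 1-bits.

#include <cstdint>

// Portable popcount; compilers map this pattern (or __builtin_popcount)
// to the POPCNT instruction where available.
static inline std::uint32_t popcount32(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

// GF(2) inner product of two bit vectors packed 32 bits per word.
static std::uint32_t gf2_inner_product(const std::uint32_t* s,
                                       const std::uint32_t* c, int nwords) {
    std::uint32_t ones = 0;
    for (int w = 0; w < nwords; ++w)
        ones += popcount32(s[w] & c[w]);   // count matching 1-bits
    return ones & 1u;                      // parity = inner product mod 2
}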
BMA Implementation using SIMD Instructions
• A data-parallel architecture
• Applies the same instruction to many data elements
  – Saves control logic
  – A related architecture is the vector architecture
  – SIMD and vector architectures offer high performance for vector operations
• SSE (vectors of four 32-bit elements)
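A minimal illustration of the 4-wide idea with SSE2 intrinsics (not the presentation's actual code): a single instruction XORs four 32-bit words at once.

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

// One PXOR instruction combines four packed 32-bit words of a with
// four packed 32-bit words of b.
void xor4(const std::uint32_t* a, const std::uint32_t* b, std::uint32_t* out) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_xor_si128(va, vb));
}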
BMA Implementation using SIMD Instructions
• BMA-SSE
  – Uses SSE instructions for computation and data copying
  – Computes four consecutive elements at a time (see the sketch after this list)
  – The number of iterations is less than the input length (1/4 of it in the best case); it depends on the input string
• BMA-AVX
  – Uses AVX instructions for computation and data copying
  – Computes eight consecutive elements at a time
  – The number of iterations is less than the input length (1/8 of it in the best case); it depends on the input string
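A hedged sketch of how the discrepancy inner product can be computed four packed words at a time with SSE (structure and names are ours; it assumes SSE4.2 POPCNT is available). The AVX variant is the same idea over eight words using 256-bit _mm256_* intrinsics.

#include <emmintrin.h>   // SSE2
#include <nmmintrin.h>   // SSE4.2 _mm_popcnt_u32
#include <cstdint>

// GF(2) inner product of packed S and C(x), four 32-bit words per step:
// one AND covers four words, then the four lanes are popcounted.
static std::uint32_t discrepancy_sse(const std::uint32_t* s,
                                     const std::uint32_t* c, int nwords) {
    std::uint32_t ones = 0;
    int w = 0;
    for (; w + 4 <= nwords; w += 4) {
        __m128i vs = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + w));
        __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c + w));
        __m128i v  = _mm_and_si128(vs, vc);            // one AND for 4 words
        alignas(16) std::uint32_t lane[4];
        _mm_store_si128(reinterpret_cast<__m128i*>(lane), v);
        ones += _mm_popcnt_u32(lane[0]) + _mm_popcnt_u32(lane[1])
              + _mm_popcnt_u32(lane[2]) + _mm_popcnt_u32(lane[3]);
    }
    for (; w < nwords; ++w)                            // scalar tail
        ones += _mm_popcnt_u32(s[w] & c[w]);
    return ones & 1u;                                  // parity mod 2
}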
BMA Implementation using SIMD Instructions
[Chart: execution time (sec) vs. input size (2^10 to 2^24) for cpuBit and BMASSE; random input, compile option -O3]
BMA Implementation using SIMD Instructions
[Chart: execution time (sec) vs. input size (2^10 to 2^24) for cpuBit, BMASSE, and BMAAVX; random input, compile option -O1]
GPU Parallelization using Streams
• Original BMA using GPU kernels:
  – K1, K2, K3, and K4 are kernel functions
  – x_i and k_i are the execution times of the serial parts and the kernels, respectively
[Timeline: Serial (x1) → K1 (k1) → Serial (x2) → K2 (k2) → Serial (x3) → K3 (k3) → Serial (x4) → K4 (k4) → Serial (x5)]
GPU Parallelization using Streams
• Ideal BMA using concurrent kernel execution
[Timeline: n copies of the Serial → K1 → Serial → K2 → Serial → K3 → Serial → K4 → Serial pipeline run concurrently; the serial parts total n·x1, n·x2, n·x3, n·x4, n·x5, while the kernels k1 through k4 overlap across inputs]
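A minimal CUDA sketch of the layout in the diagram, assuming n independent inputs already resident on the device; the kernel names match the diagram, but their bodies and launch dimensions are placeholders.

#include <cstdint>
#include <cuda_runtime.h>

// Placeholder kernels matching the diagram; the real kernel bodies are
// not shown in the slides, so they are omitted here.
__global__ void K1(std::uint32_t* buf, int len) { /* body omitted */ }
__global__ void K2(std::uint32_t* buf, int len) { /* body omitted */ }
__global__ void K3(std::uint32_t* buf, int len) { /* body omitted */ }
__global__ void K4(std::uint32_t* buf, int len) { /* body omitted */ }

// Launch the four-kernel pipeline for n independent inputs, assigning
// inputs to S CUDA streams round-robin so kernels from different inputs
// can execute concurrently.  d_buf[i] is input i, already on the device.
void run_concurrent(std::uint32_t** d_buf, int len, int n, int S) {
    cudaStream_t* streams = new cudaStream_t[S];
    for (int i = 0; i < S; ++i) cudaStreamCreate(&streams[i]);

    const dim3 grid(128), block(256);                  // illustrative sizes
    for (int i = 0; i < n; ++i) {
        cudaStream_t st = streams[i % S];              // round-robin stream
        K1<<<grid, block, 0, st>>>(d_buf[i], len);
        K2<<<grid, block, 0, st>>>(d_buf[i], len);
        K3<<<grid, block, 0, st>>>(d_buf[i], len);
        K4<<<grid, block, 0, st>>>(d_buf[i], len);
    }
    cudaDeviceSynchronize();                           // wait for all streams

    for (int i = 0; i < S; ++i) cudaStreamDestroy(streams[i]);
    delete[] streams;
}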
GPU Parallelization using Streams
• The total time of running n inputs serially is:
• The time of the ideal concurrent program is:
• The ideal speedup is:
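The formulas themselves did not survive extraction; the following is a hedged reconstruction from the timing diagrams above, where x_1..x_5 are the serial-phase times, k_1..k_4 the kernel times, and n the number of independent inputs:

  T_{\mathrm{serial}} = n\left(\sum_{i=1}^{5} x_i + \sum_{i=1}^{4} k_i\right)

  T_{\mathrm{ideal}} = n\sum_{i=1}^{5} x_i + \sum_{i=1}^{4} k_i \quad\text{(kernels of different inputs overlap completely)}

  \mathrm{Speedup}_{\mathrm{ideal}} = \frac{T_{\mathrm{serial}}}{T_{\mathrm{ideal}}} = \frac{n\left(\sum_i x_i + \sum_i k_i\right)}{n\sum_i x_i + \sum_i k_i}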
GPU Parallelization using Streams
• In the practical scenario, the hardware supports only S concurrent streams, so the concurrent execution time is:
• The real speedup is:
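Again the formulas are missing from the extracted text; a hedged reconstruction, with S the number of concurrent streams the hardware supports (the kernels run in roughly ceil(n/S) overlapping batches):

  T_{\mathrm{streams}} = n\sum_{i=1}^{5} x_i + \left\lceil \frac{n}{S} \right\rceil \sum_{i=1}^{4} k_i

  \mathrm{Speedup}_{\mathrm{real}} = \frac{T_{\mathrm{serial}}}{T_{\mathrm{streams}}}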
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2 to 64) for BMASSE-Serial, BMAStream-1, BMAStream-2, BMAStream-4, BMAStream-8, BMAStream-16, BMAStream-32, BMAStream-64; input length 2^16]
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2 to 64) for BMASSE-Serial, BMAStream-1, BMAStream-2, BMAStream-4, BMAStream-8, BMAStream-16, BMAStream-32, BMAStream-64; input length 2^20]
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2 to 64) for BMASSE-Serial, BMAStream-1, BMAStream-2, BMAStream-4, BMAStream-8, BMAStream-16, BMAStream-32, BMAStream-64; input length 2^22]
GPU Parallelization using Streams
[Chart: execution time (sec) vs. number of inputs (2 to 64) for BMASSE-Serial, BMAStream-1, BMAStream-2, BMAStream-4, BMAStream-8, BMAStream-16, BMAStream-32, BMAStream-64; input length 2^23]
Conclusion
• For input lengths below 2^23, the best performance belongs to the SSE implementation on the CPU.
• Using more than two streams reduces the execution time; at input length 2^23 it is faster than the SSE implementation, the bit implementation, and the single-stream version.
• Among the GPU and SIMD implementations, the best performance for input lengths above 2^23 belongs to the GPU version with 32 concurrent streams.
References
• [1] Berlekamp, Elwyn R., "Nonbinary BCH decoding", International Symposium on Information Theory, San Remo, Italy, 1967.
• [2] Berlekamp, Elwyn R., Algebraic Coding Theory, McGraw-Hill, New York, NY, 1968; reprinted by Aegean Park Press, Laguna Hills, CA, ISBN 0-89412-063-8.
• [3] Massey, J. L., "Shift-register synthesis and BCH decoding", IEEE Transactions on Information Theory, IT-15 (1): 122-127, 1969.
• [4] Ali, H., Ouyang, M., Sheta, W., Soliman, A., "Parallelizing the Berlekamp-Massey Algorithm", Proceedings of the Second International Conference on Computing, Measurement, Control and Sensor Network (CMCSN), 2014.
