AES on modern GPUs

AES encryption using GPU architectures

Author: Grigore Lupescu
Scientific Advisor: Emil Slusanschi
Politehnica University of Bucharest, Automatic Control and Computers Faculty, Computer Science Department
Scientific Student Projects Session - 17 May 2014

AES Encryption (1)
• A mode of operation is an algorithm that repeatedly applies a block cipher (e.g. AES) to the input plaintext
• Most modes of operation require an initialization vector
• Most used cipher modes: Cipher-block chaining (CBC), Counter (CTR)
• Other cipher modes: Electronic codebook (ECB), Output feedback (OFB)
• Why use ECB?
  • Simple, fast, highly parallelizable, maximum throughput
  • Provides a good estimate of how CTR would perform
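
The claim that ECB throughput approximates CTR throughput follows from how CTR builds its cipher input: the block fed to AES depends only on a nonce and the block index, so every block can be processed independently, as in ECB. A minimal C sketch, assuming a 12-byte nonce followed by a 32-bit big-endian counter (one common layout; the slides do not specify one). The function name `ctr_counter_block` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Build the 16-byte AES input for block number `index` in CTR mode.
 * Assumed layout (not taken from the slides): 12-byte nonce followed
 * by a 32-bit big-endian block counter. */
static void ctr_counter_block(const uint8_t nonce[12], uint32_t index,
                              uint8_t out[16])
{
    memcpy(out, nonce, 12);
    out[12] = (uint8_t)(index >> 24);
    out[13] = (uint8_t)(index >> 16);
    out[14] = (uint8_t)(index >> 8);
    out[15] = (uint8_t)(index);
}
```

Under this layout, each GPU work-item could encrypt its own counter block with the same AES rounds used for ECB and XOR the result with its plaintext block, so the per-block cost would be roughly the ECB cost plus one XOR.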

AES Encryption (2)
• KeyExpansion: round keys are derived from the cipher key.
• InitialRound: AddRoundKey.
• Rounds:
  • SubBytes: substitution step where each byte is replaced with another according to the Sbox table.
  • ShiftRows: transposition step where the last three rows of the state are shifted cyclically.
  • MixColumns: mixing operation on the columns of the state; addition and multiplication are defined in the Galois field GF(2^8).
  • AddRoundKey: bitwise XOR of each byte of the state with the round key.
• Final round: SubBytes, ShiftRows, AddRoundKey (no MixColumns).
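
The two simple round steps can be sketched in plain C as follows. This is an illustrative host-side version, not the authors' OpenCL kernel; it assumes the usual column-major state layout (byte index = 4*column + row), and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* AES state: 16 bytes in column-major order, index = 4*col + row. */

/* ShiftRows: row r is rotated left by r positions (row 0 is unchanged). */
static void shift_rows(uint8_t s[16])
{
    uint8_t t[16];
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            t[4 * c + r] = s[4 * ((c + r) % 4) + r];
    memcpy(s, t, 16);
}

/* AddRoundKey: bytewise XOR of the state with one 16-byte round key. */
static void add_round_key(uint8_t s[16], const uint8_t rk[16])
{
    for (int i = 0; i < 16; i++)
        s[i] ^= rk[i];
}
```

With this layout ShiftRows is a fixed byte permutation whose source indices begin 0, 5, A, F, 4, 9, E, 3, 8, which is why it maps naturally onto a single vector component selection on a 16-byte vector type.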

Target System (1)
• SoC CPU: AMD A4 4000K (2 cores @ 3.0 GHz, Richland architecture, AES-NI); cores denoted by BLUE
• SoC integrated GPU HD7480 (iGPU): 2 SIMD units of 64 cores each (VLIW4 architecture); SIMD units denoted by RED
• Discrete GPU AMD R7 250 (dGPU): 6 SIMD units of 64 cores each (GCN architecture), PCIe 2.0 x16 bus; SIMD units denoted by RED
• Data to be encrypted denoted by GREEN
• Software: C/C++/OpenCL, Linux Ubuntu 14.04 x64

Target System (2)

Algorithm Opt_1
• Array “indata” resides in global device memory (__global)
• Variable “state”, which holds the intermediate transformations, resides in on-chip local memory (__local), used here as a cache
• Simple operation “ShiftRows” is implemented with vector component addressing (state.s05AF49E38..)
• Simple operation “AddRoundKey” is a single XOR (state ^ key)
• Complex operation “SubBytes” uses a precomputed Sbox table, stored in constant memory
• Complex operation “MixColumns” uses precomputed Galois_FiniteField (finite-field multiplication) tables, stored in constant memory
• Host sample code below (simple blocking enqueues):
    while (!done()) {
        writeData(32MB, &offset); execKernel(32MB, &offset);
        readData(32MB, &offset);
    }
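
For illustration, the Sbox table that SubBytes reads could be generated on the host before being uploaded to the device. The sketch below derives it from its definition (multiplicative inverse in GF(2^8) followed by the AES affine transform). The slides do not say how the authors produced their table, and many implementations simply hard-code it; the names `gf_mul`, `rotl8` and `build_sbox` are hypothetical.

```c
#include <stdint.h>

/* Multiply two elements of GF(2^8) modulo the AES polynomial
 * x^8 + x^4 + x^3 + x + 1 (shift-and-add, reducing with 0x1b). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return p;
}

/* Rotate an 8-bit value left by n bits (0 < n < 8). */
static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Fill sbox[256]: multiplicative inverse (0 maps to 0), then the
 * AES affine transform with constant 0x63. The inverse is found by
 * brute force, which is fine for a one-time 256-entry table. */
static void build_sbox(uint8_t sbox[256])
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0;
        for (int y = 1; y < 256 && x != 0; y++) {
            if (gf_mul((uint8_t)x, (uint8_t)y) == 1) {
                inv = (uint8_t)y;
                break;
            }
        }
        sbox[x] = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                            rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
    }
}
```

The resulting 256-byte table is small enough to fit in either constant or local memory, which is what makes the placement choice in the following slides a pure performance question.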

Results Opt_1
• AMD CodeXL profiling, initial results: iGPU A4 4000, ~100 MB/sec AES ECB128

Algorithm Opt_2
• Array “indata” resides in global device memory (__global)
• Variable “state”, which holds the intermediate transformations, resides in on-chip local memory (__local), used here as a cache
• Simple operation “ShiftRows”: unchanged
• Simple operation “AddRoundKey”: unchanged
• Complex operation “SubBytes” uses a precomputed Sbox table, now stored in local memory (__local) instead of constant memory
• Complex operation “MixColumns” computes values on the fly instead of using precomputed tables (an optimized version of MixColumns is used)
• Host sample code: unchanged
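
As a sketch of what computing MixColumns on the fly can look like, the C code below processes one column using only XORs and a multiply-by-2 helper, with no lookup tables. The slides do not show which optimized variant the authors used; this is one common formulation, and the names `xtime` and `mix_column` are hypothetical.

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8) modulo the AES polynomial (reduce with 0x1b). */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* MixColumns on one 4-byte column, computed without lookup tables.
 * Each output byte is 2*a[i] ^ 3*a[i+1] ^ a[i+2] ^ a[i+3], rewritten
 * so that only one xtime call per byte is needed. */
static void mix_column(uint8_t c[4])
{
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    uint8_t all = (uint8_t)(a0 ^ a1 ^ a2 ^ a3);
    c[0] = (uint8_t)(a0 ^ all ^ xtime((uint8_t)(a0 ^ a1)));
    c[1] = (uint8_t)(a1 ^ all ^ xtime((uint8_t)(a1 ^ a2)));
    c[2] = (uint8_t)(a2 ^ all ^ xtime((uint8_t)(a2 ^ a3)));
    c[3] = (uint8_t)(a3 ^ all ^ xtime((uint8_t)(a3 ^ a0)));
}
```

This trades memory lookups for a handful of ALU operations per byte, which is the kind of trade-off the table-versus-computation comparison in this deck is about.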

Results Opt_2
• Profiling, Opt_1: iGPU A4 4000, ~100 MB/sec AES ECB128
• Profiling, Opt_2: iGPU A4 4000, ~210 MB/sec AES ECB128

Algorithm Opt_3
• Array “indata” resides in global device memory (__global)
• Variable “state”, which holds the intermediate transformations, resides in on-chip local memory (__local), used here as a cache
• Simple operation “ShiftRows”: unchanged
• Simple operation “AddRoundKey”: unchanged
• Complex operation “SubBytes”: unchanged
• Complex operation “MixColumns”: unchanged
• Host sample code: overlaps execution with I/O by creating multiple queues (R, W, E: read, write, execute)
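
To make the effect of the three queues concrete, the following C sketch is an idealized scheduling model, not OpenCL host code. It assumes each queue processes chunks in order, that chunk i's kernel waits for its write and its read waits for its kernel, and that the three stages can run fully concurrently (which depends on hardware and driver). The stage durations are arbitrary time units, not measurements from the slides, and the function names are hypothetical.

```c
#include <stdint.h>

/* Serial host loop: each chunk is written, executed and read back
 * before the next chunk starts. */
static uint64_t serial_time(uint64_t n, uint64_t w, uint64_t e, uint64_t r)
{
    return n * (w + e + r);
}

/* Three in-order queues (W, E, R) with per-chunk dependencies:
 * write_i -> exec_i -> read_i. Tracks when each queue finishes
 * its latest chunk and returns the finish time of the last read. */
static uint64_t overlapped_time(uint64_t n, uint64_t w, uint64_t e, uint64_t r)
{
    uint64_t w_end = 0, e_end = 0, r_end = 0;
    for (uint64_t i = 0; i < n; i++) {
        w_end += w;
        e_end = (w_end > e_end ? w_end : e_end) + e;
        r_end = (e_end > r_end ? e_end : r_end) + r;
    }
    return r_end;
}
```

For example, with 4 chunks and stage times of 2, 3 and 1 units, the serial loop takes 24 units while the idealized pipeline takes 15. Real gains are smaller when the stages cannot fully overlap, for instance when transfers and kernels compete for the same memory bandwidth.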

Algorithm Opt_3 (2)

Results Opt_3
• Figure (right): AES ECB128 throughput in MB/sec, serial (Opt_2) vs. overlapped (Opt_3)
• Figure (below): three OpenCL queues (R, W, E) used for asynchronous enqueues, to overlap execution with I/O

Conclusions
• iGPU AES performance is good: faster than the CPU without AES-NI, although the CPU with AES-NI is fastest
• Prefer local memory (__local, used as a cache) over constant memory for lookup tables
• Where possible, compare precomputed tables against computing values on the fly
• Overlapping execution with I/O could improve iGPU performance by 10-20%
• The share of the x86 SoC die occupied by the iGPU increases with each generation, so its contribution to AES throughput is expected to increase as well
• Memory transfers are expected to improve with each new generation, and with them CPU/iGPU performance
