2. AES Encryption (1)
17.05.2014 Scientific Student Projects Session - May 2014 2
Mode of operation: algorithm that repeatedly applies a block cipher (e.g. AES) to the input plaintext
Most operation modes require an initialization vector
Most used cipher modes: Cipher-block chaining (CBC), Counter (CTR)
Other cipher modes: Electronic codebook (ECB), Output feedback (OFB)
Why use ECB?
Simple, fast, highly parallelizable, maximum throughput
Provides a good estimate of how CTR would perform
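The relation between ECB and CTR can be sketched in C as follows. This is only an illustration: `toy_block_encrypt` is a hypothetical stand-in for one AES block encryption (it is not a real cipher), and the counter layout (block index added big-endian to the last 8 bytes of the IV) is an assumption. Both modes make exactly one block-cipher call per 16-byte block and no block depends on another, which is why ECB throughput is a reasonable proxy for CTR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Hypothetical stand-in for one AES block encryption; NOT a real cipher. */
static void toy_block_encrypt(const uint8_t in[BLOCK], const uint8_t key[BLOCK],
                              uint8_t out[BLOCK])
{
    for (int i = 0; i < BLOCK; i++)
        out[i] = (uint8_t)((in[i] ^ key[i]) + 0x3D);
}

/* ECB: each block is processed independently, so iterations can run in parallel. */
static void ecb_encrypt(const uint8_t *in, uint8_t *out, size_t nblocks,
                        const uint8_t key[BLOCK])
{
    assert(in != NULL && out != NULL);
    for (size_t b = 0; b < nblocks; b++)
        toy_block_encrypt(in + b * BLOCK, key, out + b * BLOCK);
}

/* CTR: encrypt (iv + block index), then XOR the keystream with the data.
   The same routine encrypts and decrypts. */
static void ctr_crypt(const uint8_t *in, uint8_t *out, size_t nblocks,
                      const uint8_t key[BLOCK], const uint8_t iv[BLOCK])
{
    assert(in != NULL && out != NULL);
    for (size_t b = 0; b < nblocks; b++) {
        uint8_t counter[BLOCK], keystream[BLOCK];
        memcpy(counter, iv, BLOCK);
        /* Add the block index to the last 8 bytes, big-endian. */
        uint64_t carry = b;
        for (int i = BLOCK - 1; i >= 8; i--) {
            carry += counter[i];
            counter[i] = (uint8_t)carry;
            carry >>= 8;
        }
        toy_block_encrypt(counter, key, keystream);
        for (int i = 0; i < BLOCK; i++)
            out[b * BLOCK + i] = in[b * BLOCK + i] ^ keystream[i];
    }
}
```

With two identical plaintext blocks, `ecb_encrypt` yields two identical ciphertext blocks, while `ctr_crypt` does not; the per-block work is otherwise the same apart from the final XOR.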
3. AES Encryption (2)
KeyExpansion: round keys are derived from the cipher key.
InitialRound: (AddRoundKey)
Rounds:
SubBytes - substitution step where each byte is replaced with another
according to the S-box table.
ShiftRows - transposition step where the last three rows of the state are
shifted cyclically.
MixColumns - mixing operation that works on the columns of the
state. Addition and multiplication (+, *) are defined over the Galois field GF(2^8).
AddRoundKey - bitwise XOR of each byte of the state with the round key.
Final Round: (SubBytes, ShiftRows, AddRoundKey), i.e. a round without MixColumns.
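The Galois-field arithmetic behind SubBytes and MixColumns can be sketched in C as below. This is a minimal, unoptimized illustration computed on the fly rather than from tables; the function names are hypothetical and not taken from the project. Multiplication is done modulo the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B), and the S-box value is derived as the field inverse followed by the standard affine transform.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by x (i.e. by 2) in GF(2^8), reducing by the AES polynomial. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* General GF(2^8) multiplication by shift-and-add. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

/* Multiplicative inverse as a^254; maps 0 to 0, as AES requires. */
static uint8_t gf_inv(uint8_t a)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++) r = gf_mul(r, a);
    return r;
}

static uint8_t rotl8(uint8_t x, int n)
{
    assert(n > 0 && n < 8);
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* SubBytes for one byte, computed instead of looked up. */
static uint8_t sbox_byte(uint8_t a)
{
    uint8_t b = gf_inv(a);
    return (uint8_t)(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63);
}

/* MixColumns for one 4-byte column: multiply by the fixed matrix
   [2 3 1 1; 1 2 3 1; 1 1 2 3; 3 1 1 2] over GF(2^8). */
static void mix_column(const uint8_t a[4], uint8_t out[4])
{
    out[0] = (uint8_t)(gf_mul(a[0], 2) ^ gf_mul(a[1], 3) ^ a[2] ^ a[3]);
    out[1] = (uint8_t)(a[0] ^ gf_mul(a[1], 2) ^ gf_mul(a[2], 3) ^ a[3]);
    out[2] = (uint8_t)(a[0] ^ a[1] ^ gf_mul(a[2], 2) ^ gf_mul(a[3], 3));
    out[3] = (uint8_t)(gf_mul(a[0], 3) ^ a[1] ^ a[2] ^ gf_mul(a[3], 2));
}
```

For example, `gf_mul(0x57, 0x83)` gives 0xC1 and `sbox_byte(0x53)` gives 0xED, matching the worked examples in the AES standard.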
4. Target System (1)
SoC CPU – AMD A4 4000K (2 cores @ 3.0 GHz,
Richland architecture, AES-NI), cores denoted
by BLUE
SoC Integrated GPU HD7480 (iGPU), 2 SIMD
units of 64 cores each (VLIW4 architecture),
SIMD units denoted by RED
Discrete GPU AMD R7 250 (dGPU), 6 SIMD units
of 64 cores each (GCN architecture), PCIe 2.0
x16 bus, SIMD units denoted by RED
Data to be encrypted denoted by GREEN
Software – C/C++/OpenCL, Linux Ubuntu 14.04
x64
6. Algorithm Opt_1
• Array “indata” will reside in global device memory (__global)
• Variable “state”, which holds the intermediate transformations, will reside in on-chip local memory (__local), used as a GPU cache
• Simple operation “ShiftRows” is implemented with vector component addressing
(state.s05AF49E38..)
• Simple operation “AddRoundKey” is a simple XOR (state ^ key).
• Complex operation “SubBytes” will use precomputed tables of Sbox, stored in
constant memory
• Complex operation “MixColumns” will use precomputed tables of
Galois_FiniteField, stored in constant memory
• Host sample code below (simple blocking enqueues)
while (!done()) { writeData(32MB, &offset);
execKernel(32MB, &offset); readData(32MB, &offset); }
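The simple operations and the precomputed Galois-field tables listed above can be sketched in plain C as follows. This is an illustration under assumptions, not the project's kernel: it assumes the standard column-major 16-byte state layout, and the names `SHIFT_ROWS_IDX`, `MUL2` and `MUL3` are hypothetical. Under that layout the ShiftRows permutation begins 0, 5, A, F, 4, 9, E, 3, 8, which agrees with the vector selector prefix quoted in the bullet; in an OpenCL kernel the tables would be placed in constant memory rather than built at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Source index for each output byte of ShiftRows (column-major state:
   byte i sits in row i % 4, column i / 4; row r rotates left by r). */
static const uint8_t SHIFT_ROWS_IDX[16] = {
    0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11
};

/* ShiftRows as a pure byte permutation; in and out must not alias. */
static void shift_rows(const uint8_t in[16], uint8_t out[16])
{
    assert(in != out);
    for (int i = 0; i < 16; i++)
        out[i] = in[SHIFT_ROWS_IDX[i]];
}

/* AddRoundKey: byte-wise XOR of state and round key. */
static void add_round_key(uint8_t state[16], const uint8_t round_key[16])
{
    for (int i = 0; i < 16; i++)
        state[i] ^= round_key[i];
}

/* Precomputed multiply-by-2 and multiply-by-3 tables for MixColumns. */
static uint8_t MUL2[256];
static uint8_t MUL3[256];

static void build_gf_tables(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t d = (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
        MUL2[x] = d;
        MUL3[x] = (uint8_t)(d ^ x);   /* 3*x = 2*x + x in GF(2^8) */
    }
}
```

With the tables built, one MixColumns output byte becomes two lookups and three XORs, e.g. `MUL2[a0] ^ MUL3[a1] ^ a2 ^ a3`.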
12. Results Opt_3
• Right figure: AES-128 ECB
throughput in MB/s, serial
(Opt_2) vs. overlapped (Opt_3)
• Figure below: 3 OpenCL
queues (R, W, E) issue
asynchronous enqueues so that
kernel execution overlaps with I/O
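The effect of overlapping can be estimated with a simple pipeline model, sketched in C below. It is an idealized bound, not a measurement: it assumes the write, execute and read stages of different chunks can run fully concurrently, and the per-chunk stage times passed in are hypothetical inputs. Real gains depend on how much the stages contend for the same memory bandwidth.

```c
#include <assert.h>

static double max3(double a, double b, double c)
{
    double m = a > b ? a : b;
    return m > c ? m : c;
}

/* Serial schedule: every chunk pays write + execute + read in sequence. */
static double serial_time(int n, double w, double e, double r)
{
    assert(n >= 0);
    return n * (w + e + r);
}

/* Ideal three-stage pipeline: after the first chunk fills the pipeline,
   one chunk completes every max(w, e, r) time units. */
static double overlapped_time(int n, double w, double e, double r)
{
    assert(n >= 0);
    if (n == 0) return 0.0;
    return (w + e + r) + (n - 1) * max3(w, e, r);
}
```

The ratio `serial_time(...) / overlapped_time(...)` gives the best-case speedup for a given split of stage times; the closer the slowest stage is to the sum of all three, the smaller the benefit of overlapping.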
13. Conclusions
iGPU AES performance is good: faster than the CPU without AES-NI, although the CPU with AES-NI is fastest
Prefer cache over constant memory
Where possible, compare precomputed tables against computation on the fly
Overlapping execution with I/O could improve iGPU performance by 10-20%
The share of the x86 SoC die occupied by the iGPU grows with each generation, so its
contribution to AES throughput should grow as well
Memory transfers are expected to improve with each new generation, and with them
CPU/iGPU performance