Encryption Code Generator


University of Dublin
TRINITY COLLEGE

ENCRYPTION CODE GENERATOR

Paul Magrath
B.A. (Mod.) Computer Science
Final Year Project - May 2009
Supervisor: David Gregg
School of Computer Science and Statistics
O’Reilly Institute, Trinity College, Dublin 2, Ireland

Declaration

I hereby declare that this thesis is entirely my own work and that it has not been submitted as an exercise for a degree at any other university.

_________________________________ April 24th, 2009
Paul Magrath

Permission to Lend

I agree that the Library and other agents of the College may lend or copy this thesis upon request.

_________________________________ April 24th, 2009
Paul Magrath

Acknowledgements

To David Gregg, for his support and advice throughout this project.
To Laura, for the love you have given me and for putting up with me.

Table of Contents

Declaration
Permission to Lend
Acknowledgements
1. Motivation
  Introduction
  Readers’ Guide to the Report
    Background
    AES Code Generator
    Experimental Results
    Conclusions
    References
2. Background
  Encryption
  AES
  SIMD
  SSE
  Optimization
3. AES Code Generator
  I. Correctness: An AES-256 implementation
  II. The Generator
    A. Streaming store
    B. Unwind inner loop
    C. Use local variables
    D. Unwind outer loop
    E. Interleave
    F. OpenMP
    G. Prefetch to cache
    H. Preload to register
  III. Simulating
4. Experimental Results
  Intel Core 2 Quad 2.4GHz
    Sequential
    Parallel
  Intel Core 2 Duo 2.16GHz
    Sequential
    Parallel
  Intel Pentium 4 Dual Processor
    Sequential
    Parallel
5. Conclusions
  Contributions
  Future Work
References

1. Motivation

Introduction

In the world today, encryption is vital. Without it, there is no security, freedom or privacy at all. Governments, corporations, strangers and criminals all routinely attempt to gather as much information as possible about us and what we do, online and in real life, in order to profile us, tempt us, learn about us and defraud us respectively. Every day, with social networking and online records, more and more of this information becomes available, and even more still when measures such as deep packet inspection are turned to. Encryption is a solution to these problems in that it gives us a modicum of control over who can access the data we distribute, who can listen to our calls, read our mail, and read our bank statements.

This project investigates using a code generator to generate the various variants of an Advanced Encryption Standard (AES) encryption loop. AES (see ‘Background’, chapter 2) is a form of encryption that Intel will begin supporting in hardware in its microprocessors in 2010.
An encryption loop is the iterative loop that works through all the data to be encrypted and performs the steps necessary to encrypt it. It is a computationally expensive loop, requiring a large amount of CPU time, and is hence a candidate for optimization in order to reduce the time taken.

A code generator, in this context, is a program that generates a number of different variants of a piece of code in order to find which combination of optimization techniques yields the best possible result (see ‘AES Code Generator’, chapter 3, for the variants employed here). The use of code generators is an established technique, used primarily in the research community, for solving optimization problems on modern architectures. It is ideal for optimizing a small piece of code that uses a vast amount of processing time and for which the best optimizations are not obvious. Code generators have been successfully applied in several projects, such as [1] and [2]. The motivation for a code generator is to avoid the problems with code maintenance and readability that almost inevitably result from hand-tuned assembly specific to the architecture it is written for. A code generator can tune itself to the architecture it is running on to find the best combination of optimizations for that architecture, while remaining readable and maintainable, as it can be written in a high-level language such as C++.

In 2010 Intel will release processors with AES instructions built in to their instruction set. This will greatly reduce the cost of encryption, as it will be possible to perform the encryption much more quickly and efficiently than before. This is part of an enhancement of, and replacement for, the current Intel SIMD instruction set, the Streaming SIMD Extensions (SSE). The replacement instruction set will be known as the Intel Advanced Vector Extensions (AVX).

This project examines the most likely ways that the use of such instructions in a standard AES encryption loop could be optimised, shows how this can be automated using a code generator, and presents the results of running this AES code generator to produce these optimisation combinations on different architectures and processors.

Readers’ Guide to the Report

Background

A summary of what readers should understand and be aware of when considering the report, particularly Encryption, AES, SIMD, SSE, OpenMP and Optimisation.

AES Code Generator

An outline and discussion of the various variants supported by the generator that was implemented as part of the project, as well as an explanation of how the correctness of the input and output AES encryption loops was confirmed.

Experimental Results

Tables and diagrams summarising and demonstrating the results obtained by timing and measuring the outputs of the Encryption Code Generator using hardware performance counters, or other means as available.

Conclusions

A discussion of the conclusions that can be drawn from the results, and of possible future work that can build upon this project.

References

A systematic and complete reference to the sources used, and a classified list of all sources.

2. Background

Encryption

Encryption is the translation of data into a secret code, done in an attempt to keep information secure. To read an encrypted file, you must have access to a secret key or password that enables you to decrypt it. Hence, third parties without access to the shared secret key, such as online criminals, curious neighbours and oppressive governments, are unable to easily (or, in the case of strong encryption, at all) access the data or information that has been encrypted.
Unencrypted data is called plaintext, while encrypted data is referred to as ciphertext.

There are two main types of encryption: asymmetric encryption (also known as public key encryption) and symmetric encryption. Asymmetric encryption is a form of encryption where keys come in pairs: what one key encrypts, only the other can decrypt. The encryption key is usually made public and distributed freely, as only the holder of the decryption key is able to read data that has been encrypted with the encryption key. Symmetric encryption is a form of encryption where the same key is used for both encryption and decryption. The key must be kept secret, and is shared by the message sender and recipient. This form of encryption can usually be performed much more quickly and efficiently than asymmetric encryption. In practice, asymmetric encryption is usually used over an insecure communication channel to exchange the secret key for the symmetric encryption that will be used for the rest of the communication. This technique effectively combines the strengths of the two forms of encryption and is the basis of the SSL/TLS family of encryption protocols, which includes the HTTPS used every day for online banking and shopping.

AES

AES (Advanced Encryption Standard) is one of the most popular algorithms used in symmetric encryption.

Originally published as Rijndael [3], it was adopted as a standard by the U.S. government in November 2001 [4], after a five-year standardization process involving fifteen competing designs. The standard comprises three block ciphers, AES-128, AES-192 and AES-256, adopted from a larger collection. A block cipher is a cipher that operates on fixed-length groups of bits, termed blocks, with an unvarying transformation. The block cipher takes two inputs, the plaintext of the block and the secret key, and outputs the ciphertext (encrypted text) of the block. Each AES cipher has a 128-bit block size, which means that 128 bits of plaintext are encrypted into ciphertext in each iteration of the encryption loop. AES-128, AES-192 and AES-256 have secret keys of 128, 192 and 256 bits respectively, with AES-128 the least secure and AES-256 the most secure.

When the length of data to be encrypted exceeds the block size, a mode of operation must be used [5]. The two that we will concern ourselves with are Electronic Code Book (ECB) and Counter (CTR). These will be discussed in detail in the AES Code Generator (chapter 3).

An outline of the algorithm for AES-256 encryption is:

Key Expansion
Using the Rijndael key schedule, the 14 round keys are extracted from the 256-bit secret key.

Initial Round (round 0)
The initial round is simply the bitwise XOR of the round key with the plaintext (referred to as the ‘state’ during the encryption).

Rounds 1 through 12
Each of these rounds comprises a non-linear substitution step, a transposition step, a mixing step, and a bitwise XOR of the round key with the state.

Final Round (round 13)
The final round is identical to Rounds 1 through 12 except that the mixing step is omitted.

It should be noted that the key expansion only has to be performed once for any given secret key. Hence, the encryption loop is composed only of the Initial Round, the Rounds, and the Final Round, as the round keys extracted during key expansion can be reused for however many iterations of the encryption loop it takes to encrypt the entirety of the plaintext.

SIMD

SIMD (Single Instruction, Multiple Data) is a technique employed to achieve data-level parallelism. Parallelism is when calculations are carried out simultaneously. In SIMD computer architecture, the computer exploits multiple data streams against a single instruction stream in order to perform operations that may be easily parallelized [6].
By processing multiple data elements in parallel, SIMD processors provide a way to exploit data parallelism in applications that apply a single operation to all elements of a data set, such as a vector or matrix [7].

SSE

Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to the x86 architecture. It was designed by Intel and introduced with the Pentium III microprocessor family.

SSE originally added eight new 128-bit registers (a register is a memory device usually used to store local variables) known as XMM0 through XMM7. The 64-bit variant of the x86 architecture, x86-64, has a further eight registers, XMM8 through XMM15. XMM0 through XMM15 can be accessed in 64-bit operating mode, while only XMM0 through XMM7 can be accessed in 32-bit operating mode.

Each register packs together four 32-bit single-precision floating point numbers, two 64-bit double-precision floating point numbers, four 32-bit integers, eight 16-bit short integers, or sixteen 8-bit bytes or characters. (Hence, each register can hold the contents of an entire 128-bit AES block.) There have been a number of iterations of SSE, each of which has added a number of new instructions. SSE4 is the version of SSE supported by the current Intel Core microarchitecture. [8]

AVX (Advanced Vector Extensions) is an advanced version of SSE, which will appear in Intel products in 2010, and which features a 256-bit data path (widened from 128 bits in SSE4). AVX will provide six new instructions for symmetric encryption/decryption using the Advanced Encryption Standard (AES) and one instruction performing carry-less multiplication (PCLMULQDQ), which aids in performing finite field arithmetic (a type of arithmetic used in advanced block cipher encryption).
These hardware-based primitives provide a security benefit apart from their speed advantage: by avoiding table lookups they protect against software side-channel attacks (attempts to discover the secret key by observing and analyzing the flow of information in the computer during the encryption process). [9]

However, we do not need to program in assembly in order to utilise any of these extensions to the instruction set. Instead we can use intrinsics. Intrinsics are special functions for which the compiler generally has a specific optimisation path and which the compiler encodes as one or more machine instructions. They represent a good middle ground between speed of execution and ease of use for the programmer. With modern optimising compilers, particularly the Intel C++ Compiler, we can get nearly as good results as the best hand-tuned assembly code, without compromising code readability or having to bother with register management.

OpenMP

The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on many platforms, including Linux, Windows and Mac OS X. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. OpenMP gives programmers a simple model and interface for developing parallel applications [10].

The Hello World example below demonstrates how easy it is to create additional threads to carry out work using the API provided by OpenMP:

int main(int argc, char* argv[])
{
    #pragma omp parallel
    printf("Hello World!\n");
    return 0;
}

The above program will print “Hello World!” once for each thread that OpenMP creates (by default, one per processor core present on the machine). Similarly, for a for loop:

int main(int argc, char **argv)
{
    const int N = 100000;
    int i, a[N];

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2 * i;

    return 0;
}

Optimization

Optimization is the process of tuning the output of a compiler to minimize or maximize some attribute of an executable computer program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied. Optimization is usually applied by the compiler, which will generally attempt to generate the fastest possible code. However, a programmer will sometimes attempt to help the compiler by performing some optimizations manually, as (s)he, in theory, knows the algorithms better.

The general themes of optimizing programs are:

Optimize the common case
The common case may have unique properties that allow a very quick computation at the expense of a very slow computation for certain less common cases. If the common case is taken most often, the result can be better overall performance.

Avoid redundancy
Reuse results that are already computed; store them for later use instead of re-computing them.

Less code
Remove unnecessary computations and intermediate values. Less work for the CPU, cache, and memory usually results in faster execution. Additionally, in embedded systems, less code requires less memory and brings a lower product cost.

Straight-line code
Less complicated code, with fewer jumps and conditional branches, will be faster, as jumps and branches interfere with the prefetching of instructions and thus slow down code.
Locality
Code and data that are accessed closely together in time should be placed close together in memory to increase spatial locality of reference, and hence reduce the amount of register loading.

Manage memory efficiently
Place the most commonly used items in registers first, then caches, then main memory, before going to disk.

Parallelize
Reorder operations to allow multiple computations to happen in parallel, either at the instruction level (using SIMD instructions such as those in the Intel SSE instruction set), the memory level, or the thread level (using an API like OpenMP). [11]

3. AES Code Generator

The AES Code Generator is discussed here under a number of sections:

Correctness: An AES-256 implementation
The Generator
Simulating
Testing

I. Correctness: An AES-256 implementation

The correctness of the input and output is confirmed by means of a customised AES-256 implementation.

The implementation described in this report emulates AES-256 and was based upon a byte-oriented portable C implementation in which all the lookup tables had been replaced with “on-the-fly” calculations [12]. The implementation is fully compliant with the specification and is highly portable. There was no assembler in the original code, but I later wrapped the code with vector functions (SSE functions) and vectorised some, but not all, of it. This was done so that the function signatures (i.e. the names and parameters of the functions) would be compatible with those of the AES instructions that Intel will introduce as part of AVX (the Advanced Vector Extensions; see Background, chapter 2).

Since the purpose of including the AES code is to check correctness, the slow speed of this implementation is not a problem. It is provided as a simple means of verifying that the implemented variants do not change the basic algorithm of the AES encryption loop that is the input to the generator.

II. The Generator

The generator system generates a large number of simple variations of a basic AES encryption loop. These various modifications of the code can then be run on a particular model of processor, and with various compiler switches, to find the best variant for that particular processor.

The generator was tested with two block cipher modes of operation of 256-bit AES: Electronic Code Book (ECB) and Counter (CTR). Electronic Code Book is the simplest of the encryption modes: the data to be encrypted is divided into blocks and each block is encrypted separately. However, identical plaintext blocks are encrypted into identical ciphertext blocks, so data patterns are easily recognised in the ciphertext. Hence, it is not recommended for use in cryptographic protocols at all. Counter mode, on the other hand, turns a block cipher into a stream cipher. Instead of encrypting the data itself, successive values of a “counter” are encrypted. The counter is usually concatenated, added or XORed with an initialization vector to produce the unique counter value for each block. The encrypted counter is then XORed with the actual text to be encrypted in order to form the ciphertext.

The generator can take as its input file an encryption loop in either mode of operation, Electronic Code Book (ECB) or Counter (CTR). These two modes were chosen because they were the easiest to parallelise. There are a number of different options available for generating the variants. Each of these options is, in essence, a distinct optimization, or set of possible configurations of optimizations, that can be performed. The options are:

Streaming store
Unwind inner loop
Use local variables
Unwind outer loop
Interleave
OpenMP (parallel)
Prefetch to cache
Preload to register

In the following sections, these options are described in more detail.

A. Streaming store

In the Streaming Store option, we store the result to memory using an SSE instruction instead of a standard memory assignment. This variant uses the _mm_stream_si128 intrinsic to store the result directly to memory without polluting the caches. As with all the options to the generator, specifying it will generate variants with the option both enabled and disabled.

Before:

result = encrypt_final(result, *keys);
result = _mm_xor_si128(result,source[i]);
dest[i] = result;

After:

result = encrypt_final(result, *keys);
result = _mm_xor_si128(result,source[i]);
_mm_stream_si128(&(dest[i]),result);

Normally, when we write to memory, the cache is updated with the contents of the write so that if there is a read request for that information soon after the write, it can be recalled quickly. However, in this situation the information being written is the ciphertext that has just been encrypted, and we want to keep the cache for memory that we are going to access again, such as the round keys and the plaintext to be encrypted.

However, this is an option that can be turned on or off, as the use of these instructions can interfere with the compiler’s own optimizations. The generator can therefore experiment with both versions in combination with many other options.

B. Unwind inner loop

This variant unwinds the inner loop to the extent specified in the argument.

Loop unrolling is a technique that attempts to increase the execution speed of a program at the expense of its size. The loop is rewritten as a sequence of independent statements, reducing (in this case, eliminating) the overhead of evaluating the loop condition on each iteration and reducing the number of jumps and conditional branches that need to be executed.

There are two side effects of loop unrolling.
These are increased register usage within a single iteration to store temporary variables (though not in this case, as we are completely eliminating the loop rather than just unwinding it a little), and code size expansion after the unrolling. Large code size can lead to an increase in instruction cache misses.

In this case, the loop is quite short (equal to the number of keys, 14), and hence a significant speed boost should be observed due to the removal of the control variable check, the 14 jumps and the conditional branches from the control flow.

Before:

for (int j = 1; j < nKeys; j++ ) {
    result = encrypt_round(result, *(keys+j));
}

After:

result = encrypt_round(result, *(keys+1));
result = encrypt_round(result, *(keys+2));
result = encrypt_round(result, *(keys+3));
...
result = encrypt_round(result, *(keys+12));
result = encrypt_round(result, *(keys+13));

Loop unrolling can also help the compiler and the processor perform their own optimizations, such as instruction scheduling.

C. Use local variables

This variant uses local variables for the AES round keys instead of memory accesses (the round keys are sub-keys for the individual rounds, extracted from the cipher key using the Rijndael key schedule). This involves defining the variables, assigning them the round keys from their memory locations, and updating all references in the input file from the memory locations to the variables. The idea is that assigning the round keys to variables should be a strong hint to the compiler to keep the round keys in registers rather than performing a memory access (it is faster to access registers than the L1 cache, where the round keys will probably be).

It is also observable that the number of round keys stored in registers can have a negative impact on run time.
This is due to the impact that storing them in registers has on the number of registers available for other purposes, such as storing temporary values like intermediate results. This is particularly relevant when a high level of unwinding of the outer loop has also occurred, especially when it has been interleaved as well.

As such, up to 2^14 different variants of the number of round keys stored in local variables, as opposed to accessed from memory, can be generated: there are 14 round keys that could each be stored in a local variable, and hence 2^14 different combinations of round keys in local variables versus memory accesses. However, for testing purposes, only the 14 combinations formed by loading successive round keys into local variables were examined.

Before:

for ( i = 0; i < limit; i++ ) {
    ...
    result = encrypt_round(result, *(keys+1));
    result = encrypt_round(result, *(keys+2));
    result = encrypt_round(result, *(keys+3));
    ...
    result = encrypt_round(result, *(keys+12));
    result = encrypt_round(result, *(keys+13));

After:

const vector_type key0 = keys[0];
const vector_type key1 = keys[1];
const vector_type key2 = keys[2];
const vector_type key3 = keys[3];
...
const vector_type key12 = keys[12];
const vector_type key13 = keys[13];

for ( i = 0; i < limit; i++ ) {
    ...
    result = encrypt_round(result, (key1));
    result = encrypt_round(result, (key2));
    result = encrypt_round(result, (key3));
    ...
    result = encrypt_round(result, (key12));
    result = encrypt_round(result, (key13));

Using local variables for the round keys has the further bonus that it allows the compiler to apply other optimizations much earlier in the process than if the round keys are left as elements of an array, as it can be difficult for the compiler to prove that the array is unaliased (i.e. does not reference the same location or variables as another pointer).

D. Unwind outer loop

This variant unwinds the outer loop to the extent specified in the argument. The technique, and its side effects, are the same as with unwinding the inner loop, except that the effect is much greater as a result of the greater number of instructions involved.

There is a certain threshold where the returns from the unwinding begin to rapidly diminish; beyond it, the effects of the limited number of registers and of instruction cache misses take hold and run time increases again.

Before:

for ( i = 0; i < limit; i++ ) {
    vector_type result;
    result = _mm_add_epi64( nonce,_mm_set_epi32(0,0,0,i) );
    // initial round
    result = encrypt_initial(result, enckey);
    // encryption
    result = encrypt_round(result, (key1));
    result = encrypt_round(result, (key2));
    ...
    result = encrypt_round(result, (key12));
    result = encrypt_round(result, (key13));
    // final round
    result = encrypt_final(result, key0);
    result = _mm_xor_si128(result,source[i]);
    dest[i] = result;
} // end outer loop

After:

for ( i = 0; i < (limit-2); i+=2) {
    vector_type result;
    // iteration 0
    result = _mm_add_epi64( nonce,_mm_set_epi32(0,0,0,i) );
    // initial round
    result = encrypt_initial(result, enckey);
    // encryption
    result = encrypt_round(result, (key1));
    result = encrypt_round(result, (key2));
    ...
    result = encrypt_round(result, (key12));
    result = encrypt_round(result, (key13));
    // final round
    result = encrypt_final(result, key0);
    result = _mm_xor_si128(result,source[i]);
    dest[i] = result;
    // end of original outer loop
    // iteration 1
    result = _mm_add_epi64( nonce,_mm_set_epi32(0,0,0,i+1) );
    // initial round
    result = encrypt_initial(result, enckey);
    // encryption
    result = encrypt_round(result, (key1));
    result = encrypt_round(result, (key2));
    ...
    result = encrypt_round(result, (key12));
    result = encrypt_round(result, (key13));
    // final round
    result = encrypt_final(result, key0);
    result = _mm_xor_si128(result,source[i+1]);
    dest[i+1] = result;
    // end of original outer loop
} // end unrolled loop

E. Interleave

This variant interleaves an unwound outer loop to the extent specified in the argument. The idea is that by interleaving the loop, we reduce the number of instructions stalling on the result of a previous instruction.

Since each iteration deals with a different intermediate result, and since each iteration uses its own intermediate result a number of times as it works through the encryption rounds, it makes sense to place a number of operations of the same round, rather than of the same iteration, one after another. Hence, the delay that would exist within an iteration between the first and second rounds, while waiting for the result of the first, is filled by calculating the first round of the next iteration.

It is also potentially very important for performance that the keys are used multiple times in sequence, so that each key need only be loaded once for each sequence of instructions using it.

Before:

for ( i = 0; i < (limit-2); i+=2) {
    vector_type result;
    result = _mm_add_epi64( nonce,_mm_set_epi32(0,0,0,i) );
    // initial round
    result = encrypt_initial(result, enckey);
    // encryption
    result = encrypt_round(result, (key1));
    result = encrypt_round(result, (key2));
    ...
    result = encrypt_round(result, (key12));
    result = encrypt_round(result, (key13));
    // final round
    result = encrypt_final(result, key0);
    result = _mm_xor_si128(result,source[i]);
    dest[i] = result;
    // end of original outer loop
    result = _mm_add_epi64(nonce,_mm_set_epi32(0,0,0,i+1));
    // initial round
    result = encrypt_initial(result, enckey);
    // encryption
    result = encrypt_round(result, (key1));
    result = encrypt_round(result, (key2));
    ...
    result = encrypt_round(result, (key12));
    result = encrypt_round(result, (key13));
    // final round
    result = encrypt_final(result, key0);
    result = _mm_xor_si128(result,source[i+1]);
    dest[i+1] = result;
    // end of original outer loop
} // end unrolled loop

After:

for ( i = 0; i < (limit-2); i+=2) {
    vector_type result0;
    vector_type result1;
    result0 = _mm_add_epi64( nonce,_mm_set_epi32(0,0,0,i) );
    result1 = _mm_add_epi64( nonce,_mm_set_epi32(0,0,0,i+1) );
    // initial round
    result0 = encrypt_initial(result0, enckey);
    result1 = encrypt_initial(result1, enckey);
    // encryption
    result0 = encrypt_round(result0, (key1));
    result1 = encrypt_round(result1, (key1));
    result0 = encrypt_round(result0, (key2));
    result1 = encrypt_round(result1, (key2));
    result0 = encrypt_round(result0, (key3));
    result1 = encrypt_round(result1, (key3));
    ...
    result0 = encrypt_round(result0, (key11));
    result1 = encrypt_round(result1, (key11));
    result0 = encrypt_round(result0, (key12));
    result1 = encrypt_round(result1, (key12));
    result0 = encrypt_round(result0, (key13));
    result1 = encrypt_round(result1, (key13));
    // final round
    result0 = encrypt_final(result0, key0);
    result1 = encrypt_final(result1, key0);
    result0 = _mm_xor_si128(result0,source[i]);
    result1 = _mm_xor_si128(result1,source[i+1]);
    dest[i] = result0;
    dest[i+1] = result1;
    // end of original outer loop
} // end unrolled loop

F. OpenMP

This variant includes the OpenMP pragma directives that are normally commented out for the other variants.
This allows for the investigation of parallel versions of the code, for running on processors with multiple cores and/or simultaneous multithreading.

Any of the other variants can be combined with the OpenMP variant to create both threaded and non-threaded versions of the same code, to investigate which is the most efficient of the options.

Before:

 for ( i = 0; i < (limit-2); i+=2) {
  vector_type result0;
  vector_type result1;
  vector_type src0;
  vector_type src1;

After:

 #pragma omp parallel for
 for ( i = 0; i < (limit-2); i+=2) {
  vector_type result0;
  vector_type result1;
  vector_type src0;
  vector_type src1;

G. Prefetch to cache

This variant uses the _mm_prefetch SSE intrinsic to prefetch source data into the cache. The instruction loads one cache line of data from the given address to a location closer to the processor. The idea is to prime the cache for the next iteration(s) of the encryption loop.

However, a balance must be found: priming the cache too far ahead can poison it (if we prefetch too many lines, or prefetch them too soon, lines in the cache that we still need may be evicted). Both the choice of source line and the number of iterations ahead at which it is prefetched affect the speedup gained, so the generator must produce a large number of variants in order to find the one that prefetches the source data just far enough ahead.

Example:

 _mm_prefetch((char const*)&source[i+2], _MM_HINT_T0);

H. Preload to register

This variant uses a local variable to preload source data into a register. The idea is to touch the L1 cache, priming it with the line containing the source data value needed in the next iteration(s) of the encryption loop.
This is important because the prefetch SSE instruction only brings data into the L2 cache.

However, only a limited number of registers are available for use. Hence, the generator must generate a large number of variants in order to find the version that leads to an improvement in runtimes.

Example:

 const vector_type src2 = source[i+2];

Most Intel microprocessors are also capable of prefetching in hardware, so attempts at software prefetching sometimes have little, no, or even a negative effect if the hardware prefetcher does a better job.

III. Simulating

The new architecture that Intel will release in 2010 introduces six SSE instructions that facilitate encryption. Four of them, namely AESENC, AESENCLAST, AESDEC, and AESDECLAST, facilitate high-performance AES encryption and decryption. The other two, namely AESIMC and AESKEYGENASSIST, support the AES key expansion procedure. Together, these instructions will provide full hardware support for AES, offering security, high performance, and a great deal of flexibility. [13]

In order to simulate the performance effect of the AES instructions without the hardware that supports them, the generator supports replacing the AES encryption instructions in the input code with instructions of various latencies (provided in #defines in the input code) that serve as proxies for the purposes of testing the effect of optimizations. This allows the code to be tested with several different latencies, giving an idea of the effect of the speedups implemented by the variants. This is done only with the four instructions that facilitate high-performance AES encryption and decryption.
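As an illustrative sketch of such a substitution: the name encrypt_round follows the generated code shown earlier, but the guard macro HAVE_AES_HARDWARE and the particular proxy chosen here (a 64-bit add feeding an XOR) are assumptions, picked only so that the latency and the state/key dependency chain loosely resemble the real instruction — not necessarily what the generator's input code actually contains.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: swap the round function between real AES
 * hardware and an SSE2 proxy of comparable latency via a #define.
 * HAVE_AES_HARDWARE is an assumed flag for illustration only. */
#ifdef HAVE_AES_HARDWARE
#include <wmmintrin.h>  /* AES-NI intrinsics */
#define encrypt_round(state, key) _mm_aesenc_si128((state), (key))
#else
/* Proxy: depends on both state and key, so the dependency chain
 * through the encryption rounds matches the real instruction. */
#define encrypt_round(state, key) \
    _mm_xor_si128(_mm_add_epi64((state), (state)), (key))
#endif
```

Because the proxy keeps the same state/key signature, the generated loop bodies need no changes when switching between the proxy and the eventual hardware instruction.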
The two instructions supporting AES key expansion are not emulated: key expansion is performed only once, regardless of the amount of plaintext to be encrypted, and is relatively quick, so there is little benefit in studying speedups to it.

For the purposes of this project, instructions with latencies of two, three and five cycles respectively were chosen for comparison, using the available Intel documentation [12]. Intel documentation published to date [13] indicates that the initial chips supporting AES encryption in hardware will execute the instructions in six cycles. As such, the figures for five-cycle-latency instructions are the most relevant to the first generation of hardware supporting the instructions (there being no six- or seven-cycle-latency instructions currently in the SSE instruction set). The figures for two- and three-cycle latencies, however, represent what we can expect from the second or third generation of hardware.

IV. Testing

In order to test my code and obtain my experimental results, I ran the generated code under a number of compiler flags and architectures: both the GNU C Compiler and the Intel C Compiler; 32-bit and 64-bit multiprocessors; and machines with a single core, dual cores, and eight cores.

To generate the detailed results presented in the next section, I used PapiEx. PapiEx is a performance analysis tool designed to transparently and passively measure the hardware performance counters of an application using PAPI [15]. It uses Monitor to intercept process/thread creation and destruction. The Performance API (PAPI) project specifies a standard application programming interface (API) for accessing the hardware performance counters available on most modern microprocessors. [14] Monitor is a library that gives the user callbacks or traps upon events related to library/process/thread initialization, creation and destruction.
[17]

Using PapiEx, I was able to get accurate information on the number of instructions issued, executed and completed, the number of data cache misses and the number of stall cycles, as well as the total number of cycles execution took.

At the time of writing, the Intel C Compiler did not yet support the compilation of AES instructions; however, it should be straightforward to test the correctness of the output using the Intel Software Development Emulator [18].

4. Experimental Results

The experimental results illustrated below showcase the effects of some of the variants. The data for these graphs comes from running quite small subsets of the possible output of the generator and graphing the intermediate results: specifically, the generator's medians of the runtimes from twenty-five consecutive runs of each variant. By default, the generator runs through all the possible combinations; when passed arguments, as here, it generates only the combinations of the variants specified. Normally, then, the 'winner' variant printed by the generator at the end is all one would be concerned with; the detailed graphing and discussion of the effects of each variant and combination here is simply to demonstrate what is going on inside the generator.

The AES Code Generator can be run on practically any system with a working C++ compiler that supports SSE appropriate to the microprocessor it is installed on. However, to get the most out of it, one should use a compiler that also supports OpenMP; the GNU C++ Compiler (v4.2+) and the Intel C++ Compiler (v10.1+) are hence the best choices.

In general, the AES Code Generator simply has to be deployed to the target machine and executed in order for it to build itself, generate the variants, test them, and report back the fastest variant.
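The generate, measure, and compare cycle described above can be sketched as follows. This is a minimal illustration, not the generator's actual source: the variant count, the measure_cycles stand-in, and its fake cycle figures are assumptions; only the selection by median of twenty-five runs mirrors the behaviour described in the text.

```c
#include <stdlib.h>

#define RUNS 25       /* the generator takes the median of 25 consecutive runs */
#define NVARIANTS 2   /* a real run permutes many more variants than this */

static int cmp_cycles(const void *a, const void *b) {
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

/* Median of an odd-length array of cycle counts. */
static unsigned long long median(unsigned long long *t, int n) {
    qsort(t, n, sizeof *t, cmp_cycles);
    return t[n / 2];  /* n is odd (25), so this is the true median */
}

/* Stand-in for building and timing one generated variant; the real
 * generator compiles the variant and reads the hardware counters. */
static unsigned long long measure_cycles(int variant, int run) {
    static const unsigned long long base[NVARIANTS] = { 350000ULL, 210000ULL };
    return base[variant] + (unsigned long long)(run % 3);  /* a little jitter */
}

/* Run every variant RUNS times and return the index of the variant
 * with the lowest median runtime: the "winner" reported at the end. */
static int pick_winner(void) {
    int best_v = 0;
    unsigned long long best_m = ~0ULL;
    for (int v = 0; v < NVARIANTS; v++) {
        unsigned long long t[RUNS];
        for (int r = 0; r < RUNS; r++)
            t[r] = measure_cycles(v, r);
        unsigned long long m = median(t, RUNS);
        if (m < best_m) { best_m = m; best_v = v; }
    }
    return best_v;
}
```

Taking the median rather than the mean makes the comparison robust against the occasional run inflated by scheduler noise or a cold cache.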
The optimization process is entirely automated, and no further user intervention is needed.

To demonstrate this, data generated by the AES Code Generator will be shown from three different architectures:

Intel Core 2 Quad 2.4GHz (4 cores)
Intel Core 2 Duo 2.16GHz (2 cores)
Intel Pentium 4 Dual Processors

Intel Core 2 Quad 2.4GHz

Figure 1: Effect on runtime in terms of processor cycles of combinations of Streaming Store and of Parallel (on an Intel Core 2 Quad)

The graph above shows the base case (Sequential) along with the possible combinations of the OpenMP variant (Parallel) and the Streaming Store variant.

As you can see, the effect of using OpenMP can be quite significant, with a speedup of about 250%, while the Streaming Store seems to be largely ineffective, perhaps because the hardware prefetcher (the part of the processor that speeds up the flow of data accessed by the program [20]) negates the effect of not using the instruction. The effect of using multiple threads is not proportional to the number of threads used: even when there is work for all four cores, the processor has other bottlenecks that constrict the speed at which it can execute instructions. Memory bus bandwidth is limited and, in a case such as this where a large amount of data is being processed, could quite easily become the bottleneck.

A number of variants are possible, but we will look next at the effect of putting the round keys (the encryption keys used for the individual rounds of the AES algorithm; see 'Background', chapter 2) into local variables.

Sequential

Figure 2: Effect on runtime in terms of processor cycles of combinations of Streaming Store, of Parallel and of using Local Variables to store round keys (on an Intel Core 2 Quad)

In the chart above, you can see the effect of combinations of three simple variants: OpenMP, Streaming Store and the use of local variables to store the round keys.
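The local-variables variant referred to here hoists round keys out of memory into locals so that the compiler can keep them in XMM registers across the loop. A minimal sketch of the idea follows; the function name, the round_keys array, and the XOR standing in for encrypt_round are illustrative assumptions, not the generator's exact output.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch of the "use local variables" variant: copying
 * round keys from a (memory-resident) array into const locals
 * encourages the compiler to keep them in XMM registers for the
 * whole encryption loop instead of reloading them each iteration. */
void encrypt_blocks(__m128i *dest, const __m128i *source,
                    const __m128i *round_keys, long limit)
{
    /* Hoist the first few round keys into locals; the generator
     * varies how many keys are hoisted (the x-axis of Figure 2). */
    const __m128i key1 = round_keys[1];
    const __m128i key2 = round_keys[2];

    for (long i = 0; i < limit; i++) {
        __m128i result = source[i];
        /* Plain XORs standing in for the real round transforms: */
        result = _mm_xor_si128(result, key1);
        result = _mm_xor_si128(result, key2);
        dest[i] = result;
    }
}
```

With too few registers available, hoisting many keys forces spills back to the stack, which is one plausible reason the optimal number of locals differs between architectures in the results below.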
The unwinding of the inner loop has also been performed, as that optimization is a prerequisite for using local variables to store the round keys. Each of the four possible combinations of use and non-use of OpenMP (Parallel) and Streaming Store is represented by one of the four colours, while the number of local variables used is on the x-axis. The y-axis is the number of CPU cycles taken by the encryption loop.

As you can see, the use of multiple threads in the parallel version provides a considerable speed boost to which the use of local variables is largely irrelevant, whereas in the sequential version the use of about nine local variables seems to be best. In either case, the use of Streaming Store seems largely irrelevant, indicating that the hardware prefetcher in the microprocessor and the compiler are doing an excellent job without any help from the Streaming Store.

Figure 3: Effect on runtime in terms of processor cycles of varying levels of Unwinding and numbers of round keys in local variables (on an Intel Core 2 Quad)

Figure 4: Effect on runtime in terms of issued instructions of varying levels of Unwinding and numbers of round keys in local variables (on an Intel Core 2 Quad)

Figure 5: Effect on runtime in terms of completed instructions of varying levels of Unwinding and numbers of round keys in local variables (on an Intel Core 2 Quad)

Figure 6: Effect on runtime in terms of processor stall cycles of varying levels of Unwinding and numbers of round keys in local variables (on an Intel Core 2 Quad)

Figure 3 shows the effect of local variables combined with unwinding on the runtime. As you can see, the combination of a high number of round keys in local variables and a high level of unwinding is very effective here. Figures 4 through 6 show the effect of the optimizations using other metrics: issued instructions, completed instructions and stall cycles respectively.
Lower is better for all of these metrics, with the minimization of stall cycles the most important of the three. In all of them we see a similar pattern, indicating that a high number of round keys in local variables and a high level of unwinding are very effective when combined.

Figure 7 below demonstrates the effect of the fully unwound outer loop when combined with the Streaming Store option. As you can see, the fastest variants are again clearly those with a high level of unrolling and a high number of round keys in local variables, but the fastest here are slower than the fastest of the previous figures, where streaming store was disabled.

Figure 7: Effect on runtime in terms of processor cycles of varying levels of Unwinding and numbers of round keys in local variables when Streaming Store is used (on an Intel Core 2 Quad)

Clearly, the streaming store optimization is not worthwhile for this architecture, so it will not be looked at again here. We can postulate that the hardware prefetcher in the microprocessor and the compiler are doing an excellent job without any help from the Streaming Store.

A further optimization of unrolling the outer loop is to interleave that loop as well.
The effect of this is shown in Figure 8 below.

Figure 8: Effect on runtime in terms of processor cycles of varying levels of Interleaving and numbers of round keys in local variables (on an Intel Core 2 Quad)

Figure 9: Effect on runtime in terms of processor cycles of varying levels of Interleaving and numbers of round keys in local variables (on an Intel Core 2 Quad)

Figure 10: Effect on runtime in terms of issued instructions of varying levels of Interleaving and numbers of round keys in local variables (on an Intel Core 2 Quad)

Figure 11: Effect on runtime in terms of completed instructions of varying levels of Interleaving and numbers of round keys in local variables (on an Intel Core 2 Quad)

As you can see from Figure 8, the fastest variants are clearly those with high levels of interleaving and a fair number of round keys in local variables, as demonstrated by their runtimes in the range of 200,000 to 300,000 cycles.

Comparing the variants with unrolling alone to the variants with interleaving, one can clearly see a considerable speed advantage for the interleaved versions. This is likely due to the reduction in cache misses that results from keeping operations with the same round key together, rather than operations with the same state and result.
When unrolling alone is used, it will often still be necessary to load the round keys into registers from the caches, and there is a considerable length of time between uses; when the code is interleaved, all the operations with the same round key are performed without interruption.

From Figures 8 through 11, compared to the corresponding graphs for unwinding alone (Figures 3 through 6), we can clearly see that code generated by the AES Code Generator using interleaving is significantly faster on this architecture than code using unrolling alone, since the number of cycles, instructions and stalls is nearly uniformly lower in the interleaved variants.

The next optimization we will look at is prefetching some of the plaintext source data to be encrypted into the caches. For these graphs, we are looking only at the prefetching variants with five local variables and an interleaving factor of ten (the generator itself would run through all the possible combinations, not just the variants of that one).

Below is a graph (Figure 12) of the effect of prefetching to cache (the level two cache) in terms of cycles.

Figure 12: Effect on runtime in terms of processor cycles of varying the source data line and number of iterations ahead to prefetch when prefetching to L2 cache (on an Intel Core 2 Quad)

From Figure 12 above, you can see that it is possible to shave a couple of thousand cycles off with this technique, but there is little in the way of a general rule. Multiple runs show a distinct level of consistency in these results, however, indicating that software prefetching can beat the hardware prefetcher. Hence, this is a case where a code generator is perfectly suited, as it can generate and test all the permutations. Figure 13 below shows the effect on the L2 cache itself in terms of the misses that resulted.
Figure 13: Effect on L2 Cache Misses of varying the source data line and number of iterations ahead to prefetch when prefetching to L2 cache (on an Intel Core 2 Quad)

Figure 14: Effect on runtime in terms of processor cycles of varying the source data line and number of iterations ahead to prefetch when prefetching to register (on an Intel Core 2 Quad)

Figure 15: Effect on L1 Cache Misses of varying the source data line and number of iterations ahead to prefetch when prefetching to register (on an Intel Core 2 Quad)

Figure 14 shows the effect of preloading to register (effectively, to the level 1 cache) in terms of cycles, while Figure 15 shows the effect on the L1 cache itself in terms of the misses that resulted. Although a little more predictable, it again shows that a generator is in its element in this kind of search for the most efficient combination of optimizations, especially when dealing with new, unseen architectures such as those that will support AES in hardware from 2010 onwards.

To conclude, this section has looked at the effect of various optimizations that the generator can easily generate, test, measure and compare. What has not been looked at yet is the effect of using multiple threads to perform the encryption in parallel; that is the subject of the next section.

Parallel

In the previous section, we looked at running optimizations on the sequential version of the encryption code loop.
However, the AES Code Generator generates all these optimizations for the parallel, OpenMP version of the code just as easily.

Figure 16 below shows the effect of unrolling and of the number of round keys in local variables with the OpenMP variant.

Figure 16: Effect of varying levels of Unwinding and numbers of round keys in local variables when multiple threads are used (on an Intel Core 2 Quad)

As you can see from the figure, the effect of the unwinding, although not as pronounced as in the sequential version, is still significant. One could postulate that this is because the slower sequential version puts less pressure on the shared memory bus.

Figure 17 below shows the effect of interleaving and of the number of round keys in local variables with the OpenMP variant.

Figure 17: Effect of varying levels of Interleaving and numbers of round keys in local variables when multiple threads are used (on an Intel Core 2 Quad)

Again, the interleaved version is clearly faster than the unrolled version alone and, comparing it to the interleaved version without OpenMP (sequential) from the previous section, the OpenMP (parallel) version is close but still distinctly faster. On a machine with multiple processors rather than multiple cores, it is entirely possible that the use of multiple threads in the OpenMP version would yield a greater speedup, as the memory bus to/from a single processor could well be a bottleneck here.

To conclude this section, we have shown that the generator can apply the techniques used to optimize sequential versions equally to parallel versions. It should be noted that, although the sequential and parallel versions have been separated here, the generator considers the OpenMP variant just another variant to be tried in combination with all the other variants it can permute.
As such, the generator concludes with a specific recommendation, as can be seen below.

Tail of the AES Code Generator output:

Winner is: output-omp-ctr-L5-UI-14LocalVariables-Interleaved11.cpp

Intel Core 2 Duo 2.16GHz

In dealing with the final two architectures, we will focus on the most interesting areas rather than going as in-depth as previously, in order to avoid repetition. As such, we will concentrate on the parallel versions of the unrolled and interleaved outer loop and the effect of different combinations of local variables used for round keys and levels of unwinding or interleaving respectively.

Sequential

Tail of the AES Code Generator output:

Winner is: output-ctr-L5-UI-01LocalVariables-Interleaved10.cpp

On this architecture, as you can see above, the fastest sequential version was an interleaved version which explicitly put only a single round key into a local variable. The most likely reason for the low number of local variables used for round keys is that the compiler performed much the same optimization itself, or that the hardware prefetcher performed better than the generator's software efforts.

Parallel

Figure 18: Effect of varying levels of Unwinding and numbers of round keys in local variables when multiple threads are used (on an Intel Core 2 Duo)

The graph above shows the effect of unrolling the outer loop, while the graph below shows the effect of interleaving it. As you can see, the unrolled versions are significantly quicker on average, while the effect of the interleaving is more predictable but with a larger possible range.
Figure 19: Effect of varying levels of Interleaving and numbers of round keys in local variables when multiple threads are used (on an Intel Core 2 Duo)

Tail of the AES Code Generator output:

Winner is: output-omp-ctr-L5-UI-12LocalVariables-Interleaved03.cpp

As you can see from the generator output above, though, an interleaved version is the fastest overall.

Intel Pentium 4 Dual Processor

Sequential

Tail of the AES Code Generator output:

Winner is: output-ctr-L5-UI-05LocalVariables-Interleaved14.cpp

On this architecture, as you can see above, the fastest sequential version was an interleaved version which explicitly put only five round keys into local variables. The interleaving could be expected from the previous sections, and the relatively low use of local variables could be due to the lower number of registers available, as this is a 32-bit machine whereas the others were 64-bit machines.

Parallel

Figure 20: Effect of varying levels of Unwinding and numbers of round keys in local variables when multiple threads are used (on an Intel Pentium 4 Dual Processor)

Figure 21: Effect of varying levels of Interleaving and numbers of round keys in local variables when multiple threads are used (on an Intel Pentium 4 Dual Processor)

Tail of the AES Code Generator output:

Winner is: output-omp-ctr-L5-UI-00LocalVariables-Interleaved07.cpp

As you can see, the fastest version was one which eschewed the use of local variables for round keys entirely. This could be because it is a 32-bit machine with fewer registers available, so the compiler and hardware were able to do the best job of managing them.

5. Conclusions

The AES Code Generator is a valuable means of finding the most efficient and optimized AES code for any given architecture. In general, for the architectures surveyed here, the interleaved, parallel variants seem to be the most efficient.
This is most likely due to the use of all the available cores through multiple threads, and to the reduction in loads from cache to register brought about by interleaving the outer loop of the encryption code. The number of local variables used for storing round keys was consistently low on the 32-bit architecture and moderate on the 64-bit architectures, reflecting the greater number of registers available in 64-bit mode.

To conclude, this report will outline the contributions this project has made to the current state of the art and briefly discuss future work that could build upon the progress achieved here.

Contributions

This project has made a number of contributions to the state of the art.

Firstly, it is one of the very first projects to deal with the new AES instruction set that will appear in the next generation of Intel processors.

Secondly, it provides a proof of concept for using a code generator to produce the various optimized variants of the standard AES encryption loop.

Thirdly, it extends the idea of self-tuning generators to the area of encryption code generation.

Future Work

Future work will no doubt focus on optimizations taking advantage of the new Single Instruction Multiple Data (SIMD) instructions to be introduced in the next generation of Intel's processors. Once these instructions enable fast and secure AES encryption and decryption at the hardware level, future AES code generators will doubtless use them directly, and the efforts taken here to emulate the instructions, estimating the effect of optimizations by using older SIMD instructions of similar latency, will be a thing of the past.

A minor extension of this work would consider AES-128 and/or AES-192 instead of the AES-256 dealt with here.
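For reference, the key size affects only the number of rounds the cipher performs, and these counts are fixed by the AES standard (FIPS-197). A small helper, purely illustrative, captures the mapping:

```c
/* AES round counts per FIPS-197: the key size determines only the
 * number of rounds (and hence round keys); the block is always
 * 128 bits regardless of key size. */
static int aes_rounds(int key_bits)
{
    switch (key_bits) {
    case 128: return 10;
    case 192: return 12;
    case 256: return 14;  /* the variant used throughout this project */
    default:  return -1;  /* not a valid AES key size */
    }
}
```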
However, given that AES is a block cipher that always deals with 128-bit blocks, the only differences are the number of encryption rounds and the corresponding number of round keys. As such, the effect of the optimizations would simply be smaller with AES-128 or AES-192.

Another minor extension of this work would investigate AES on a machine with an extremely large number of processors, comparing the performance of pthreads against OpenMP.

References

Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo, "SPIRAL: Code Generation for DSP Transforms", Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", Vol. 93, No. 2, 2005, pp. 232-275.

Matteo Frigo and Steven G. Johnson, "The Design and Implementation of FFTW3", Proceedings of the IEEE 93 (2), pp. 216-231, 2005. Invited paper, Special Issue on Program Generation, Optimization, and Platform Adaptation.

Joan Daemen and Vincent Rijmen, "The Block Cipher Rijndael", Proceedings of the International Conference on Smart Card Research and Applications, pp. 277-284, September 14-16, 1998.

National Institute of Standards and Technology, "Announcing the Advanced Encryption Standard (AES)", Federal Information Processing Standards Publication 197, 2001.

Michael Flynn, "Some Computer Organizations and Their Effectiveness", IEEE Trans. Comput., Vol. C-21, pp. 948-960, 1972.

Aart J.C. Bik, "Vectorization with the Intel Compilers", Intel, 2008.

R.M. Ramanathan, "Extending the World's Most Popular Architecture", Intel, 2006.

Nadeem Firasta, "Intel AVX: New Frontiers in Performance Improvements and Energy Efficiency", Intel, 2008.

Barbara Chapman, "Using OpenMP: Portable Shared Memory Parallel Programming", The MIT Press, 2007.

Unknown, "Compiler optimization", http://en.wikipedia.org/wiki/Compiler_optimization (last accessed 3rd April 2009).

Ilya O. Levin, "A byte-oriented AES-256 implementation", http://www.literatecode.com/2007/11/11/aes256/ (last accessed 4th April 2009).

Shay Gueron, Intel Mobility Group Israel, "AES Instructions Set White Paper", Intel, July 2008.

Intel, "Intel 64 and IA-32 Architectures Optimization Reference Manual", November 2007.

P. Mucci, "PapiEx - Execute arbitrary application and measure hardware performance counters with PAPI", http://icl.cs.utk.edu/~mucci/papiex/ (last accessed 10th April 2009).

P. Mucci et al., "A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters", Proceedings of Supercomputing 2000, 2000.

P. Mucci and N. Tallent, "Monitor - user callbacks for library, process and thread initialization/creation/destruction", 2004.

Mark Charney, "Intel Software Development Emulator", Intel, http://www.intel.com/software/sde/ (last accessed 14th April 2009).

Robert Konighofer, "A Fast and Cache-Timing Resistant Implementation of the AES", Proceedings of the Cryptographers' Track at RSA Conference 2008, 2008.

Intel, "Intel 64 and IA-32 Architectures Software Developer's Manual", November 2007.

Guido Bertoni et al., "Efficient Software Implementation of AES on 32-Bit Platforms", Proceedings of Cryptographic Hardware and Embedded Systems 2002, pp. 159-171, 2002.
