Fast Implementations of AES on Various Platforms

Joppe W. Bos¹, Dag Arne Osvik¹, and Deian Stefan²

¹ EPFL IC IIF LACAL, Station 14, CH-1015 Lausanne, Switzerland
{joppe.bos, dagarne.osvik}@epfl.ch
² Dept. of Electrical Engineering, The Cooper Union, NY 10003, New York, USA
stefan@cooper.edu
Abstract. This paper presents new software speed records for encryption and decryption using the block cipher AES-128 for different architectures. Target platforms are 8-bit AVR microcontrollers, NVIDIA graphics processing units (GPUs) and the Cell broadband engine. The new AVR implementation requires 124.6 and 181.3 cycles per byte for encryption and decryption, with a code size of less than two kilobytes. Compared to the previous AVR records for encryption, our code is 38 percent smaller and 1.24 times faster. The byte-sliced implementation for the synergistic processing elements of the Cell architecture achieves speeds of 11.7 and 14.4 cycles per byte for encryption and decryption. Similarly, our fastest GPU implementation, running on the GTX 295 and handling many input streams in parallel, delivers throughputs of 0.17 and 0.19 cycles per byte for encryption and decryption respectively. Furthermore, this is the first AES implementation for the GPU which implements both encryption and decryption.

Key words: AES, AVR, Cell, GPU
1 Introduction

In 2001, as the outcome of a public competition, Rijndael was announced as the Advanced Encryption Standard (AES) by the US National Institute of Standards and Technology (NIST). Since then, it has become one of the most extensively used encryption primitives for various applications. Besides its well-regarded security properties³, AES is extremely efficient on many different platforms, ranging from 8-bit microcontrollers to 64-bit processors to FPGAs; this efficiency was a crucial metric in making Rijndael an encryption standard. Much work on efficient and secure implementation of AES has been done, including the evaluation of its performance on different architectures; this work further evaluates such efficient AES implementations.

* This work was done when the third author was visiting EPFL.
³ The only attack on the full AES is applicable in the related-key scenario to the 192-bit [5] and 256-bit key versions [5,6].
We present new, high-speed and small code-size software implementations of AES for 8-bit AVR microcontrollers, the Cell Broadband Engine architecture (Cell) and NVIDIA Graphics Processing Units (GPUs). To the best of our knowledge, our results set new records on these platforms.

It is expected that the use of lightweight devices, i.e., low-end smart cards and radio frequency identification (RFID) tags, in electronic commerce and identification will grow rapidly in the near future. The passive RFID tag market is expected to reach up to US$ 486M by 2013 [10], and AES has already attracted significant attention due to its capabilities for such devices. This work further investigates the performance of AES on such devices; specifically, 8-bit microcontrollers.

The other target platforms, the Cell and the GPU, are chosen because of their ability to process many streams simultaneously, using single instruction, multiple data (SIMD) and single instruction, multiple threads (SIMT) techniques respectively. Due to the low prices and wide availability of these devices, it is interesting to evaluate their performance as cryptologic accelerators.

The paper is organized as follows. Section 2 briefly recalls the design of AES. In Section 3 our target platforms are described. Section 4 describes the techniques used and decisions made when porting AES to the different architectures. In Section 5 we present our results and compare them to other results reported in the literature. Finally, we conclude in Section 6.
2 A Brief Introduction to AES

AES is a fixed-block-length version of the Rijndael block cipher [8,14], with support for 128-, 192-, and 256-bit keys. The cipher operates on an internal state of 128 bits, which is initially set to the plaintext block and, after transformations, becomes the output ciphertext block. The state is organized as a 4 × 4 array of 8-bit bytes, which is transformed according to a round function Nr times. The number of rounds is Nr = 10 for 128-bit keys, Nr = 12 for 192-bit keys, and Nr = 14 for 256-bit keys. In order to encrypt, the state is first initialized, then the first 128 bits of the key are xored into the state, after which the state is modified Nr − 1 times according to the round function, followed by a slightly different final round.
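This encryption flow can be sketched in C. The function and type names below (aes_encrypt_block, state_t, NR) are our own illustration, not the paper's code, and the transformation functions are empty placeholders: the sketch shows only the round structure and key addition, not a working cipher.

```c
#include <stdint.h>

#define NR 10 /* number of rounds for AES-128 */

typedef uint8_t state_t[16]; /* the 4 x 4 byte state */

/* Placeholder round transformations; real definitions follow FIPS-197. */
static void sub_bytes(state_t s)   { (void)s; }
static void shift_rows(state_t s)  { (void)s; }
static void mix_columns(state_t s) { (void)s; }

/* AddRoundKey: xor a 16-byte round key into the state. */
static void add_round_key(state_t s, const uint8_t *rk) {
    for (int i = 0; i < 16; i++) s[i] ^= rk[i];
}

/* rks holds the NR + 1 expanded round keys, 16 bytes each. */
void aes_encrypt_block(state_t s, const uint8_t rks[(NR + 1) * 16]) {
    add_round_key(s, rks);               /* initial key addition     */
    for (int r = 1; r < NR; r++) {       /* NR - 1 full rounds       */
        sub_bytes(s);
        shift_rows(s);
        mix_columns(s);
        add_round_key(s, rks + 16 * r);
    }
    sub_bytes(s);                        /* final round omits        */
    shift_rows(s);                       /* MixColumns               */
    add_round_key(s, rks + 16 * NR);
}
```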
The round function consists of four steps: SubBytes, ShiftRows, MixColumns and AddRoundKey (except for the final round, which omits the MixColumns step). Each step operates on the state, at each round r, as follows:

1. SubBytes: substitutes every entry (byte) of the state with an S-box entry,
2. ShiftRows: cyclically left-shifts every row i of the state matrix by i, 0 ≤ i ≤ 3,
3. MixColumns: multiplies each column of the state, taken as a polynomial of degree below 4 with coefficients in F_{2^8}, by a fixed polynomial modulo x^4 + 1,
4. AddRoundKey: xors the r-th round key into the state.

Each transformation has an inverse, from which decryption follows in a straightforward way by reversing the steps in each round: AddRoundKey (inverse of itself), InvMixColumns, InvShiftRows, and InvSubBytes.
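As a concrete illustration of step 3, MixColumns on a single column reduces to xors and a doubling helper (xtime, multiplication by x in F_{2^8}). This is a standard textbook formulation rather than the paper's own code; the fixed polynomial {03}x^3 + {01}x^2 + {01}x + {02} is applied using the identity 3a = 2a xor a.

```c
#include <stdint.h>

/* Multiply by x ({02}) in F_2^8 modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* MixColumns on one 4-byte column: multiply the column, viewed as a
   polynomial over F_2^8, by {03}x^3 + {01}x^2 + {01}x + {02} mod x^4 + 1. */
void mix_single_column(uint8_t c[4]) {
    uint8_t a0 = c[0], a1 = c[1], a2 = c[2], a3 = c[3];
    uint8_t t = a0 ^ a1 ^ a2 ^ a3;
    /* b_i = a_i ^ t ^ 2*(a_i ^ a_{i+1})  ==  2a_i ^ 3a_{i+1} ^ a_{i+2} ^ a_{i+3} */
    c[0] ^= t ^ xtime(a0 ^ a1);
    c[1] ^= t ^ xtime(a1 ^ a2);
    c[2] ^= t ^ xtime(a2 ^ a3);
    c[3] ^= t ^ xtime(a3 ^ a0);
}
```

Checking against the MixColumns example column from the FIPS-197 test vectors, (d4, bf, 5d, 30) maps to (04, 66, 81, e5).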
The key expansion into the Nr 128-bit round keys is accomplished using a key scheduling algorithm, the details of which can be found in [14] and [8]. The design of the key schedule allows the full expansion to precede the round transformations, which is advantageous if multiple blocks are encrypted using the same key, while also providing the option of on-the-fly key generation, a method that proves useful in memory-constrained environments such as microcontrollers.

For 32-bit (and greater word length) processors, Daemen and Rijmen detail in [8] a fast implementation method that combines the SubBytes, ShiftRows, and MixColumns transformations into four 256-entry look-up tables (each entry is 4 bytes), T_i, 0 ≤ i ≤ 3. Following [8], the "T-table" approach reduces the round transformations to updating the j-th column according to:

$$(s_{0,j}, s_{1,j}, s_{2,j}, s_{3,j})^{T} = \bigoplus_{i=0}^{3} T_i[s_{i,\,j+C_i}], \quad \text{for } 0 \le j \le 3, \tag{1}$$

where s_{j,k} is the byte in the j-th row and k-th column of the state, and C_i is a constant, equivalently doing the ShiftRows in-place. After the columns are updated, the remaining transformation is AddRoundKey (which is a single 4-byte look-up and xor per column).
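One common way to realize the tables in equation (1) is to precompute them from the S-box. The sketch below is our illustration, not the paper's code: the word byte order and rotation direction are implementation choices, and the S-box is passed in as a parameter. T0[x] packs the MixColumns coefficients ({02}S[x], S[x], S[x], {03}S[x]) into one 32-bit word, and T1..T3 are byte rotations of T0, matching the rotated columns the four table look-ups contribute.

```c
#include <stdint.h>

/* Multiply by x ({02}) in F_2^8 modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
}

/* Build the four 256-entry T-tables from a given S-box.  T0[x] holds the
   column (02*S[x], S[x], S[x], 03*S[x]) in big-endian byte order; T1..T3
   are one-byte right rotations of T0. */
void build_t_tables(const uint8_t sbox[256], uint32_t T[4][256]) {
    for (int x = 0; x < 256; x++) {
        uint8_t s  = sbox[x];
        uint8_t s2 = xtime(s);            /* {02} * S[x] */
        uint8_t s3 = (uint8_t)(s2 ^ s);   /* {03} * S[x] */
        uint32_t t = ((uint32_t)s2 << 24) | ((uint32_t)s << 16)
                   | ((uint32_t)s  << 8)  |  (uint32_t)s3;
        T[0][x] = t;
        T[1][x] = (t >> 8)  | (t << 24);  /* rotate right one byte  */
        T[2][x] = (t >> 16) | (t << 16);  /* rotate right two bytes */
        T[3][x] = (t >> 24) | (t << 8);   /* rotate right three     */
    }
}
```

The four tables cost 4 KB of memory; implementations short on storage can keep only T0 and rotate at run time, trading speed for space.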
3 Target Platforms

We target AES for low-end lightweight devices and evaluate its performance on an 8-bit AVR, both in terms of speed and code size; independently, we also target the many-core, high-performing Cell and GPU platforms. Below, we introduce the three target platforms and discuss their overall design and execution models.
3.1 8-bit AVR microcontroller

Advanced Virtual RISC (AVR) is a family of 8-bit microcontrollers designed by Atmel, targeting low-power embedded systems. Although a lightweight microcontroller, the AVR has 32 8-bit registers, a large number of instructions (125 for the AT90USB82/162), between 512B and 384KB of in-system programmable (ISP) flash, 0 to 4KB of EEPROM, 0 to 32KB of SRAM, timers, counters, a USART, SPI, and many other features and peripherals that make it a favorable platform for embedded development [2].

The AVR CPU is a modified Harvard architecture (program and data memories are separate) with a two-stage, single-level pipeline supporting instruction pre-fetching. The (memory) parallelism and pipelining greatly improve the microcontroller's performance. Moreover, the majority of AVR instructions occupy only a single 16-bit word and execute with a single-cycle latency; only a small number of instructions require two or four cycles to complete. Features like free pre-decrement and post-increment of pointer registers also contribute towards small and efficient program code.
The data memory consists of the register file, I/O memory and SRAM, and as such the various direct and indirect addressing modes (through the 16-bit pointer registers X, Y, Z) can be used to access not only the data memory but also the 32 registers. The ability to access the register file as memory, in addition to the optimized direct register access, provides a designer with additional flexibility in optimizing an application implementation. We note, however, that although direct addressing can access the whole data space, indirect addressing (with displacement) is limited to 63 address locations from only one of the pointer registers [2], and this restriction may require the implementer to use techniques such as double-jumping. Another limitation is that only the Z register may be used for addressing flash memory, e.g., for AES S-box lookups, and in some AVR devices this is not possible at all. Additionally, the flash memory is relatively small, and because in practical applications it is of little interest to dedicate the whole flash to a cryptographic primitive, it is critical that the code size of AES remain small.

Most AVRs are clocked between 0 and 20 MHz, with some of the higher-end ones reaching 32 MHz; with the ability to execute one instruction per cycle, these embedded microcontrollers can achieve throughputs of up to 20 MIPS (16 MIPS for the AT90USB162). Thus, given the relatively high computational power of these low-cost and low-power devices, the performance of block ciphers such as AES is of practical consideration for applications requiring cryptographic primitives (e.g., automobile electronic keys).
3.2 The Cell Broadband Engine
The Cell architecture [12], jointly developed by Sony, Toshiba, and IBM, is
equipped with one dual-threaded, 64-bit in-order “Power Processing Element”
(PPE), which can offload work to the eight “Synergistic Processing Elements”
(SPEs) [24]. The SPEs are the workhorses of the Cell processor. Each consists
of a Synergistic Processing Unit (SPU), 256 kilobyte of private memory called
Local Store (LS) and a Memory Flow Controller (MFC). The latter handles
communication between each SPE and the rest of the machine, including main
memory, as explicitly requested by programs. All code and data must fit within
the LS if one wants to avoid the complexity of sending explicit DMA (Direct
Memory Access) requests to the MFC.
Most SPU instructions are 128-bit wide SIMD operations performing sixteen
8-bit, eight 16-bit, four 32-bit, or two 64-bit computations in parallel. Each SPU
is also equipped with a large register file containing 128 registers of 128 bits
each, providing space for unrolling and software pipelining of loops, hiding the
relatively long latencies of its instructions. Unlike the processor in the PPE, the
SPUs are asymmetric processors, having two pipelines (denoted by the odd and
the even pipeline) which are designed to execute two disjoint sets of instructions.
In the ideal case, two instructions can be dispatched per cycle.
The SPUs are in-order processors with no hardware branch-prediction. The
programmer (or compiler) must instead tell the instruction fetch unit in advance
where a (single) branch instruction will jump to. Hence, for most code with in-
frequent jumps and where the target of each branch can be computed sufficiently
early, perfect branch prediction is possible.
One of the first applications of the Cell processor was to serve as the heart of
Sony’s PlayStation 3 (PS3) video game console. The Cell contains eight SPEs,
but in the PS3 one of them is disabled, allowing improved yield in the manufac-
turing process as any chip with a single faulty SPE can still be used. One of the
remaining SPEs is reserved by Sony’s hypervisor, a software layer providing a
virtual machine environment for running e.g., Linux. In the end we have access
to six SPEs when running Linux on (the virtual machine on) the PS3. Fortu-
nately, the virtualization does not slow down programs running on the SPUs, as
they are naturally isolated and protection mechanisms only need to deal with
requests sent to the MFC.
Besides being used in the PS3, the Cell (or parts of it) has been placed on a
PCI-Express card so that it can serve as an arithmetic accelerator. Such cards are
available with a complete Cell (by Mercury), one PPE and eight SPEs running at
2.8GHz, equipped with 5 GB of memory or with four SPEs (by Leadtek) running
at 1.5GHz with access to 128 MB of memory. The Cell has also established
itself in the high-performance market by occupying the first place, with the
Roadrunner supercomputer, in the top 500 supercomputing list [9].
3.3 GPU Using CUDA
Similar to the PS3, Graphic Processing Units (GPUs) have mainly been game-
and video-centric devices. Moreover, due to the increasing computational re-
quirements of graphics-processing applications, GPUs have become very power-
ful parallel processors and thus incited research interest in computing outside the
graphics-community. Until recently, however, programming GPUs was limited to
graphics libraries such as OpenGL [22] and Direct3D [7], and for many appli-
cations, especially those based on integer arithmetic, the performance improvements
over CPUs were minimal or even negative. The release of NVIDIA's G80
series and ATI’s HD2000 series GPUs, both of which implemented the unified
shader architecture, along with the companies’ release of higher-level languages
Compute Unified Device Architecture (CUDA), Close to the Metal (CTM) [19]
and the more recent Open Computing Language (OpenCL) [?], however, fa-
cilitate the development of massively-parallel general purpose applications for
GPUs [16,1]. These general purpose GPUs (GPGPUs) have become a common
target for numerically-intensive applications given their ease of programming
(compared to previous generation GPUs), and ability to outperform CPUs in
data-parallel applications, commonly by orders of magnitude. In this paper we
focus on NVIDIA's GPU architecture with CUDA; programming ATI GPUs
with OpenCL is part of our ongoing work, and results will be reported in the
final version of this paper.
In addition to the common floating point processing capabilities of previous-
generation GPUs, starting with the G80 series, NVIDIA’s GPU architecture
added support for integer arithmetic, including 32-bit addition/subtraction and
bit-wise operations, scatter/gather memory access and various memory spaces
(global, texture, constant and shared) [15,16]. Each GPU contains between 10
and 30 streaming multiprocessors (SMs), each equipped with eight scalar
processor (SP) cores, fast 16-way banked on-chip shared memory (16 KB/SM), a
multithreaded instruction unit, a large register file (8192 registers for G80-based
GPUs, 16384 for the newer GT200 series), read-only caches for constant (8 KB/SM)
and texture memory (varying between 6 and 8 KB/SM), and two special
function units for transcendentals. We refer to [16] for further details.
CUDA is an extension of the C language built around a massively parallel
programming model, single-instruction multiple-thread (SIMT). SIMT differs from
SIMD in that the underlying vector size is hidden and the programmer is re-
stricted to writing scalar code that is parallel at the thread level. CUDA extends
the C language to support SIMT by allowing the programmer to define kernel
functions, which are compiled for and executed on the SPs of each SM, in paral-
lel: each light-weight thread executes the same code, operating on different data.
A maximum of 512 threads can be grouped into a thread block which is sched-
uled on a single SM, the threads of which time-share the SPs. This additional
hierarchy provides for threads within the same block to communicate using the
on-chip shared memory and synchronize their execution using barriers (specifically,
the __syncthreads() intrinsic blocks thread execution until all threads
within the block have reached the synchronization point). Moreover, multiple
thread blocks can be executed simultaneously on the GPU, a maximum of eight
thread blocks can be scheduled per SM and in order to hide instruction and
memory (among other) latencies, it is recommended that a minimum of two
blocks be available for scheduling on each SM.
4 Porting AES
When porting AES to our target platforms, different implementation decisions
have to be made. These decisions are influenced by the features, or restrictions,
of the instruction sets and the available memory.
First, an optimized 8-bit implementation of AES for the AVR was designed.
This version was implemented in AVR assembler, with each of the transforma-
tions independently optimized to minimize register usage and execution cycles.
Furthermore, this 8-bit version of AES was used as a framework to create a
byte-sliced⁴ implementation on the SIMD architecture of the SPE. Hence, 16
instances of AES are processed in parallel per SPE.
Unlike the Cell and AVR, the GPU does not directly benefit from the byte-sliced
framework; instead, the T-table approach is used (see Equation (1)).
We observe that this approach requires 4 look-ups and 3 xors per column; thus, a
total of 16 look-ups and 12 xors per round. Moreover, for applications targeting
GPUs older than the GT200 series, a single T_i table can be used with additional
rotations, as the tables are simply byte-rotations of each other.
⁴ The notion of bitslicing was introduced by Biham in [4].
Algorithm 1 Pseudo-code for the Cell architecture to fetch 16 S-box constants
simultaneously in a SIMD fashion. All variables are 128-bit wide registers;
variable in contains 16 bytes which are used as indices to look up substitution
values in the S-box, stored as 16 × 16 bytes; the 16-byte rows are denoted by S_i,
0 ≤ i ≤ 15.
Input: in
Output: out
r := and(in, 0x1F);
a0 := shuffle(S_0, S_1, r);
a1 := shuffle(S_2, S_3, r);
a2 := shuffle(S_4, S_5, r);
a3 := shuffle(S_6, S_7, r);
a4 := shuffle(S_8, S_9, r);
a5 := shuffle(S_10, S_11, r);
a6 := shuffle(S_12, S_13, r);
a7 := shuffle(S_14, S_15, r);
sel := cmpgt(in, 0x7F);
a0 := select(a0, a4, sel);
a1 := select(a1, a5, sel);
a2 := select(a2, a6, sel);
a3 := select(a3, a7, sel);
r := shift_left(in, 1);
sel := cmpgt(r, 0x7F);
a0 := select(a0, a2, sel);
a1 := select(a1, a3, sel);
r := shift_left(r, 1);
sel := cmpgt(r, 0x7F);
out := select(a0, a1, sel);
4.1 Optimizing Code for the Cell
In this work we use the naming convention of the Cell when discussing the
implementation for the SPE. That is, a word consists of 32 bits and a 128-
bit register on the SPE is called either a vector or quadword. The SPE has a
rich instruction set which operates simultaneously on single-, double- or
quadwords. All distinct binary operations f : {0,1}² → {0,1} are available; other
instructions of particular interest are shuffle and select. The shuffle
instruction can rearrange 16 bytes of the 32-byte input (two quadwords) as
specified by a pattern, or insert one of the constants {0x00, 0xFF, 0x80} into the
output quadword. The select instruction acts as a multiplexer; depending on
the input pattern the corresponding bit from either the first or the second input
quadword is selected as output.
The 16-way SIMD capabilities of the SPE, working on 16 bytes simultane-
ously, are used to create a byte-sliced implementation of AES. In an optimistic
scenario one expects to achieve a 16-fold speedup compared to machines which
natively have a wordsize of one byte and a comparable instruction set. For many
of the operations required by AES, e.g., bitwise operations, this speedup holds
on the Cell with the additional advantage that the Cell has a more powerful
instruction set when compared to other architectures.
Some operations, however, are not so trivial to perform in SIMD. A prime
example is the S-box lookup. Typically this can be done in a few instructions,
depending on the index, by calculating the address to load from and then per-
forming the load. Doing this for 16 values in parallel in a straight-forward (naive)
way can become a bottleneck. In [17] some techniques are briefly described on
how to do this efficiently on the SPE architecture. The idea is to use the five least
significant bits as input to eight different table lookups – one for each possible
value of the three most significant bits. Then we extract those three bits one by
one, and for each of them we build a value we use to select one or the other half
of the outputs from those eight table lookups. In the end we have the correct
output for the full 8-bit input, performed 16 times in parallel. See Algorithm 1
for the SPE pseudo-code.
4.2 Optimizing Code for the GPU
As the instruction set of the GPU is substantially less rich than that of the Cell,
when optimizing an implementation using CUDA, it is essential to be able to
execute many threads concurrently and thereby maximally utilize the device.
Hence, our GPU implementation processes thousands of streams. Moreover, we
also consider implementations with on-the-fly key scheduling, key-expansion in
texture memory, key expansion in shared memory, and variants of the first two
with storage of the T-tables and S-box (or inverse S-box) in shared memory.
To maximize the throughput between the device and host, our implementations
use page-locked memory with concurrent memory copies and kernel execution
on the GT200 GPUs; the older G80 series GPUs do not support these features.
Similar to the implementation of [13], for the first three variants we lay out
the tables in constant memory. Because the constant memory cache size is 8KB,
the full tables can be cached with no concern for cache misses. Although this
approach has the advantage of simplicity, unless all the threads of a half-warp
(16 threads executing concurrently) access the same memory location, the ac-
cesses must be serialized and thus the gain in parallelism is severely degraded.
Nonetheless, we improve on [13] by including on-the-fly key scheduling as well
as key-expansion variants in texture and shared memory. The AES implementation
of [13] assumed the availability of the expanded key in global memory, which
is of practical interest for single-stream cryptographic applications; for multi-
stream cryptographic and cryptanalytic applications, however, key scheduling is
critical as deriving many keys on the CPU is inefficient.
For the on-the-fly key scheduling variants, each thread is independent and the
designs are thus ideal for key search applications or multi-stream applications
with many thousands of streams. For many applications, however, having on the
order of a few hundred to a few thousand streams is sufficient and thus further
speedup can be achieved by doing the key expansion in texture or shared memory.
Texture memory, unlike constant memory, allows for multiple non-broadcast
accesses and, like constant memory, has the advantage that it can be written
once and have the expanded keys live across multiple kernel launches, making
its use very advantageous in the encryption and decryption of multiple blocks.
Regarding shared memory access, we note that although a single instruction
is issued per warp, the warp execution is split into two half-warps and if no
threads in the half-warp access the same shared memory location, i.e., there is
no bank conflict, 16 different reads/writes can be completed in two cycles (per
SM). This further increases the throughput of AES using precomputed round
keys. For the key expansion into shared memory we create 16 stream-groups per
block, each group consisting of multiple threads that share a common expanded
key. Although a bank conflict on the GT200 series results in serializing only the
conflicting accesses, as opposed to the G80's serialization of all the threads in
the half-warp, we carefully implemented the shared memory access to avoid
any bank conflicts. Furthermore, although the number of groups per block can
be increased, we instead increased the number of blocks to allow for the hid-
ing of block-dependent latencies. Of course, for the key expansion into texture
and shared memory no additional gain is achieved unless multiple blocks are
encrypted.
As previously mentioned, the throughput of constant memory for random
access is quite low when compared to shared and texture memory and so we
further optimize AES by placing the T-tables and S-box in shared memory. To
avoid bank conflicts one T_i (of size 1 KB) must be stored in each bank; this, of
course, is not directly possible because kernel arguments are also usually placed
in shared memory; furthermore, if most of the table (save a few entries), as
in [11], is placed in shared memory, the maximum number of blocks assigned to
that SM would be limited to one. Thus, the overall gain would not be much higher
than using constant memory. The authors in [11] further propose a quad-table
approach in shared memory, but do not detail whether the design avoids bank
conflicts. Our shared memory table approach is a “lazy” one, simply
laying out the tables in-order. Because we are targeting the newer generation
GPUs, a bank conflict is resolved by serializing only the colliding accesses; thus,
although bank conflicts are expected (simulations show that roughly 35% of the
memory accesses are serialized, i.e., about 6 of the 16 in a half-warp), on average
the gain from using shared memory is much higher than from constant memory. Further optimizing the
T -table lookup method is part of our current ongoing work.
For on-the-fly key generation we buffer the first round key in shared memory,
from which the remaining round keys are derived. Adopting the method
of [8], during each round four S-box lookups and five xors are needed to de-
rive the new round key. Additional caching (e.g., the last round) can further
improve the performance of the implementation. The key scheduling for de-
cryption consists of running the encryption key scheduling algorithm and then
applying InvMixColumns to all, except the first and last, round keys; it is appar-
ent that on-the-fly key generation for decryption is considerably more complex.
For decryption we buffer the first round key, and (after running the encryption
key scheduler) the InvMixColumns of the final key. We derive all successive keys
from the second to last round key using six xors, a MixColumns transformation
(of part of the key), and a transformation combining InvMixColumns and
InvSubBytes per round.

Reference        Key scheduling  Encryption  Decryption  Code size  Notes
                                 (cycles)    (cycles)    (bytes)
[25] Fast        on-the-fly      1,259       1,259       1,708      Hardware ext.
[25] Compact     on-the-fly      1,442       1,443         840      cost: 1.1 kGates
[21]             precompute      3,766       4,558       3,410
[20] Fast        precompute      2,474       3,411       3,098      Key setup:
[20] Furious     precompute      2,739       3,579       1,570      Enc 756 cycles,
[20] Fantastic   precompute      4,059       4,675       1,482      Dec 4,977 cycles
[18]             precompute      2,555       6,764       2,070      Key setup:
[18]             precompute      2,555       3,193       2,580      2,039 cycles
New - Low RAM    precompute      2,153       2,901       1,912      Key setup: 789 cycles
New - Fast       precompute      1,993       2,901       1,912      Key setup: 747 cycles

Table 1. AES-128 implementation results on the 8-bit microcontroller AVR.
vantage of the T-tables, with the additional need for an S-box[S-box[·]] table
to further reduce memory pressure. This efficient on-the-fly key scheduling
is not GPU-specific, and can be applied to any other T-table based
implementation.
5 Results
All our implementations, for the three architectures, are designed to reduce the
number of clock cycles required for encryption and decryption. Table 1 states
AES-128 performance results obtained on the AVR microcontroller. Depending
on the AVR model used, the availability of RAM and flash memory varies.
The code size of the implementation, which needs to fit in the flash memory, is
presented in the table as well. We created two versions: a fast and a compact
version. The compact version stores only the expanded key (176 bytes) in RAM
and no additional tables. The faster version trades RAM usage for speed by placing the
S-box in RAM as well, increasing the required RAM by 256 bytes. Our results
are obtained by running our compact version on the AT90USB162 (16 MHz, 512
byte RAM and 16 kilobyte flash) and the fast version on the larger AT90USB646
(16 MHz, 4 kilobyte RAM and 64 kilobyte flash). Although a direct comparison
is not possible, for completeness, estimates of an AVR implementation using
hardware extensions are also shown in Table 1. Figure 1 graphically shows the
code size versus the required cycles for decryption and encryption of different
AVR implementations of AES-128.
Table 2 gives AES-128 performance results obtained when running on the
various GPUs and the SPE architecture. Not many benchmarking results of
AES on the SPE architecture are reported in the literature. We therefore
compare our results to the performance data given by IBM in [23].

Fig. 1. Code size versus cycle count for decryption and encryption of different AES-128
AVR implementations. [plot data omitted]

Reference              Algorithm  Architecture               Cycles/byte  Gb/sec
[13], 2007             Enc (P)    NVIDIA 8800 GTX, 1.35GHz   1.30           8.3
[26], 2007             Enc (F)    ATI HD 2900 XT, 750MHz     1.71           3.5
[11], 2008             Enc (P)    NVIDIA 8800 GTX, 1.35GHz   1.56           6.9
This article, T-smem   Enc (F)    NVIDIA 8800 GTX, 1.35GHz   0.94          11.5
This article, T-smem   Enc (T)    NVIDIA 8800 GTX, 1.35GHz   0.74          14.6
This article, T-smem   Dec (F)    NVIDIA 8800 GTX, 1.35GHz   1.43           7.6
This article, T-smem   Dec (T)    NVIDIA 8800 GTX, 1.35GHz   0.76          14.3
This article           Enc (F)    NVIDIA GTX 295, 1.24GHz    1.21           8.2
This article           Enc (T)    NVIDIA GTX 295, 1.24GHz    1.02           9.7
This article           Enc (S)    NVIDIA GTX 295, 1.24GHz    0.66          15.1
This article, T-smem   Enc (F)    NVIDIA GTX 295, 1.24GHz    0.21          47.1
This article, T-smem   Enc (T)    NVIDIA GTX 295, 1.24GHz    0.17          59.6
This article           Dec (F)    NVIDIA GTX 295, 1.24GHz    7.16           1.4
This article           Dec (T)    NVIDIA GTX 295, 1.24GHz    0.70          14.2
This article           Dec (S)    NVIDIA GTX 295, 1.24GHz    0.67          14.9
This article, T-smem   Dec (F)    NVIDIA GTX 295, 1.24GHz    0.34          29.3
This article, T-smem   Dec (T)    NVIDIA GTX 295, 1.24GHz    0.19          52.4
[23], 2005             Enc (P)    SPE, 3.2GHz                12.4           2.1
[23], 2005             Dec (P)    SPE, 3.2GHz                17.1           1.5
This article           Enc (P)    SPE, 3.2GHz                11.7           2.2
This article           Dec (P)    SPE, 3.2GHz                14.4           1.8

Table 2. Results of different AES-128 implementations on a single SPE and various GPU
architectures. (P) = key scheduling is pre-computed, (F) = key scheduling is on-the-fly,
(T) = key expansion in texture memory, (S) = key expansion in shared memory.

This single-stream high-performance implementation is not byte-sliced, as our implementation is, but
optimized for the SPE-architecture to take full advantage of the SIMD prop-
erties. Compared to the IBM implementation, our SPE implementation is 6%
and 16% faster for encryption and decryption, respectively. Our byte-sliced key
generation routine runs in 62 clock cycles per stream.
To give an indication of how well our C implementation of AES runs on the SPE,
we estimate a rough lower bound for encryption using the techniques employed
in our implementation. Every round the AddRoundKey, SubBytes, ShiftRows
and the MixColumns need to be computed four times. A single instance of the
first three steps can be computed with four loads and four xors and four times
the operations as outlined in Algorithm 1 to load the S-box constants. The
computation of a single instance of the MixColumns is done with 3 odd and
27 even instructions. Hence, nine full rounds and one final round (without the
MixColumns) plus post-processing (the so-called output whitening) require 1752
odd and 2764 even instructions. This is without taking the loading and storing of
the input and output into account. Assuming everything is perfectly scheduled,
dispatching one pair of an odd and even instruction per cycle (if available), the
encryption should run in 2764 cycles for sixteen streams in parallel. This gives
a lower bound of 2764/(16 · 16) = 10.8 cycles per byte for our encryption im-
plementation. The 11.7 cycles per byte experienced in practice is slightly higher
and can be explained by the fact that some instructions are imperfectly sched-
uled, thus causing stalls, and due to a possible imbalance in the pipelines which
effectively prevents the dispatching of a pair of odd and even instructions every
clock cycle.
There have been numerous implementations of AES on GPUs; we, however,
only compare against implementations on GPUs with support for integer arithmetic. Table 2
compares our GTX 295 and GeForce 8800 GTX implementations with those
in [13,26,11]. The results in the table include the memory transfer along with
the kernel execution, each stream encrypting 256 random blocks. We further note
that the GTX 295 contains two GPUs (clocked-down version of the GTX 280
GPU). We point out that the poor performance of decryption for the on-the-fly
variant with tables in constant memory (the simplest method) is due to cache misses;
decryption with on-the-fly key scheduling requires the T-tables for both encryption and
decryption in addition to the S-box and inverse S-box, which together exceed
8KB. We, of course, only present these results for comparison in evaluating
AES’s performance on the GPU using the various memory spaces available.
Since the GT200 series GPUs address many of the limitations of the G80
series GPUs, a direct comparison is not appropriate; nevertheless, we note that,
with the exception of our simplest implementation, our results outperform all
previous GPU implementations of AES. Additionally, although our implementations
target the GT200 GPUs, which in addition to the previously-mentioned advan-
tages over the G80 have more relaxed memory access pattern restrictions, for
completeness we benchmark our fastest implementations on the 8800 GTX. As
shown in Table 2, our fastest GTX 295 implementation is roughly 8 times faster
than [13], 10 times faster than [26], and 9.4 times faster than [11]. Similarly, our
fastest 8800 GTX implementation is 43%, 57% and 53% faster than [13], [26],
and [11], respectively. Moreover, unlike the implementations of [13] and [11],
our implementations also include key scheduling methods, without which the
throughput would increase by roughly 23% (see [26]). Compared to [11], we do
not limit our implementation to CTR, for which additional improvements can
be made [3]. When compared to the AES implementation in [26], our streams
encrypt different plaintext messages with different keys; tweaking our implemen-
tations for applications of key searching as in [26] would further speed up the
AES implementation by at least 35% as only one message block would be copied
to the device.
6 Conclusion
We presented new software speed records for encryption and decryption when
running AES-128 on the 8-bit AVR microcontroller, the Cell broadband engine
architecture and NVIDIA graphics processing units. To achieve these performance
records a byte-sliced implementation is employed for the first two architectures,
while a T-table approach is used for our GPU implementations.
The implementations targeting the Cell and GPU architectures process multi-
ple streams in parallel to obtain the results. Furthermore, this is the first AES
implementation for the GPU which implements both encryption and decryption.
References
1. AMD. ATI CTM Reference Guide. Technical Reference Manual, 2006.
2. Atmel Corporation. 8-bit AVR Microcontroller with 8/16K Bytes of ISP Flash
and USB Controller. Technical Reference Manual, 2008.
3. D. J. Bernstein and P. Schwabe. New AES software speed records. In Progress
in Cryptology - INDOCRYPT 2008, volume 5365 of Lecture Notes in Computer
Science, pages 322–336. Springer, 2008.
4. E. Biham. A Fast New DES Implementation in Software. In FSE 1997, volume
1267 of LNCS, pages 260–272, 1997.
5. A. Biryukov and D. Khovratovich. Related-key Cryptanalysis of the Full
AES-192 and AES-256. Cryptology ePrint Archive, Report 2009/317, 2009.
http://eprint.iacr.org/.
6. A. Biryukov, D. Khovratovich, and I. Nikolić. Distinguisher and Related-Key Attack
on the Full AES-256. In Crypto 2009, volume 5677 of LNCS, pages 231–249, 2009.
7. D. Blythe. The Direct3D 10 system. ACM Trans. Graph., 25(3):724–734, 2006.
8. J. Daemen and V. Rijmen. The design of Rijndael. Springer-Verlag New York,
Inc. Secaucus, NJ, USA, 2002.
9. J. Dongarra, H. Meuer, and E. Strohmaier. Top500 Supercomputer Sites.
http://www.top500.org/.
10. Frost & Sullivan. Asia Pacific's Final Wireless Growth Frontier.
http://www.infoworld.com/t/networking/passive-rfid-tag-market-hit-486m-in-2013-102.
11. O. Harrison and J. Waldron. Practical Symmetric Key Cryptography on Modern
Graphics Hardware. In USENIX Security Symposium, pages 195–210, 2008.
12. H. P. Hofstee. Power Efficient Processor Architecture and The Cell Processor. In
HPCA 2005, pages 258–262. IEEE Computer Society, 2005.
13. S. A. Manavski. CUDA Compatible GPU as an Efficient Hardware Accelerator for
AES Cryptography. In ICSPC 2007, pages 65–68. IEEE, November 2007.
14. National Institute of Standards and Technology (NIST).
FIPS-197: Advanced Encryption Standard (AES), 2001.
http://www.csrc.nist.gov/publications/fips/fips197/fips-197.pdf.
15. NVIDIA. NVIDIA GeForce 8800 GPU Architecture Overview. Technical Brief
TB-02787-001 v0, 9, 2006.
16. NVIDIA. NVIDIA CUDA Programming Guide 2.3, 2009.
17. D. A. Osvik. Cell SPEED. SPEED 2007, 2007.
http://www.hyperelliptic.org/SPEED/slides/Osvik cell-speed.pdf.
18. D. Otte. AVR-Crypto-Lib, 2009. http://www.das-labor.org/wiki/Crypto-avr-lib/en.
19. J. Owens. GPU architecture overview. In SIGGRAPH 2007, page 2. ACM, 2007.
20. B. Poettering. AVRAES: The AES block cipher on AVR controllers, 2006.
http://point-at-infinity.org/avraes/.
21. S. Rinne, T. Eisenbarth, and C. Paar. Performance Analysis of Contempo-
rary Light-Weight Block Ciphers on 8-bit Microcontrollers. SPEED 2007, 2007.
http://www.hyperelliptic.org/SPEED/record.pdf.
22. M. Segal and K. Akeley. The OpenGL graphics system: A specification (version
2.0). Silicon Graphics, Mountain View, CA, 2004.
23. K. Shimizu, D. Brokenshire, and M. Peyravian. Cell Broadband Engine Support
for Privacy, Security, and Digital Rights Management Applications.
https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/3F88DA69A1C0AC40872570AB00570985,
October 2005.
24. O. Takahashi, R. Cook, S. Cottier, S. H. Dhong, B. Flachs, K. Hirairi, A. Kawa-
sumi, H. Murakami, H. Noro, H. Oh, S. Onish, J. Pille, and J. Silberman. The
circuit design of the synergistic processor element of a Cell processor. In ICCAD
2005, pages 111–117. IEEE Computer Society, 2005.
25. S. Tillich and C. Herbst. Boosting AES Performance on a Tiny Processor Core.
In CT-RSA, volume 4964 of LNCS, pages 170–186, 2008.
26. J. Yang and J. Goodman. Symmetric Key Cryptography on Modern Graphics
Hardware. In Asiacrypt 2007, volume 4833 of LNCS, pages 249–264, 2007.