Ultra-Low Power and High Speed Design and Implementation of AES and SHA1 Hardware Cores in 65 Nanometer CMOS Technology

Feng Ge, Pranjal Jain and Ken Choi
Department of Electrical and Computer Engineering
Illinois Institute of Technology
Email: {fge2, pjain13, and kchoi12}

978-1-4244-3355-1/09/$25.00 ©2009 IEEE. Authorized licensed use limited to: Illinois Institute of Technology. Downloaded on September 27, 2009 at 21:32 from IEEE Xplore. Restrictions apply.

Abstract: This paper describes the design and implementation of low-power and high-speed security hardware cores for the Advanced Encryption Standard (AES) and the Secure Hash Algorithm (SHA1). We propose three Register Transfer Level (RTL) circuit techniques, namely, Application Specific Register Reduction (ASRR), Locally Explicit Clock Enabling (LECE), and Bus Specific Clock (BSC). LECE and BSC can be applied directly in any ASIC design flow and at any technology node. With a 65 nanometer industry technology, our proposed schemes demonstrated at RTL and gate level that, for AES, 44.57% total power reduction (dynamic and cell leakage power), 10.43% area reduction, and 5.78 Gbps throughput with 452 MHz circuit speed are achieved and, for SHA1, 63.26% total power reduction and 12.72% area reduction with 1.28 GHz circuit speed are achieved.

I. INTRODUCTION

As the demand for secure communications increases, high-throughput, low-power en/decryption on both wired and wireless networks is becoming more necessary. Today, providing security is one of the major concerns, especially in the design of wireless systems such as wireless sensor networks and RFID.

In general, such applications require the development of scalable, ultra-low power and low cost architectures [3, 11]. Security systems form the backbone of such sensor networks and must protect against threats to data integrity, eavesdropping, and impersonation. So, the main aim is to implement low power, high throughput and low area cryptography algorithms such as the Advanced Encryption Standard (AES) [1, 2] and the Secure Hash Algorithm (SHA1) [10] effectively.

Conventionally, research has mainly focused on developing pipelined and loop-unrolled AES designs [5, 6]. There has also been research on full-custom AES S-Box design and on AES ASIC designs with varying data paths [7]. Round keys are generated on the fly either by sharing the S-Box with the main data path [7] or by dedicating an S-Box to the key expansion. Architectures are exploited in feedback modes of operation in SHA1. Thus, we observe that the above references mainly focus on area-efficient implementation or on increasing throughput using architecture reconfigurations [4, 8, and 9].

Traditionally, power dissipation of VLSI chips was neglected, because device density and operating frequency were low enough that power was not a constraining factor. As the scale of integration improves, more transistors, faster and smaller than their predecessors, are being packed into a chip. This leads to a steady growth of the operating frequency and processing capacity per chip, resulting in increased power dissipation. Nowadays, power-aware design techniques at the early stages of the design abstraction hierarchy, such as the register transfer level (RTL), are getting more attention.

In this paper, we have implemented ultra-low power AES and SHA1 hardware cores with emphasis on power reduction techniques at RTL, so that system designers can easily implement very low-power and high-performance security systems toward fabrication in CMOS by using our soft-IP at RTL at an early stage of the ASIC design flow.

II. BACKGROUND

A. AES Algorithm

AES [1, 2] is a symmetric cipher that processes data in 128-bit blocks. It supports key sizes of 128, 192 and 256 bits and consists of 10, 12 and 14 iterations, respectively. Each round mixes the data with a round key, which is generated from the encryption key. We consider only 128-bit keys and 10 iterations.

The cipher maintains an internal 4x4 matrix of bytes, called the state, on which operations are performed. Initially, the state is filled with the input data block and XORed with the encryption key. Regular rounds consist of operations called SubBytes, ShiftRows, MixColumns, and AddRoundKey as shown in Figure 1 (a). The last round bypasses MixColumns.

(1) SubBytes: The SubBytes transformation is a nonlinear substitution operation that works on bytes. Each byte of the input state is replaced using the same substitution function (called the S-Box). The S-Box is defined as the multiplicative inverse in the Galois Field $GF(2^8)$ with the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$, followed by an affine transformation. The InvSubBytes transformation, which is needed for decryption, is the inverse of the affine transformation followed by the same inversion as in the SubBytes transformation, as shown in Figure 1 (b).

(2) ShiftRows: The ShiftRows transformation rotates each row of the input state to the left, whereby the offset of the rotation corresponds to the row number. The inverse of this transformation, InvShiftRows, is computed by performing the corresponding rotations to the right, as shown in Figure 1 (b).
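The SubBytes and ShiftRows definitions above can be illustrated with a short behavioral sketch. This is not the authors' RTL: it is a plain Python model, and the function names (`gf_mul`, `gf_inv`, `sub_byte`, `shift_rows`, `inv_shift_rows`) as well as the choice to hold the state as a list of four rows are assumptions made for illustration only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # reduce when the degree reaches 8
            a ^= 0x11B         # 0x11B encodes m(x)
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse computed as a^254; 0 is mapped to 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sub_byte(a):
    """S-Box: field inversion followed by the AES affine transformation."""
    inv = gf_inv(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


def shift_rows(state):
    """Rotate row r of the 4x4 state left by r positions."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def inv_shift_rows(state):
    """Rotate row r of the 4x4 state right by r positions."""
    return [row[len(row) - r:] + row[:len(row) - r] for r, row in enumerate(state)]


# Illustrative state: entry values are arbitrary labels, not real data.
state = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
shifted = shift_rows(state)
restored = inv_shift_rows(shifted)
```

A hardware S-Box would normally be a lookup table or composite-field logic; the loop-based inversion here only mirrors the mathematical definition.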
Figure 1: AES Algorithm. (a) Encryption. (b) Decryption.

(3) MixColumns: The MixColumns transformation maps each column of the input state to a new column in the output state. Each input column is considered as a polynomial over $GF(2^8)$ and multiplied with the constant polynomial $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$ modulo $x^4 + 1$. The coefficients of $a(x)$ are also elements of $GF(2^8)$ and are represented by hexadecimal values in this equation. The InvMixColumns transformation is the multiplication of each column with $a^{-1}(x) = \{0B\}x^3 + \{0D\}x^2 + \{09\}x + \{0E\}$ modulo $x^4 + 1$, as shown in Figure 1 (b).

(4) AddRoundKey: The AddRoundKey transformation is self-inverting. It maps a 128-bit input state to a 128-bit output state by XORing the input state with a 128-bit round key. Please refer to Figure 1.

B. SHA1 Algorithm

The algorithm takes as input a message with a maximum length of less than $2^{64}$ bits and produces as output a 160-bit message digest, as shown in Figure 2. The input is processed in 512-bit blocks. The algorithm processing includes the following steps:

(1) Padding: The purpose of message padding is to make the total length of a padded message congruent to 448 modulo 512 (length ≡ 448 mod 512). The number of padding bits should be between 1 and 512. Padding consists of a single 1-bit followed by the necessary number of 0-bits.

(2) Appending Length: A 64-bit binary representation of the original length of the message is appended to the end of the message.

(3) Initialize the SHA-1 buffer: The 160-bit buffer is represented by five 32-bit words (A, B, C, D, E) used to store the intermediate or final results of the message digest for the SHA-1 functions. They are initialized to the following values in hexadecimal. High-order bytes are written first.
Word A: 67 45 23 01
Word B: EF CD AB 89
Word C: 98 BA DC FE
Word D: 10 32 54 76
Word E: C3 D2 E1 F0

Figure 2. Secure Hash Algorithm (SHA1)

(4) Process message in 16-word blocks: The heart of the algorithm is a module that consists of four rounds of processing of 20 steps each. These rounds take as input the current 512-bit block and the 160-bit buffer value (A, B, C, D, E), and then update these buffers. The four rounds have a similar structure, but each uses a different primitive logical function. These logical functions are defined as follows:

\[
f(B, C, D) =
\begin{cases}
(B \wedge C) \vee (\neg B \wedge D) & 0 \le t \le 19 \\
B \oplus C \oplus D & 20 \le t \le 39 \\
(B \wedge C) \vee (B \wedge D) \vee (C \wedge D) & 40 \le t \le 59 \\
B \oplus C \oplus D & 60 \le t \le 79
\end{cases}
\]

Each round also makes use of an additive constant $K_t$. The values in hexadecimal are shown below.

\[
K_t =
\begin{cases}
\text{5A827999} & 0 \le t \le 19 \\
\text{6ED9EBA1} & 20 \le t \le 39 \\
\text{8F1BBCDC} & 40 \le t \le 59 \\
\text{CA62C1D6} & 60 \le t \le 79
\end{cases}
\]

III. PROPOSED APPROACHES AND IMPLEMENTATIONS

We have implemented both AES and SHA1 at RTL by using the following three low-power techniques and synthesized them. The performance is demonstrated in terms of power, area, speed, and throughput at RTL and also at gate level:
A) Application Specific Register Reduction (ASRR)
B) Locally Explicit Clock Enabling (LECE)
C) Bus Specific Clock (BSC)

A. Application Specific Register Reduction (ASRR)

Figure 3 illustrates our implementation for the decryption part of the AES core. AES takes a 128-bit data block as input and performs several different transformations on this block. AES encryptions and decryptions are based on four different transformations that are performed repeatedly in a
certain sequence as shown in Figure 1. Each of these transformations, which are described in Section II, maps a 128-bit input state to a 128-bit output state.

Figure 3. Application Specific Register Reduction (ASRR)

For an AES-128 encryption, the 128-bit cipher key needs to be expanded to eleven 128-bit round keys. The principal idea of this key expansion is that the first round key, $\mathrm{Roundkey}(k_0)$, corresponds to the cipher key. All subsequent round keys are derived from their respective predecessor using a function $f$, so $\mathrm{Roundkey}(k_i) = f(\mathrm{Roundkey}(k_{i-1}))$ for all $0 < i < 11$. For an AES-128 decryption, the same round keys are used in reversed order. Using the inverse of the key expansion function, $f^{-1}$, the round keys can be derived recursively from $\mathrm{Roundkey}(k_{10})$ and are stored in the Key Reverse Buffer, using just 6 registers instead of 10.

In the AddRoundKey step, a new sub-key is generated according to the previous sub-key. The Key Generation Schedule is shown in Figure 4. According to the round numbers, there are 10, 12, or 14 sub-keys involved in encryption. We have implemented the generation of 10 sub-keys.

Figure 4. Key Generation Block

The decryption process is the reverse of encryption, and sub-keys are used in reverse order. The conventional way to implement this is to generate the last key with the encryption Key Expansion Module, and then use a reverse Key Expansion Module to generate each sub-key in reverse order, as shown in Figure 5. However, this method requires a large extra circuit and a large S-box. In this paper, we propose a novel way to greatly reduce the number of registers by generating all sub-keys in the encryption Key Expansion Module and storing them in registers or RAMs before decryption begins.

Figure 5. Original Key Reverse Buffer

In our proposed architecture, we share as much as possible with the encryption circuit, and the registers can be reduced as shown in Figure 6. Sub-key Ki is generated and stored into Regi at the i-th clock cycle, where i runs from 1 to 11. Notice that these 11 registers are only used once in decryption; therefore, we can reduce their number to 6. Sub-keys are stored into registers from the 5th clock cycle. Sub-keys K0 to K4 are generated and stored into registers after decryption begins. The multiplexers before the registers are controlled by the decryption begin signal "de".

Figure 6. ASRR for the Key Reverse Buffer

The timing sequence of the registers is shown in Figure 7. At the fifth clock, we store the key K5 in R0, at the next clock the key K6 in R1, and so on until we store the key K10 in R5. Now decryption starts, and we use the key K10 previously stored in R5 at the first clock cycle of the decryption. At the same time, the key K1 is generated and stored in R5. In the next cycle, the second clock cycle of the decryption, we use the key K9 previously stored in R4. At the same time, the key K2 is generated and stored in R4. We repeat the operation until the key K6, previously stored in R1, is used at the fourth clock cycle of the decryption and the key K4 is generated and stored in R1. By using this mechanism, we can save five 128-bit registers. In this paper we call this the ASRR (Application Specific Register Reduction) scheme.
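The round-key relation Roundkey(k_i) = f(Roundkey(k_{i-1})) and its inverse f^-1 used in this section can be sketched as follows for AES-128. This is a behavioral Python model under stated assumptions, not the authors' Key Expansion Module: the names `next_round_key` and `prev_round_key` are hypothetical, round keys are held as four 32-bit words, and the register scheduling of ASRR itself is not modeled. The cipher key is the example key from the AES standard.

```python
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def _sbox_entry(a):
    """One S-Box entry: brute-force field inverse, then the affine transform."""
    inv = next((x for x in range(1, 256) if gf_mul(a, x) == 1), 0)
    out = inv
    for shift in range(1, 5):
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63


SBOX = [_sbox_entry(a) for a in range(256)]


def sub_word(w):
    """Apply the S-Box to each byte of a 32-bit word."""
    return (SBOX[w >> 24] << 24) | (SBOX[(w >> 16) & 0xFF] << 16) \
        | (SBOX[(w >> 8) & 0xFF] << 8) | SBOX[w & 0xFF]


def rot_word(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) & 0xFFFFFFFF) | (w >> 24)


def next_round_key(prev, i):
    """f: derive round key i (1..10) from round key i-1."""
    t = sub_word(rot_word(prev[3])) ^ (RCON[i - 1] << 24)
    w0 = prev[0] ^ t
    w1 = prev[1] ^ w0
    w2 = prev[2] ^ w1
    w3 = prev[3] ^ w2
    return [w0, w1, w2, w3]


def prev_round_key(cur, i):
    """f^-1: derive round key i-1 from round key i (1..10)."""
    p3 = cur[3] ^ cur[2]
    p2 = cur[2] ^ cur[1]
    p1 = cur[1] ^ cur[0]
    p0 = cur[0] ^ sub_word(rot_word(p3)) ^ (RCON[i - 1] << 24)
    return [p0, p1, p2, p3]


cipher_key = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]

# Forward expansion: K0 .. K10, as used for encryption.
keys = [cipher_key]
for i in range(1, 11):
    keys.append(next_round_key(keys[-1], i))

# Reverse walk: K10 .. K0, the order needed for decryption.
walk = [keys[10]]
for i in range(10, 0, -1):
    walk.append(prev_round_key(walk[-1], i))
```

In this model `walk` reproduces the forward keys in reverse order, which is the sequence a key reverse buffer has to deliver to the decryption data path.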
Figure 7. Timing and Waveform View for the Proposed ASRR Scheme

B. Locally Explicit Clock Enabling (LECE)

In general, RTL code whose output depends on some initial condition is synthesized into a flip-flop with a MUX in its feedback path. We have removed the MUX in the feedback loop by implementing a gated clock. LECE differs from traditional clock gating in two ways: i) traditional clock gating during synthesis inserts clock gating cells globally, based on the maximum fanout number and the maximum bus width, so it is far from the optimal solution; and ii) LECE judiciously investigates the clock signal and the enable signal, and then finds which registers should be clock gated for the optimal solution in terms of total, dynamic, and leakage power. We have implemented this technique mainly in the Key Expansion Unit and the Key Reverse Buffer block of the decryption module of AES.

The control block of the AES core performs several functions. One of its important functions is to keep track of the number of rounds and of the sub-keys generated by the key expansion unit. We have considered a 128-bit key and hence have to keep a count of 10. Consider Figure 8 (a), in which we get the 'kcnt' output on a rising edge of 'clk', but only when the signal 'kld' or 'kb_ld' is high. If the enable signal is low for a significant portion of circuit operation, and since 'D = 10' and 'kcnt' are multi-bit buses, a substantial amount of the power dissipated by the clock driver is wasted. We have implemented a technique that gates the clock and thus reduces the power dissipation by a significant percentage.

As shown in Figure 8 (b), we replace the clock input to the flip-flop with an AND gate whose inputs are the clock and the 'EN = kld | kb_ld' signal. We have used a latch so that, while the clock is high, no activity on the enable is transferred to the clock input. We implemented our technique at RTL, so that we obtain a new module as shown in Figure 8 (b).

Figure 8. Implementation of Locally Explicit Clock Enabling (LECE): (a) 1-bit of initial control block; (b) after implementing LECE

C. Bus Specific Clock (BSC)

The schematic and timing diagram in Figure 9 show a register whose data input is active during one phase of operation only and does not change for a long period of time. The main goal of this technique is first to find buses in the design that have low switching activity. If we can then create a clock enable signal by detecting changes on the bus, we can save power.

Figure 9. Data Bus Specific Clock

In the AES algorithm, there is a potential candidate residing inside the Key Expansion Unit. To generate the sub-key Roundkey[i], we XOR the previous key, Roundkey[i-1], with Rcon[i] and the subword.

Figure 10. RCON Implementation

Here Rcon[i] is a 32-bit bus whose output 'out[31:0]' takes the values 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1b, 0x36 for the 10 rounds respectively. Thus, we can observe that the 24 least significant bits, out[23:0], are infrequently used. In Figure 11 (a), we can see that out[31:0] (data) is active for a very small amount of time, while the clock is applied continuously. This results in a lot of power dissipation in the clock driver as well as in the circuitry inside the register. We can avoid this bottleneck by constructing an enable signal that detects changes on the bus; please refer to Figure 11 (b). We XOR the next state of each bit with the previous one to check whether they are the same, and then an N-bit OR is used to determine whether any bit has changed. If no bit has changed, there is no point in enabling the clock. The latch is used to avoid any glitches at the AND output; otherwise an accidental clock pulse could be applied to the register and switch it, which is undesirable.
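The two enable conditions described above can be sketched with a small cycle-based Python model that counts how many clock edges a gated register actually receives. This is only an illustration under stated assumptions, not the authors' RTL: latches and glitches are not modeled, the function names are hypothetical, the rule that 'kld' loads 10 while 'kb_ld' decrements is an assumption read off Figure 8, and placing the Rcon byte in out[31:24] with each value held for four cycles is an assumed stimulus.

```python
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def lece_counter(signals):
    """LECE sketch: the counter register is clocked only when EN = kld | kb_ld.

    `signals` is a list of (kld, kb_ld) pairs, one per clock cycle.
    Assumption: kld loads the count 10 and kb_ld decrements it.
    Returns the final count and the number of clock edges delivered.
    """
    kcnt, edges = 0, 0
    for kld, kb_ld in signals:
        if kld | kb_ld:                      # explicit local enable
            kcnt = 10 if kld else kcnt - 1
            edges += 1
    return kcnt, edges


def bsc_enable(current, nxt, width):
    """BSC sketch: XOR next state with current state, then OR-reduce."""
    mask = (1 << width) - 1
    return ((current ^ nxt) & mask) != 0


def bsc_register(next_values, width):
    """Clock a `width`-bit register only in cycles where its input changes."""
    state, edges = 0, 0
    for nxt in next_values:
        if bsc_enable(state, nxt, width):
            state = nxt & ((1 << width) - 1)
            edges += 1
    return state, edges


# Hypothetical control stimulus: one load, three idle cycles, ten decrements.
control = [(1, 0)] + [(0, 0)] * 3 + [(0, 1)] * 10
kcnt, kcnt_edges = lece_counter(control)

# Hypothetical Rcon stimulus: each 32-bit word is held for four cycles.
words = [r << 24 for r in RCON for _ in range(4)]
upper_state, upper_edges = bsc_register([w >> 24 for w in words], 8)
lower_state, lower_edges = bsc_register([w & 0xFFFFFF for w in words], 24)
total_cycles = len(words)
```

Under these assumptions the 24-bit lower register never receives a clock edge, while an ungated register would be clocked in every one of the `total_cycles` cycles; this is the kind of saving the enable signal is meant to capture.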
Figure 11. Implementation of Bus Specific Clock (BSC): (a) 32-bit of initial RCON block; (b) after implementing BSC

IV. SIMULATION RESULTS

We designed and implemented the AES and the SHA1 cores in Verilog at RTL and synthesized them to the gate level using a 65 nm, 1.0 Volt, standard-cell CMOS technology. We used PowerTheater for power analysis, NC-Verilog for RTL simulation, Design Compiler for synthesis, and Power Compiler for traditional clock-gating implementation. We have included the results for power, area and speed at RTL and also at gate level. The following tables compare our results with the previous compact ASIC designs for AES and SHA1.

A. Comparison Results for AES

After doing an initial power analysis at RTL, we applied the three power reduction techniques to the AES core at RTL; the results are tabulated in Tables 1-4. We can observe that with a 65 nanometer industry technology, our proposed schemes demonstrated 45.6% total power reduction (dynamic and cell leakage power) at RTL and 44.57% total power reduction, 10.43% area reduction, and 5.78 Gbps throughput with 452 MHz circuit speed at gate level. Table 1 shows the power reduction results at RTL, Table 2 shows the power reduction results at gate level, Table 3 shows the area reduction, and Table 4 shows the maximum circuit speed and throughput of the AES implementation, compared with the conventional design method and traditional clock-gating design.

Table 1. RTL Power Dissipation Comparisons for AES
(Static power: internal, clock, and total leakage. Dynamic power: internal, clock, and total dynamic.)

AES RTL                          Internal  Clock    Total    Internal  Clock    Total    Total
                                 leakage   leakage  leakage  dynamic   dynamic  dynamic
Original                         5.72uW    74.4nW   5.79uW   14.9mW    2.99mW   17.8mW   17.9mW
Register Reduction               5.41uW    59.5nW   5.47uW   13.5mW    2.38mW   15.8mW   15.8mW
Explicit Clock Enable            5.42uW    70.5nW   5.49uW   8.91mW    999uW    9.91mW   9.92mW
Bus Specific Clock               5.79uW    73.8nW   5.86uW   14.7mW    2.93mW   17.7mW   17.7mW
Combining all three techniques   5.31uW    56.2nW   5.37uW   9.12mW    916uW    10mW     10mW
LECE & BSC                       5.49uW    69.9nW   5.56uW   8.77mW    947uW    9.72mW   9.73mW

Table 2. Gate-level Power Dissipation Comparisons for AES (after synthesis with 65 nm tech.)

Table 3. Gate-level Area Comparisons for AES (after synthesis with 65 nm tech.)
(Min inverter area: 1.08)

AES GATE                         Combinational   Sequential     Total
Original                         58352.609375    23870.750000   82222.562500
Traditional Clock Gating         58340.003906    18760.798828   77100.484375
Register Reduction               57282.164062    19723.982422   77005.445312
Explicit Clock Enable            58326.691406    18769.437500   77095.804688
Bus Specific Clock               58588.035156    23880.830078   82468.078125
Combining all three techniques   57556.742188    16087.537109   73643.757812  (10.43% reduction)
LECE & BSC                       58562.148438    18779.515625   77341.320312

Table 4. Gate-level Delay and Throughput Comparisons for AES (after synthesis with 65 nm tech.)

AES GATE                         Critical path   Frequency (MHz,          Throughput
                                 delay (ns)      10% slack margin)        (Gb/sec)
Original                         1.99            452                      5.78
Traditional Clock Gating         1.99            452                      5.78
Register Reduction               2.03            443                      5.67
Explicit Clock Enable            1.99            452                      5.78
Bus Specific Clock               1.99            452                      5.78
Combining all three techniques   1.99            452                      5.78
LECE & BSC                       1.99            452                      5.78

B. Comparison Results for SHA1

We applied the power reduction techniques to the SHA1 core at RTL; the results are tabulated in Tables 5-8. We can observe that with a 65 nanometer industry technology, our proposed schemes demonstrated 65.33% total power reduction (dynamic and cell leakage power) at RTL and 63.26% total power reduction and 12.72% area reduction, without compromising the speed of 1.28 GHz, at gate level. Table 5 shows the power reduction results at RTL, Table 6 shows the power reduction results at gate level, Table 7 shows the area reduction, and Table 8 shows the maximum circuit speed of the SHA1 implementation, compared with the
conventional design method and traditional clock-gating design.

Table 5. RTL Power Dissipation Comparisons for SHA1
(Static power: internal, clock, and total leakage. Dynamic power: internal, clock, and total dynamic.)

SHA1 RTL                         Internal  Clock    Total    Internal  Clock    Total    Total
                                 leakage   leakage  leakage  dynamic   dynamic  dynamic
Original                         2.34uW    66.7nW   2.41uW   4.15mW    1.09mW   5.24mW   5.25mW
Explicit Clock Enable            2.15uW    64.9nW   2.21uW   1.79mW    333uW    2.12mW   2.12mW
Bus Specific Clock               2.33uW    66.4nW   2.4uW    3.87mW    1.02mW   4.9mW    4.9mW
Combining the two techniques     2.14uW    64.8nW   2.21uW   1.56mW    266uW    1.82mW   1.82mW

Table 6. Gate-level Power Dissipation Comparisons for SHA1 (after synthesis with 65 nm tech.)

SHA1 GATE                        Internal  Clock    Total    Internal  Clock    Total    Total
                                 leakage   leakage  leakage  dynamic   dynamic  dynamic
Original                         1.99uW    65.8nW   2.05uW   5.01mW    1.36mW   6.37mW   6.37mW
Traditional Clock Gating         1.97uW    52.4nW   2.02uW   2.17mW    501uW    2.67mW   2.68mW
Explicit Clock Enable            1.96uW    63.9nW   2.02uW   2.16mW    520uW    2.68mW   2.68mW
Bus Specific Clock               2.08uW    64.8nW   2.15uW   4.75mW    1.28mW   6.03mW   6.03mW
Combining all three techniques   2.02uW    63.1nW   2.08uW   1.9mW     441uW    2.34mW   2.34mW

Table 7. Gate-level Area Comparisons for SHA1 (after synthesis with 65 nm tech.)
(Min inverter area: 1.08)

SHA1 GATE                        Combinational   Sequential     Total
Original                         6671.159668     21681.474609   28352.880859
Traditional Clock Gating         6381.723145     17775.910156   24157.800781
Explicit Clock Enable            6350.041504     17784.912109   24135.121094
Bus Specific Clock               7391.526855     21691.556641   29083.320312
Combining all three techniques   6934.683105     17809.392578   24744.240234

Table 8. Gate-level Delay Comparisons for SHA1 (after synthesis with 65 nm tech.)

SHA1 GATE                        Critical path   Frequency (GHz,
                                 delay (ns)      10% slack margin)
Original                         0.7             1.28
Traditional Clock Gating         0.7             1.28
Explicit Clock Enable            0.7             1.28
Bus Specific Clock               0.7             1.28
Combining all three techniques   0.7             1.28

V. CONCLUSION

In this paper we presented the design and implementation of compact AES and SHA1 ASIC cores suitable for wireless sensor networks and RFID. Compared to previous designs, we achieved significantly lower power and lower area for both AES and SHA1 by using the proposed design techniques. We implemented the proposed ASRR (application specific register reduction), LECE (locally explicit clock enabling), and BSC (bus specific clock) at RTL and evaluated them at gate level in an ASIC flow. The RTL soft-Intellectual Property generated with the techniques in this paper can be used directly in any ASIC design flow and can be applied at any technology node. With a 65 nanometer industry technology, our proposed schemes demonstrated 44.57% power reduction, 10.43% area reduction, and 5.78 Gbps throughput with 452 MHz circuit speed for AES, and 63.26% power reduction and 12.72% area reduction at 1.28 GHz circuit speed for SHA1.

VI. ACKNOWLEDGMENT

The authors gratefully acknowledge the contribution of the reviewers' comments.

VII. REFERENCES

[1] National Institute of Standards and Technology (U.S.), Advanced Encryption Standard.
[2] J. Daemen and V. Rijmen, "AES Proposal: Rijndael," NIST AES Proposal, June 1998.
[3] MooSeop Kim, Juhan Kim, Yongje Choi, "Low Power Circuit Architecture of AES Crypto Module for Wireless Sensor Network," in Proc. of World Academy of Science, Engineering and Technology, Vol. 8, October 2005, ISSN 1307-6884.
[4] Alireza Hodjat, David D. Hwang, Bocheng Lai, Kris Tiri, Ingrid Verbauwhede, "A 3.84 Gbits/s AES Crypto Coprocessor with Modes of Operation in a 0.18-μm CMOS Technology," GLSVLSI'05, April 17–19, 2005, Chicago, Illinois, USA.
[5] T. Good and M. Benaissa, "AES on FPGA from the fastest to the smallest," in Proc. 7th Int. Workshop on Cryptographic Hardware and Embedded Systems (CHES 2005), pages 427–440, Edinburgh, UK, Aug. 29–Sept. 1, 2005.
[6] P. Hämäläinen, M. Hännikäinen, and T. Hämäläinen, "Efficient hardware implementation of security processing for IEEE 802.15.4 wireless networks," in Proc. 48th IEEE Int. Midwest Symp. on Circuits and Systems (MWSCAS 2005), pages 484–487, Cincinnati, OH, USA, Aug. 7–10, 2005.
[7] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A compact Rijndael hardware architecture with S-box optimization," in Proc. 7th Int. Conf. on Theory and Application of Cryptology and Inf. Secur., Advances in Cryptology (ASIACRYPT 2001), pages 239–254, Gold Coast, Australia, Dec. 9–13, 2001.
[8] C. Su, T. Lin, C. Huang, and C. Wu, "A High-Throughput Low-cost AES processor," IEEE Communication Magazine, Vol. 41, Issue 12, pp. 86-91, December 2003.
[9] S. Morioka, A. Satoh, "A 10-Gbps Full-AES Design with a Twisted BDD S-Box Architecture," IEEE Transaction on VLSI, Vol. 12, No. 7, July 2004.
[10] FIPS 180-1, Secure Hash Standard, NIST, US Department of Commerce, Washington D.C., April 1995.
[11] G. Asada, M. Dong, T. S. Lin, F. Newberg, G. Pottie, W. J. Kaiser, "Wireless Integrated Network Sensors: Low Power Systems on a Chip," Solid-State Circuits Conference, 1998. ESSCIRC '98. Proceedings of the 24th European.