
Published in: Technology, Business


Transcript

  • 1. Cryptographic Algorithms and their Implementations
    • Discussion of how to map different algorithms to our architecture
      • Public-Key Algorithms (Modular Exponentiation)
      • Rijndael
      • Serpent
      • Others (Mars, RC6, Twofish, etc.)
  • 2. Modular Exponentiation
    • Square and Multiply Algorithm for Modular Exponentiation
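As a point of reference, here is a minimal Python sketch of right-to-left square-and-multiply using ordinary modular arithmetic (no Montgomery form yet). The function name `square_and_multiply` is illustrative and does not come from the slides.

```python
def square_and_multiply(base, exponent, modulus):
    """Right-to-left binary exponentiation: base**exponent mod modulus."""
    result = 1 % modulus          # running product
    square = base % modulus       # base^(2^i) at iteration i
    while exponent > 0:
        if exponent & 1:          # current exponent bit is 1: multiply it in
            result = (result * square) % modulus
        square = (square * square) % modulus
        exponent >>= 1
    return result
```

Each exponent bit costs one squaring plus, when the bit is set, one multiplication, which is why the speed of modular multiplication dominates the overall cost.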
  • 3. Modular Exponentiation
    • Montgomery Modular Multiplication
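For reference, the standard definition of the Montgomery product, stated under the usual assumptions of an odd modulus $M$ and a power-of-two radix $R$. The symbol names are chosen here for illustration; the slide itself gives only the title.

```latex
\[
\operatorname{MontMult}(A, B) \;\equiv\; A \cdot B \cdot R^{-1} \pmod{M},
\qquad R = 2^{m} > M,\quad \gcd(R, M) = 1 .
\]
\[
\tilde{x} = x R \bmod M
\;\Longrightarrow\;
\operatorname{MontMult}(\tilde{a}, \tilde{b}) \equiv (ab)\,R = \widetilde{ab} \pmod{M}.
\]
```

Because $R$ is a power of two, the division by $R$ reduces to right shifts, which is what makes the operation attractive in hardware.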
  • 4. Modular Exponentiation
    • Several Approaches to implementing Modular Multiplication:
      • Redundant Representation based (e.g. Carry-save)
      • Residue Number System based.
      • Systolic Array Based.
    • Word-based implementations are preferable, due to their similarity with symmetric-key algorithms
      • Rules out systolic arrays
  • 5. Modular Exponentiation
    • The most popular and fastest implementations were based on carry-save representation.
    • Carry-save based implementations were also word-oriented.
    • We selected the fastest, simplest implementation:
      • Simplicity and homogeneity in the algorithms are extremely beneficial when designing a custom reconfigurable fabric.
      • Performance when implemented on Xilinx Virtex FPGAs: almost 5 Mb/s (the highest reported figure we could find)
  • 6. Modular Exponentiation
    • Five-to-two Multiplier Modular Exponentiation (P, E, M)
    • K = 2^(2k) mod M … computed externally
    • 1. P1_0, P2_0 = 5to2_MontMult(K, 0, 1, 0, M),
    •    Z1_0, Z2_0 = 5to2_MontMult(K, 0, P, 0, M);
    • 2. FOR i = 0 to n-1 DO
    • 3.   Z1_{i+1}, Z2_{i+1} = 5to2_MontMult(Z1_i, Z2_i, Z1_i, Z2_i, M)
    • 4.   IF e_i = 1 THEN
    •        P1_{i+1}, P2_{i+1} = 5to2_MontMult(P1_i, P2_i, Z1_i, Z2_i, M)
    •      ELSE
    •        P1_{i+1}, P2_{i+1} = P1_i, P2_i
    • 5. ENDFOR
    • 6. P1_n, P2_n = 5to2_MontMult(1, 0, P1_{n-1}, P2_{n-1}, M)
    • 7. P = P1_n + P2_n
    • 8. RETURN P
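The control flow of the listing above can be sketched in Python as follows. Several things here are assumptions rather than details from the slides: `mont_mult` is a plain-integer stand-in for `5to2_MontMult` that returns a degenerate pair `(value, 0)`; the iteration count `m` is taken as the modulus bit length plus 2 so that intermediate values stay below 2M without a final subtraction; the slide's `k` is read as that same iteration count; and the closing conversion is applied to the accumulator left by the loop, followed by a defensive `% M`.

```python
def mont_mult(a1, a2, b1, b2, M, m):
    """Stand-in for 5to2_MontMult: the returned pair sums to A*B*2^-m mod M
    (possibly plus M), where A = a1 + a2 and B = b1 + b2. M must be odd."""
    A, B = a1 + a2, b1 + b2
    S = 0
    for i in range(m):
        a_i = (A >> i) & 1
        q_i = (S + a_i * B) & 1            # makes the next sum even
        S = (S + a_i * B + q_i * M) >> 1   # exact division by 2
    return S, 0                            # degenerate carry-save pair

def mod_exp(P, E, M):
    """Right-to-left Montgomery exponentiation: P**E mod M for odd M, 0 <= P < M."""
    m = M.bit_length() + 2                 # assumption: guarantees 2^m > 4M
    K = pow(2, 2 * m, M)                   # "computed externally" on the slide
    P1, P2 = mont_mult(K, 0, 1, 0, M, m)   # 1 in Montgomery form
    Z1, Z2 = mont_mult(K, 0, P, 0, M, m)   # P in Montgomery form
    for i in range(E.bit_length()):
        next_Z = mont_mult(Z1, Z2, Z1, Z2, M, m)       # square
        if (E >> i) & 1:
            P1, P2 = mont_mult(P1, P2, Z1, Z2, M, m)   # multiply by the old Z
        Z1, Z2 = next_Z
    P1, P2 = mont_mult(1, 0, P1, P2, M, m) # leave Montgomery form
    return (P1 + P2) % M                   # % M only matters if the sum equals M
```

For example, `mod_exp(4, 13, 497)` agrees with Python's built-in `pow(4, 13, 497)`.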
  • 7. Modular Exponentiation
    • Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, M)
    • 1. S1_0, S2_0 = 0, 0
    • 2. FOR i = 0 to m-1 DO
    • 3.   q_i = [(S1_i + S2_i) + A_i*(B1 + B2)] mod 2
    • 4.   S1_{i+1}, S2_{i+1} = CSR[(S1_i + S2_i) + A_i*(B1 + B2) + q_i*M] div 2
    • 5. ENDFOR
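A bit-level Python sketch of the loop above, in which the running sum is kept as a carry-save pair and each iteration compresses five operands to two with three layers of 3-to-2 carry-save adders. The helper names (`csa`, `compress_5to2`, `mont_mult_5to2`) are illustrative, and resolving `A = A1 + A2` up front to read its bits is an assumption made for brevity; the slides generate A_i in dedicated logic.

```python
def csa(x, y, z):
    """3-to-2 carry-save adder on integers: x + y + z == s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def compress_5to2(v, w, x, y, z):
    """Reduce five operands to a (sum, carry) pair with the same total."""
    s1, c1 = csa(v, w, x)
    s2, c2 = csa(s1, y, z)
    return csa(s2, c1, c2)

def mont_mult_5to2(A1, A2, B1, B2, M, m):
    """Returns (S1, S2) with S1 + S2 == A*B*2^-m (mod M), for odd M."""
    A = A1 + A2                                   # assumption: bits of A resolved here
    S1, S2 = 0, 0
    for i in range(m):
        a_i = (A >> i) & 1
        t1, t2 = a_i * B1, a_i * B2               # A_i * (B1 + B2), kept as two operands
        q_i = (S1 ^ S2 ^ t1 ^ t2) & 1             # parity of the sum, from the LSBs only
        s, c = compress_5to2(S1, S2, t1, t2, q_i * M)
        S1, S2 = s >> 1, c >> 1                   # both halves are even, so this halves the sum
    return S1, S2
```

No carry ever propagates across the full word inside the loop, which is the property that makes the carry-save form fast in hardware.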
  • 8. Modular Exponentiation
    • Their Implementation of MM
  • 9. Modular Exponentiation
    • Implementing MM on our design
  • 10. Modular Exponentiation
    • Each of the 64-CSA blocks maps to a single basic block
    • Outputs of the last basic block are registered.
    • q_i is generated by a random-logic block at the second basic-block
      • Broadcast to all groups
    • A_i is generated in a similar manner, utilizing two more basic-blocks:
      • Also broadcast to all groups
  • 11. Modular Exponentiation
    • Efficient and scalable mapping to our design
      • 1024-bit RSA will need to use 16 groups, while
      • 2048-bit will use 32, and 4096-bits will use 64 groups
    • Primary concern: clock rate may be limited by the bit-broadcasts of q_i and A_i
      • Potential impediment to scalability
      • We are exploring methods for pipelining these broadcasts as well, to increase clock rate and scalability.
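The group counts quoted above follow from simple arithmetic. The sketch below assumes 64 operand bits per group, a figure inferred from "1024-bit needs 16 groups" rather than stated on the slide; `BITS_PER_GROUP` and `groups_needed` are illustrative names.

```python
BITS_PER_GROUP = 64  # inferred from 1024 bits -> 16 groups; not stated explicitly

def groups_needed(modulus_bits):
    """Number of groups required for a modulus of the given width."""
    return -(-modulus_bits // BITS_PER_GROUP)  # ceiling division

for bits in (1024, 2048, 4096):
    print(bits, "bits ->", groups_needed(bits), "groups")
```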
  • 12. Rijndael
    • Primary operations:
      • Sub-Bytes
      • Shift-Rows
      • Mix-Columns
      • Add-Round-Key
  • 13. Rijndael
    • Representation of Data: 128-bit state.
    [Figure: layout of the 128 bits of state; labels: 32-bits, 8-bits each, 128-bits]
  • 14. Rijndael
    • Add-Round-Key
        • Simple 128-bit XOR operation: uses 1 basic-block
    • Sub-Bytes:
        • Simple operation: byte-wise table lookup from S-Box
        • Each S-box is 2 kbits (256 entries x 8 bits).
        • 16 parallel S-boxes required!
        • No basic-blocks required, but ALL memory-blocks are used!
    • Shift-Rows
        • Simple operation: 4 x 32-bit permutations
        • Uses only 1 basic-block
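To make the operations above concrete, here is a small Python sketch of Add-Round-Key and Shift-Rows on a 16-byte state, plus the S-box memory arithmetic. The column-major byte order (`state[r + 4*c]`) follows the usual AES convention and is an assumption about how the state is laid out here; the function and constant names are illustrative.

```python
def add_round_key(state, round_key):
    """128-bit XOR, expressed bytewise on two 16-byte lists."""
    return [s ^ k for s, k in zip(state, round_key)]

def shift_rows(state):
    """Rotate row r of the 4x4 byte matrix left by r positions.
    This is a fixed permutation of the 16 bytes, with no logic involved."""
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

# Memory cost of Sub-Bytes with table lookups
SBOX_BITS = 256 * 8               # one S-box: 2048 bits (2 kbits)
TOTAL_SBOX_BITS = 16 * SBOX_BITS  # 16 parallel copies: 32768 bits per round
```

Applied to the bytes 0..15, `shift_rows` leaves row 0 in place and moves, for instance, byte 5 into position 1.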
  • 15. Rijndael
    • Mix-Columns
        • Somewhat complicated: can be implemented using table lookups, but we are out of memory!
        • Alternative implementation:
  • 16. Rijndael
    • Mix-Columns
      • Operation may be expressed in terms of “xtime()” function
      • Mix-columns implementation requires “xtime()” operation on each byte, followed by 4 XOR operations
  • 17. Rijndael
    • Mix-Columns
      • In order to efficiently implement “xtime()”, we modified it this way
      • In this form, only 2 basic-blocks are needed to apply “xtime()” to all 16 bytes
      • A single basic-block will take the 128-bit data as input, and generate the “xtime()” mask (0 0 0 0 x_7 x_7 0 x_7) for each of the 16 bytes at the permute unit.
      • Another basic-block will then perform the XOR operation first, followed by a left shift (substituting the LSB with x_7) at the permute unit.
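A Python sketch checking that the mask-then-shift form described above matches the textbook `xtime()`. It assumes the standard Rijndael reduction constant 0x1B and reads the mask as the 8-bit pattern (0 0 0 0 x_7 x_7 0 x_7), where x_7 is the byte's most significant bit; the function names are illustrative.

```python
def xtime_reference(b):
    """Textbook xtime: multiply by x in GF(2^8), reducing with 0x1B on overflow."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted

def xtime_mask_then_shift(b):
    """Modified form: XOR a mask first, then shift left and set the LSB to x7."""
    x7 = (b >> 7) & 1
    mask = 0b00001101 if x7 else 0          # (0 0 0 0 x7 x7 0 x7)
    return (((b ^ mask) << 1) & 0xFF) | x7  # shift; the vacated LSB takes x7

# The two forms agree on every byte value
assert all(xtime_reference(b) == xtime_mask_then_shift(b) for b in range(256))
```

The mask shifted left by one is 0x1A, and inserting x_7 as the LSB supplies the final bit of 0x1B, so the XOR and the shift can be split across two basic-blocks.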
  • 18. Rijndael
    • Mix-Columns
      • After generating output from the “xtime()” function, 4 x 128-bit XOR operations need to be performed
        • 4 basic-blocks will be used
      • Note that the mix-column operation is carried out in parallel on all 4 columns.
    [Figure labels: xtime masks for all bytes; XOR operation]
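A Python sketch of Mix-Columns expressed with `xtime()` and XORs only, as described above: each output byte is the XOR of two xtime results and three input bytes, that is, four XOR operations. The function names and the column-major state layout are assumptions made for illustration.

```python
def xtime(b):
    """Multiply by x in GF(2^8) with the Rijndael polynomial."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted

def mix_column(col):
    """Mix one 4-byte column: out[i] = 2*a[i] + 3*a[i+1] + a[i+2] + a[i+3] in GF(2^8)."""
    xt = [xtime(b) for b in col]
    return [
        xt[i] ^ xt[(i + 1) % 4] ^ col[(i + 1) % 4] ^ col[(i + 2) % 4] ^ col[(i + 3) % 4]
        for i in range(4)
    ]

def mix_columns(state):
    """Apply mix_column to each of the four columns of a 16-byte column-major state."""
    out = []
    for c in range(4):
        out.extend(mix_column(state[4 * c: 4 * c + 4]))
    return out
```

In hardware the four columns are independent, so the same XOR network can run on all of them in parallel.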
  • 19. Rijndael
    • Implementation summary
      • Only 8 basic-blocks required
        • 2 (1 each) for Add-Round-Key and Shift-Rows
        • 6 for Mix-Columns (2 for xtime(), 4 for XOR operations)
      • 16 memory-blocks required!
        • All memory blocks are used up in a single round!
      • Inefficient mapping, because Rijndael is memory-intensive
        • Only 10% of the logic is used, versus 100% of the memory.
  • 20. Rijndael
    • Potential Solutions
      • Add much more memory
        • At least 10 times more
        • Issues with memory placement
      • Consider memory-less implementations of Sub-Byte
        • Requires GF() constant multiplication and Inverse Affine Transforms
        • Currently under study as the more efficient and practical option.
  • 21. Serpent
    • Substitution-permutation cipher comprised of
      • Key Mixing,
      • S-Box Substitution, and
      • Linear Transformation.
    • S-boxes: 4 x 4 bit
      • 32 copies required each round
      • 16 x 4 x 32 = 2048 bits per round.
  • 22. Serpent
    • The Linear Transformation step consists of:
      • 8 fixed permute operations, and
      • 8 XOR operations
    • All operands are 32 bits wide
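For concreteness, a Python sketch of Serpent's linear transformation as given in the cipher's published specification, written so that the fixed rotates and shifts and the XORs can be counted. The function names are illustrative, and the inverse is included only to check the sketch against itself.

```python
MASK32 = 0xFFFFFFFF

def rotl(x, n):
    """Rotate a 32-bit word left by n (1 <= n <= 31)."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def rotr(x, n):
    return rotl(x, 32 - n)

def linear_transform(x0, x1, x2, x3):
    """Six rotates plus two shifts (fixed permutes), and eight XORs."""
    x0 = rotl(x0, 13)
    x2 = rotl(x2, 3)
    x1 ^= x0 ^ x2
    x3 ^= x2 ^ ((x0 << 3) & MASK32)
    x1 = rotl(x1, 1)
    x3 = rotl(x3, 7)
    x0 ^= x1 ^ x3
    x2 ^= x3 ^ ((x1 << 7) & MASK32)
    x0 = rotl(x0, 5)
    x2 = rotl(x2, 22)
    return x0, x1, x2, x3

def inverse_linear_transform(x0, x1, x2, x3):
    """Undo linear_transform by reversing each step."""
    x2 = rotr(x2, 22)
    x0 = rotr(x0, 5)
    x2 ^= x3 ^ ((x1 << 7) & MASK32)
    x0 ^= x1 ^ x3
    x3 = rotr(x3, 7)
    x1 = rotr(x1, 1)
    x3 ^= x2 ^ ((x0 << 3) & MASK32)
    x1 ^= x0 ^ x2
    x2 = rotr(x2, 3)
    x0 = rotr(x0, 13)
    return x0, x1, x2, x3
```

None of the shift or rotate amounts depend on data, so each one is a fixed wiring pattern of the kind a permute unit can provide.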
  • 23. Serpent
    • Serpent is an ideal match for our architecture:
      • 8 x 32-bit fixed shifts and rotates can be easily implemented by the permute units of 2 basic-blocks.
      • Additional 2 basic-blocks required to implement the 8 x 32-bit XOR operations.
      • 128-bit key mixing stage per round would require 1 more basic-block
    • Total of 5 basic-blocks and 2 kbits of memory required per round.
    • Each round perfectly fits in a single group of our architecture!
    • 16 rounds of Serpent’s total of 32 may be unrolled in our architecture
  • 24. Other Algorithms
    • DES
      • Implementation of a single round is trivial: a single group may implement multiple rounds!
    • Twofish
      • Complex structure, requires more time to define implementation on our architecture.
      • However, all its basic operations are directly supported.
    • RC6 and MARS
      • Involve complicated operations requiring special purpose logic:
        • Data-dependent rotations
        • Multiplication modulo 2^32
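To show what these two operations look like, here is a Python sketch based on the quadratic function and rotation used in RC6's published round structure. `rc6_mix` is a hypothetical helper that isolates one rotation from the round and omits the round-key addition; it is not a full RC6 round.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate left by n mod 32; the amount may come from data."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def rc6_quadratic(b):
    """f(b) = (b * (2b + 1) mod 2^32) rotated left by 5: needs a 32-bit multiplier."""
    return rotl32((b * (2 * b + 1)) & MASK32, 5)

def rc6_mix(a, b, d):
    """Hypothetical helper: rotate (a XOR t) by an amount taken from u,
    where t and u are derived from b and d. The rotate amount is data-dependent."""
    t = rc6_quadratic(b)
    u = rc6_quadratic(d)
    return rotl32(a ^ t, u)
```

A fixed rotate is just wiring, whereas `rotl32` with a variable amount needs a barrel shifter, and `rc6_quadratic` needs a multiplier; neither is among the operations a fixed permute unit or an XOR array provides.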
  • 25. Other Algorithms
    • RC6 and MARS
      • This special-purpose logic was not incorporated because:
        • The algorithms are better suited to software than to hardware implementation
        • Lack of support and popularity of these algorithms
        • Adding special-purpose logic would incur overhead beyond its own area, as additional supporting interconnect must be provided.
  • 26. Comparison with Related Work
    • Although we cannot provide results based on empirical evaluation, we can present a logical framework for comparison of individual features
    • Through deductive reasoning, we identify what possible advantages one approach may have over the other, assuming all other factors normalized.
  • 27. Comparison with Related Work
    • Comparison with FPGA based implementations
      • Area Efficiency
        • Use of basic gates instead of LUTs
        • Basic-blocks with limited flexibility, thus fewer configuration bits
        • Basic units (full adders) combined into clusters of 64, and programmed as a single entity – further savings in configuration memory elements
      • Performance
        • Use of basic gates instead of LUTs
        • Simpler Interconnect, with fewer routing-switches
        • Hierarchical organization – no long wires (except for bit-broadcast)
        • Far smaller configuration data required – faster reconfiguration time
  • 28. Comparison with Related Work
    • Comparison with FPGA based implementations
      • Potential pitfalls
        • The design dedicates a considerable amount of area to inter-block interconnect.
        • Until actual area can be quantified, we are unsure of area efficiency estimates.
        • Need to identify most suitable Performance/Area tradeoff.
  • 29. Comparison with Related Work
    • Comparison with COBRA Architecture
      • Uses multiple copies of special-purpose logic blocks, coupled with extremely simple interconnect.
  • 30. Comparison with Related Work
    • Comparison with COBRA Architecture
      • Low logic-utilization – we have more generic blocks,
      • Fixed latency operations
      • Intermediate values registered only at RCE boundary.
  • 31. Programming Methodology
    • Reconfigurable computing (RC) devices suffer from the following two critical issues:
      • Lack of a comprehensive programming model
      • Lack of hardware virtualization
    • The first issue reflects the difficulty of programming RC architectures such as FPGAs
    • The second issue concerns the exposure of hardware resource limitations to the programmer.
  • 32. Programming Methodology
    • How COBRA deals with these issues
      • Essentially a special-purpose programmable architecture rather than a configurable one
      • VLIW like instructions alleviate some of the programming model related issues
      • Also resolve the virtualization aspect.
  • 33. Programming Methodology
    • The programming methodology and the impact of the issues mentioned can be seen in terms of a spectrum:
    [Figure: programming-methodology spectrum, with labels COBRA [3], Microprocessor, Our Approach, FPGAs]
  • 34. Programming Methodology
    • Programming model issue less severe for us because:
      • Simple, highly specialized architecture
    • Hardware Virtualization is still a concern.
  • 35. Programming Methodology
    • Programming model:
      • Provide basic primitives that are supported by our architecture.
      • Programming is to be accomplished by expressing an algorithm using these primitives and interconnecting these primitives together using 32-bit interconnect.
      • Mapping such a description onto our design should be a trivial software challenge.
      • Because the architecture is special-purpose, the primitives are limited in number and programming should therefore be an easy task.
  • 36. Programming Methodology
    • Programming Primitives:
      • 32-bit Carry Save Adder
      • 32-bit XOR
      • 32-bit AND
      • 32-bit OR
      • 32, 64, and 128-bit Ripple Carry Adder
      • 32, 64, and 128-bit Fixed Shifts
      • 32-bit Rotates and random permutes.
      • 64-bit, 128-bit limited permutes (TBD).
      • ANDing 32-bit value with a single bit
      • 128-bit shift-register
      • Random bit-logic implementation, since each block is also capable of implementing:
        • a single 4-input function
        • two 3-input functions
        • four 2-input functions
      • 4 global bit-broadcast lines
      • 32-bit interconnect, point to point.
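As a rough illustration of what "expressing an algorithm using these primitives" could look like, the Python sketch below models two of the listed primitives as functions on 32-bit words and composes a 128-bit key-mixing step from four 32-bit XORs. The function names and the word-splitting helpers are hypothetical; the slides do not define a concrete description language.

```python
MASK32 = 0xFFFFFFFF

def xor32(a, b):
    """Model of the 32-bit XOR primitive."""
    return (a ^ b) & MASK32

def and_bit32(word, bit):
    """Model of 'ANDing a 32-bit value with a single bit' (e.g. a broadcast bit)."""
    return word & MASK32 if bit else 0

def split128(value):
    """Hypothetical helper: view a 128-bit value as four 32-bit words, low word first."""
    return [(value >> (32 * i)) & MASK32 for i in range(4)]

def join128(words):
    """Hypothetical helper: reassemble four 32-bit words into one 128-bit value."""
    return sum(w << (32 * i) for i, w in enumerate(words))

def key_mix128(state, key):
    """128-bit key mixing built from four independent 32-bit XOR primitives."""
    return join128([xor32(s, k) for s, k in zip(split128(state), split128(key))])
```

Here each call to `xor32` stands for one primitive instance, and the lists passed between helpers stand for 32-bit point-to-point connections.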
  • 37. Conclusion: Work in Progress
    • The following areas of the design are still under consideration and not yet completely defined:
      • Configurable memory-block architecture
      • VLSI design to evaluate performance metrics and fine-tune the logical design
        • i.e. if found to be too slow, reduce the number of switches, use longer wires, minimize the amount of interconnect to that which is necessary, etc.
      • Furthermore, the iterative process of evaluating more symmetric-key algorithms and refining the architecture is still in progress.
