A. Durga Bhavani, P. Soundarya Mala / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 2, Issue 5, September-October 2012, pp. 412-419

Optimized Elliptic Curve Cryptography

A. Durga Bhavani*, P. Soundarya Mala**
Dept. of Electronics and Communication Engg., GIET College, Rajahmundry.

Abstract— This paper details the design of a new high-speed pipelined application-specific instruction set processor (ASIP) for elliptic curve cryptography (ECC) technology. Different levels of pipelining were applied to the data path to explore the resulting performances and find an optimal pipeline depth. Three complex instructions were used to reduce the latency by reducing the overall number of instructions, and a new combined algorithm was developed to perform point doubling and point addition using the application-specific instructions. In the present work, the point multiplication speed is optimized using pipelining techniques, implementing the design on a modern Xilinx Virtex-4 and comparing it with the Xilinx Virtex-E.

Key Terms— Complex instruction set, efficient hardware implementation, elliptic curve cryptography (ECC), pipelining.

I. INTRODUCTION
First introduced in the 1980s, elliptic curve cryptography (ECC) has become popular due to its superior strength-per-bit compared to existing public key algorithms. This superiority translates to equivalent security levels with smaller keys, bandwidth savings, and faster implementations, making ECC very appealing. The IEEE proposed standard P1363-2000 recognizes ECC-based key agreement and digital signature algorithms, and a list of secure curves is given by the U.S. Government National Institute of Standards and Technology (NIST).

Intuitively, there are numerous advantages of using field-programmable gate-array (FPGA) technology to implement in hardware the computationally intensive operations needed for ECC. Indeed, these advantages are comprehensively studied and listed by Wollinger et al. In particular, performance, cost efficiency, and the ability to easily update the cryptographic algorithm in fielded devices are very attractive for hardware implementations of ECC. Numerous ECC hardware accelerators and cryptographic processors have been presented in the literature. More recently, these have included a number of FPGA architectures, which present acceleration techniques to improve the performance of the ECC operations.

The optimization goal is usually to reduce the latency of a point multiplication in terms of the number of required cycles, in particular by works which duplicate arithmetic blocks to exploit the parallelism in the underlying operations. Yet for most of these implementations, efforts are concentrated on algorithm optimization or improved arithmetic architectures and rarely on a processor architecture particularly suited for ECC point multiplication. Some of the design techniques used in modern high-performance processors were incorporated into the design of an application-specific instruction set processor (ASIP) for ECC. Pipelining was applied to the design, giving improved clock frequencies. Data forwarding and instruction reordering were incorporated to exploit the inherent parallelism of the Lopez and Dahab point multiplication algorithm, reducing pipeline stalls.

In this paper, thorough treatment is given to the design of an ASIP for ECC, yielding a new combined algorithm to perform point doubling and point addition based on the instruction set developed, further reducing the required number of instructions per iteration. The same data path is used, but the performance is explored for different levels of pipelining, and a superior choice of pipeline depth is found. The resulting processor has high clock frequencies and low latency, and it has only a single instance of each of the arithmetic units. An FPGA implementation over GF(2^163) is presented, which is by far the fastest implementation reported in the literature to date.

In Section II, a background is given in terms of elliptic curve operations, Galois field arithmetic, and state-of-the-art hardware implementations. Section III develops the ASIP architecture, beginning with the application-specific instructions and a new combined algorithm to perform point doubling and point addition (the crux ECC operations) based on the new instructions. In Section IV, the FPGA implementation of the processor is described, and the performance is analyzed and compared to the state of the art in Section V. The paper is concluded in Section VI.

II. BACKGROUND
A. ECC
ECC is performed over one of two underlying Galois fields: prime-order fields or characteristic-two fields. Both fields are considered to provide the same level of security, but arithmetic in GF(2^m) will be the focus of this paper because it can be implemented in hardware more efficiently using modulo-2 arithmetic.
An elliptic curve over the field is the set of solutions to the equation

    y^2 + xy = x^3 + ax^2 + b    (1)

where a, b ∈ GF(2^m) and b ≠ 0.

When the two points being added are equal we have the point-doubling operation (DBL), and when they are distinct we have the point-adding operation (ADD). These operations in turn constitute the crux of any ECC-based algorithm, known as point multiplication or scalar multiplication. Due to the computational expense of inversion compared to multiplication, several projective-coordinate methods have been proposed, which use fractional field arithmetic to defer the inversion operation until the end of the point multiplication. In this paper, the high-performance, generic projective-coordinate algorithm proposed by Lopez and Dahab is used, which is an efficient implementation of Montgomery's method for computing kP. No precomputations or special field/curve properties are required.

In this method, procedures to perform DBL and ADD are derived from efficient formulas which use only the x-coordinate of the points. In the projective-coordinate version of the formulas, the x-coordinate of each point is represented by Xi/Zi, for i ∈ {1, 2, 3}. The corresponding DBL and ADD computations are used in the projective-coordinate point multiplication algorithm shown in Algorithm 1.
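Algorithm 1 itself appears only as a figure in the source. As an illustration of the Lopez-Dahab approach it refers to, the following Python sketch performs Montgomery-ladder point multiplication on the x-coordinate using the standard published Lopez-Dahab projective formulas. It is not necessarily the exact formulation of Algorithm 1; the field helpers, function names, and the use of the reduction polynomial G(x) = x^163 + x^7 + x^6 + x^3 + 1 (quoted later in Section IV) are our own choices for a self-contained software model.

# Hedged sketch: Montgomery-ladder point multiplication in the Lopez-Dahab
# projective style over GF(2^163). These are minimal software models of the
# field operations, not the paper's hardware data path.

M = 163
G = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # x^163 + x^7 + x^6 + x^3 + 1

def f_reduce(r):
    # reduce a polynomial of degree up to 2(M-1) modulo G(x)
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= G << (i - M)
    return r

def f_mul(a, b):
    # carry-free (XOR-based) polynomial multiplication followed by reduction
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return f_reduce(r)

def f_sqr(a):
    return f_mul(a, a)

def f_inv(a):
    # Fermat's little theorem: a^(2^m - 2) = a^(-1); adequate for a software model
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = f_mul(r, a)
        a = f_sqr(a)
        e >>= 1
    return r

def madd(X1, Z1, X2, Z2, x):
    # Lopez-Dahab mixed addition: x-coordinate of P1 + P2, where P2 - P1 has affine x
    T1, T2 = f_mul(X1, Z2), f_mul(X2, Z1)
    Z3 = f_sqr(T1 ^ T2)
    X3 = f_mul(x, Z3) ^ f_mul(T1, T2)
    return X3, Z3

def mdouble(X, Z, b):
    # Lopez-Dahab doubling: X' = X^4 + b*Z^4, Z' = X^2 * Z^2
    X2, Z2 = f_sqr(X), f_sqr(Z)
    return f_sqr(X2) ^ f_mul(b, f_sqr(Z2)), f_mul(X2, Z2)

def point_mul_x(k, x, b):
    # Montgomery ladder on the x-coordinate only; returns the affine x of kP.
    # Edge cases such as k = 0 or the point at infinity are ignored in this sketch.
    X1, Z1 = x, 1                      # P
    X2, Z2 = mdouble(x, 1, b)          # 2P
    for i in reversed(range(k.bit_length() - 1)):
        if (k >> i) & 1:
            X1, Z1 = madd(X1, Z1, X2, Z2, x)
            X2, Z2 = mdouble(X2, Z2, b)
        else:
            X2, Z2 = madd(X1, Z1, X2, Z2, x)
            X1, Z1 = mdouble(X1, Z1, b)
    return f_mul(X1, f_inv(Z1))        # the single inversion is deferred to the very end

The sketch mirrors the property the paper relies on: only one field inversion is needed, at the end of the ladder, which is why the efficiency of that inversion (Section II-B) matters so much.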
B. Arithmetic Over GF(2^163) for ECC
Addition and subtraction in the Galois field GF(2^m) are equivalent, performed by modulo-2 addition, i.e., a bit-wise XOR operation. As a result, arithmetic in the field is implemented more efficiently because it is carry free.

Inversion is the most computationally expensive operation in the field, based either on the extended Euclidean algorithm (EEA) or Fermat's little theorem. Many efficient EEA-based inversion and division architectures exist in the literature, but they are usually expensive in terms of area. Inversion using Fermat's little theorem, as in the Itoh-Tsujii algorithm, consists of multiplication and squaring only, so it can often be implemented without additional hardware resources. It has been shown that similar performance to an EEA-based inverter can be achieved using the Itoh-Tsujii algorithm if the multiplier latency is sufficiently low and repeated squaring can be performed efficiently. Hence, multiplication is considered to be the most resource-sensitive operation in the field.

Algorithm 1: Lopez-Dahab Point Multiplication.

Multipliers are usually implemented using one of three approaches: bit-serial, bit-parallel, or digit-serial. Bit-serial multipliers are small, but they require m steps to perform a multiplication. Bit-parallel multipliers are large but perform a multiplication in a single step. Digit-serial multipliers are most commonly used in cryptographic hardware, as performance can be traded against area. Recently, in a drive for increased speed, some methods have been proposed, and implementations presented, for ECC hardware based on bit-parallel multipliers.

Squaring is a special case of multiplication, and it is only a little more complex than reduction modulo the irreducible polynomial. Efficient finite-field multiplier architectures exist to perform polynomial-basis squaring, which only requires combinational logic.

C. Previous Work
Several recent FPGA-based hardware implementations of ECC have achieved high-performance throughput. Various acceleration techniques have been used, usually based on parallelism or precomputation.

The work introduced by Orlando and Paar [2] is based on the Montgomery method for computing kP developed by Lopez and Dahab and operates over a single field. A point multiplication over GF(2^167) is performed in 210 microseconds, using a Galois field multiplier with an eleven-cycle latency. This provides an excellent benchmark for all high-performance architectures.
Acceleration techniques based on precomputation can be particularly effective for a class of curves known as anomalous binary curves (ABCs) or Koblitz curves. The implementation from Lutz and Hasan implements some of these techniques, achieving a point-multiplication time of 75 µs using a Galois field multiplier with a four-cycle latency. However, efforts here will be concentrated on high-performance architectures for generic curves, where such acceleration techniques cannot be used. An important result, relevant regardless of the point-multiplication algorithm used, is that efficiently performing repeated squaring operations greatly reduces the cost of multiplicative inversion, which is required at the end of the point multiplication.

An ECC processor capable of operating over multiple Galois fields was presented by Gura et al., which performs a point multiplication over GF(2^163) in 143 µs, using a Galois field multiplier with a three-cycle latency. Gura et al. stated two important conclusions: the efficient implementation of inversion has a significant impact on overall performance, and, as the latency of multiplication is improved, system tasks such as reading and writing have a significant impact on performance. The work introduced by Jarvinen et al. uses two bit-parallel multipliers to perform multiplications concurrently. The multipliers have several registers in the critical path in order to operate at high clock frequencies, but the operations are not pipelined, resulting in long latencies (between 8 and 20 cycles). Hence, while this implementation offers high performance, it is not the fastest reported in the literature, but it is one of the largest. Rodriguez et al. introduced an FPGA implementation that performs DBL and ADD in parallel, containing multiple instances of circuits to perform the arithmetic functions. It would appear that the inversion required for coordinate conversion at the end of the point multiplication is not performed, and that the quoted point multiplication time does not include the coordinate conversion; even so, the stated point multiplication time is one of the fastest in the literature. However, the complex structure of the multiplier has a long critical path, and as a result the overall performance is let down by quite a low clock frequency (46.5 MHz).

More recently, Cheung et al. presented a hardware design that uses a normal basis representation. The customizable hardware offers a trade between cost and performance by varying the level of parallelism through the number of multipliers and the level of pipelining. To the authors' knowledge, this is the fastest implementation in the literature, performing a point multiplication in approximately 55 µs, although once again the low clock frequency (43 MHz) limits the potential performance.

III. PIPELINED ASIP FOR ECC
A. Application-Specific Instruction Set and Algorithms
Complex instruction set computers (CISCs) reduce the number of instructions, and consequently the overall latency, by performing multiple tasks in a single instruction. It was shown that complex instructions could be used to reduce latency in the design of an ASIP for ECC. Three new instructions were introduced, which will now be described and their use justified.

Considering the projective-coordinate formula for point addition, an obvious instruction to combine operations is a multiply-and-accumulate instruction (MULAD), which will save two instruction executions. Also, rewriting the projective-coordinate formula for point doubling shows that a multiply-and-square operation (MULSQ) would be beneficial, saving three instruction executions. These two extra instructions reduce the number of instruction executions required to perform the point-adding and point-doubling algorithms from 14 to 9, a 35.7% reduction.

The Itoh-Tsujii inversion algorithm is based on Fermat's little theorem and is used to compute the inversion at the end of the point multiplication. The algorithm uses addition chains to reduce the required number of multiplications, but it contains exponentiations that must be performed through repeated squaring operations (as many as 64 repeated squaring operations for the field GF(2^163)). It was shown that relatively low latencies can be achieved using this algorithm if repeated squaring can be performed efficiently. Hence, a third application-specific instruction will be used to perform repeated squaring (RESQR) in order to accelerate the Itoh-Tsujii inversion algorithm.
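To make the role of repeated squaring concrete, the following is a minimal Python sketch of Itoh-Tsujii inversion over GF(2^163): the inverse a^(-1) = a^(2^163 - 2) is built from quantities beta_k = a^(2^k - 1) combined along an addition chain for m - 1 = 162, so only nine field multiplications plus runs of repeated squarings are needed. The field helpers, the function names, and the particular chain are illustrative choices made here; the paper's own chain evidently differs, since it quotes runs of up to 64 squarings.

# Hedged sketch: Itoh-Tsujii inversion over GF(2^163) via Fermat's little theorem.
# The repeated_sqr() loop is the operation the RESQR instruction accelerates in
# hardware; the chain and helper names below are illustrative, not the paper's.

M = 163
G = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1   # NIST reduction polynomial

def f_reduce(r):
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= G << (i - M)
    return r

def f_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return f_reduce(r)

def f_sqr(a):
    return f_mul(a, a)

def repeated_sqr(a, n):
    for _ in range(n):
        a = f_sqr(a)
    return a

def itoh_tsujii_inverse(a):
    # beta[k] holds a^(2^k - 1); beta[i + j] = (beta[i])^(2^j) * beta[j]
    chain = [(1, 1), (2, 2), (4, 4), (8, 1), (9, 9),
             (18, 18), (36, 36), (72, 9), (81, 81)]
    beta = {1: a}
    for i, j in chain:
        beta[i + j] = f_mul(repeated_sqr(beta[i], j), beta[j])
    return f_sqr(beta[162])            # (a^(2^162 - 1))^2 = a^(2^163 - 2) = a^(-1)

if __name__ == "__main__":
    x = 0x1234567
    assert f_mul(x, itoh_tsujii_inverse(x)) == 1   # quick self-check

Counting the work in the sketch shows why the instruction is worthwhile: the multiplications are few (nine here), while the squarings number m - 1 = 162 in total and occur in long back-to-back runs that a RESQR-style instruction can collapse.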
Using these instructions, a combined algorithm to perform point doubling and point addition can be developed, which has only nine arithmetic instructions, shown in Algorithm 2.

B. Pipelined Data Path
A data path capable of performing the new application-specific instructions is required. Any of the approaches to implementing multiplication mentioned in the background section could be used, but because this work develops high-throughput hardware, a bit-parallel multiplier is used. The data path must be constructed such that the result of a multiplication can be squared or added to the previous result. Furthermore, feedback is required so that the result can be repeatedly squared. This functionality is implemented using the SQR/ADD block, described in Section III-D.
To maintain high-throughput performance, high clock frequencies are required, and one instruction per cycle or more is desirable. Therefore, the data path must be pipelined. The pipelined data path is shown in Fig. 1. The performance-critical component is the multiplier, so to achieve high clock frequencies, we must sub-pipeline the multiplier.

Fig. 1. Pipelined ASIP data path.

C. Subpipelined Bit-Parallel Multiplier
Some recent hardware implementations of elliptic-curve accelerators, in addition to some proposed architectures, have used bit-parallel multipliers. Numerous architectures for bit-parallel multiplication exist in the literature. However, very few actual implementations of bit-parallel multiplication have been presented for the large fields used in public-key cryptography. Where they have been implemented, poor performance has often been reported either in terms of clock frequency or latency. This is not surprising given the deep sub-micrometer fabrics used in modern FPGA devices. The major component of delay is due to routing, and the number of interconnections in a bit-parallel multiplier implemented for ECC will be in the region of many tens of thousands or more, making routing very complex. Therefore, to improve routing complexity and to reduce the amount of logic in the critical path, a bit-parallel multiplier that is amenable to pipelining is desirable.

The well-known Mastrovito multiplier is such an architecture; the reader is referred to the original work for full details. The multiplication C(x) = A(x)B(x) mod G(x) is expressed in matrix notation as c = Zb, where c and b are the coefficient vectors of C(x) and B(x). The Z matrix is referred to as the product matrix, and its entries f_{i,j} are linear functions of the coefficients of A(x). The columns of Z represent the consecutive states of a Galois-type LFSR with the initial state given by the multiplicand A(x), i.e., the jth column is given by x^j A(x) mod G(x). Therefore, the product matrix can be realized by cascading instances of the logic that performs a left shift modulo G(x). The circuit that performs this modulo shift is referred to as an alpha cell. The same notation will be used here. The product matrix can be divided into a number of smaller submatrices that calculate partial products, which give the final product when summed together.

Fig. 2. Pipeline stage of the subpipelined bit-parallel multiplier.

More formally, the product matrix can be divided into D submatrices, each containing at most d = ⌈(m-1)/D⌉ columns. Hence, the multiplier can be easily pipelined into D stages, each computing the sum of at most d+1 terms, which are the outputs of alpha cells and the input from the previous multiplication pipeline stage. As a result, the logic in the critical path is reduced and the routing is simplified. Note that the number of logic gates in the critical path of each stage will be similar to that of a digit-serial multiplier with an equivalent digit size, but the routing is simpler as no feedback is required.
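The partitioning can be modelled functionally in a few lines of Python: each of the D stages accumulates the contributions b_j · (x^j A(x) mod G(x)) for its own block of columns, with the alpha cell modelled as a one-bit left shift modulo G(x). This is a behavioural sketch of the arithmetic only, using our own names and the reduction polynomial quoted later in Section IV; it is not the paper's RTL.

# Hedged sketch: functional model of splitting the Mastrovito product matrix into
# D pipeline stages of at most d = ceil((m-1)/D) columns each.

M = 163
G = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def alpha_cell(a):
    # one left shift modulo G(x): the combinational cell cascaded in hardware
    a <<= 1
    if (a >> M) & 1:
        a ^= G
    return a

def pipelined_multiply(a, b, D=7):
    # multiply A(x)*B(x) mod G(x) as D sequential "pipeline stages"
    d = -(-(M - 1) // D)              # ceil((m-1)/D) columns per stage
    col = a                           # column 0 of the product matrix is A(x) itself
    acc = a if (b & 1) else 0         # contribution of b_0
    j = 1
    for _stage in range(D):
        # each stage sums at most d alpha-cell outputs plus the running accumulator
        for _ in range(d):
            if j >= M:
                break
            col = alpha_cell(col)     # x^j * A(x) mod G(x)
            if (b >> j) & 1:
                acc ^= col
            j += 1
    return acc

def reference_multiply(a, b):
    # straightforward shift-and-reduce multiplication for comparison
    r = 0
    for j in range(M):
        if (b >> j) & 1:
            r ^= a
        a = alpha_cell(a)
    return r

if __name__ == "__main__":
    import random
    for _ in range(100):
        a, b = random.getrandbits(M), random.getrandbits(M)
        assert pipelined_multiply(a, b) == reference_multiply(a, b)

The stage loop only illustrates how the columns are grouped; in hardware all D stages operate concurrently on successive operands, which is what yields the throughput of one multiplication per cycle once the pipeline is full.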
The resulting architecture of the ith pipeline stage, 0 ≤ i ≤ D-1, is shown in Fig. 2. When i = 0, the alpha input is A(x) and the sum input is A(x)·b0; when i = D-1, the sum output is the multiplication result and the alpha output is unconnected.

D. SQR/ADD Block
The extra functionality required for the new instructions MULAD, MULSQ, and RESQR is provided by the SQR/ADD block, which is appended to the multiplier output as shown in Fig. 1. A block diagram showing the functionality of the SQR/ADD block is shown in Fig. 3; note that the flip-flops shown in Fig. 3 are the pipeline flip-flops placed after the SQR/ADD block in the data path diagram shown in Fig. 1.

Selection 0 on the MUX shown in Fig. 3 sets the data path to perform standard multiplication over GF(2^m). Selection 1 sets the data path to perform MULSQ. Selection 2 sets the data path to perform MULAD. Selection 3 sets the data path to perform RESQR; note that RESQR can be performed after any other operation, e.g., MULAD-RESQR is a valid instruction sequence.

Fig. 3. SQR/ADD block.
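The following Python fragment is a small functional model of these four selections under one natural reading of the block: the multiplier output can be passed through, squared, or accumulated with the previously registered result, and the feedback path allows the previous result to be squared repeatedly. The select encoding follows the description above, but the signal names and the exact interpretation of MULSQ and MULAD are our assumptions, not the paper's RTL.

# Hedged sketch: behavioural model of the SQR/ADD block appended to the multiplier.
# 'mult_out' is the bit-parallel multiplier result for the current instruction and
# 'prev_out' is the registered output of this block from the previous cycle
# (the feedback path that enables repeated squaring).

M = 163
G = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def f_reduce(r):
    for i in range(2 * M - 2, M - 1, -1):
        if (r >> i) & 1:
            r ^= G << (i - M)
    return r

def f_sqr(a):
    # polynomial-basis squaring: spread the bits, then reduce (combinational in hardware)
    r = 0
    for i in range(M):
        if (a >> i) & 1:
            r |= 1 << (2 * i)
    return f_reduce(r)

def sqr_add_block(sel, mult_out, prev_out):
    if sel == 0:                      # MUL: pass the multiplier result through
        return mult_out
    if sel == 1:                      # MULSQ: square the multiplier result
        return f_sqr(mult_out)
    if sel == 2:                      # MULAD: accumulate with the previous result
        return mult_out ^ prev_out
    if sel == 3:                      # RESQR: square the fed-back previous result
        return f_sqr(prev_out)
    raise ValueError("sel must be 0..3")

# e.g. a run of RESQR cycles implements the repeated squaring used by Itoh-Tsujii:
#   out = prev_out
#   for _ in range(n): out = sqr_add_block(3, 0, out)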
E. Data Forwarding
The optimal depth of a pipeline is a trade between clock frequency and instruction throughput. Ideally, one instruction per cycle (or more) will be performed, while sufficient pipelining to achieve the desired clock frequency is implemented. However, as the depth of the pipeline is increased, data dependencies can lead to pipeline stalls and even pipeline flushes, resulting in less than the ideal one instruction per cycle being performed.

Data forwarding is a common technique in processor design, used to avoid pipeline stalls that occur due to data dependencies. Typically, data is forwarded to either input of the arithmetic logic unit (ALU) from all subsequent pipeline stages. The scheme used here differs slightly because such generality is not necessary: the order of the instructions and the instances when forwarding is required are known and do not change. Therefore, the control signals may be explicitly defined for each instruction, and, as can be seen in Fig. 3, the data path can be simplified because data forwarding to both ALU inputs from all subsequent stages is not required.

IV. FPGA IMPLEMENTATION
The processor was implemented over the smallest field recommended by NIST [8], GF(2^163), using the given irreducible polynomial G(x) = x^163 + x^7 + x^6 + x^3 + 1. Two FPGA devices were used for implementation: the older Virtex-E device (XCV2600E-FG1156-8), for fair comparison with the other architectures presented in the literature, and the more modern Virtex-4 (XC4VLX200-FF1513-11), to demonstrate the performance of the proposed ASIP with modern technology. Each component was implemented and optimized individually, particularly the multiplier, to examine the optimal pipeline depth before the complete processor was implemented. Aggressive optimization for speed was performed, including timing-driven packing and placement. No detailed floor planning was performed; only a few simple constraints were applied.

A. Register File
Xilinx FPGA devices support three main types of storage element: flip-flops, block RAM, and distributed RAM. Block RAM is the dedicated memory resource of FPGA devices, which has different sizes and locations depending on the device being used.

TABLE I. DECOMPOSITION OF POINT MULTIPLICATION TIME IN CYCLES

Distributed RAM uses the basic logic elements of the FPGA device, lookup tables (LUTs), to form RAMs of the desired size, function, and location, making it far more flexible. On Virtex-E devices, block RAMs are 4096 bits each, which increases to 18 kbits on the Virtex-4. So the use of block RAM to store relatively small amounts of data is inefficient. The processor presented in this paper requires only 13 storage locations, including all temporary storage, so distributed RAM was used.

On Xilinx FPGA devices, 16 bits of storage can be gained from a single LUT. However, as reading from two ports and writing to a third port is required independently and concurrently, a single dual-port distributed RAM is not sufficient. Therefore, the register file utilized in this design is formed by cross-connecting two dual-port distributed RAMs. This approach is still far more efficient than using block RAMs or flip-flops.
B. Subpipelined Multiplier
The multiplier dominates the processor, both in terms of area (accounting for over 95% of the total) and clock frequency, as the critical path of the other components is considerably shorter than that of the multiplier. The area of a bit-parallel multiplier is O(m^2) and the critical path is O(log(m)). However, when implemented on FPGA, these measures have less significance. Implementing a bit-parallel multiplier without a pipelining scheme will lead to low clock frequencies, due to the amount of logic in the critical path and the complex routing.

The Mastrovito multiplier architecture was chosen not only because it was suitable for pipelining, but also because its regularity simplifies placement and, consequently, routing. Given the deep sub-micrometer fabrics used by FPGAs, routing is the largest component of delay, and, as shown in Section III-C, the Mastrovito multiplier can be divided into very regular blocks to improve routing. When pipelined, the implemented architecture has a critical path similar to a digit-serial multiplier, but the routing is simplified because no feedback is required.

The multiplier was implemented with several different depths of pipeline; the results are shown in Table II. When timing-driven FPGA placement is performed, the placer may be forced to heavily replicate certain areas of the circuit to meet timing goals, particularly if certain nets have high fan-outs. This can be seen in the implementation results of the multiplier, where the multiplier with three pipeline stages requires more resources than its four-stage equivalent because heavy replication was performed to meet timing goals. In terms of area-time performance, it is clear that the four-stage multiplier offers superior performance, which is an important result when the overall performance of the processor is considered.

C. SQR/ADD Block
By implementing the functionality as a single Boolean function, improved logic optimization can be achieved because the hierarchy will be flattened, leading to resource sharing and improved logic minimization.

D. Control
Typically, a general-purpose processor with a pipelined data path would use an instruction ROM for control. However, for applications such as this one, where the mode of operation is fixed and the control is relatively simple, a state machine is far more efficient, particularly in terms of area, and thus is the choice for this design.

A Moore state machine is most suitable for the proposed architecture, as any extra area required is negligible compared to the overall area of the processor, and the improved clock frequency compared to a Mealy state machine is desirable. The key is stored in a special-purpose register, independent of the data path, and loaded through a separate channel. Thus, the state machine has no dependency on the data path, and the control signals can be registered to achieve the desired delay performance. The total resource usage for control is only 93 slices. However, if extra flexibility is required, an instruction ROM can be used with no performance penalty in terms of speed, though more area resources would be required.
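As an illustration of the Moore-versus-Mealy choice, the sketch below shows a Moore-style controller skeleton in Python: the (registered) outputs are a function of the current state alone, which is what allows the control signals to be registered cleanly. The states, output signals, and sequencing are invented placeholders, not the paper's actual controller.

# Hedged sketch: skeleton of a Moore-style controller. Outputs depend only on the
# current state, unlike a Mealy machine whose outputs also depend on the inputs.

MOORE_OUTPUTS = {
    # state        : (sqr_add_mux_sel, register_write_enable)   -- placeholder signals
    "IDLE"         : (0, False),
    "ISSUE_MUL"    : (0, True),
    "ISSUE_MULSQ"  : (1, True),
    "ISSUE_MULAD"  : (2, True),
    "ISSUE_RESQR"  : (3, True),
}

ORDER = ["IDLE", "ISSUE_MUL", "ISSUE_MULSQ", "ISSUE_MULAD", "ISSUE_RESQR"]

def next_state(state, start):
    # fixed sequencing, reflecting that the mode of operation never changes
    if state == "IDLE":
        return "ISSUE_MUL" if start else "IDLE"
    return ORDER[(ORDER.index(state) + 1) % len(ORDER)]   # wrap back to IDLE at the end

def step(state, start):
    outputs = MOORE_OUTPUTS[state]        # Moore property: outputs from state alone
    return next_state(state, start), outputs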
V. PERFORMANCE ANALYSIS
Let L be the depth of the pipeline; then the number of cycles required to perform a point multiplication (excluding data I/O and coordinate conversion) is given by

    latency (cycles) = 2L + 3

The latency in cycles of the point multiplication is decomposed in Table I. To ascertain the optimal pipeline depth for the processor, the bit-parallel multiplier was implemented with different pipeline depths. The results of these implementations (see Table II) and the corresponding number of cycles to perform a point multiplication (see Table I) can be used to estimate the point multiplication times for each pipeline depth, as the critical path of the processor is in the multiplier.

Fig. 4 plots throughput (point multiplications per second) against area (FPGA slices) for each of the multiplier variations implemented. Note that the area figure shown is the area of the multiplier, not that of the complete processor, but the multiplier accounts for over 95% of the total area.
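The estimation step described above, combining a cycle count with the post-place-and-route clock frequency of each multiplier variant, amounts to a few lines of arithmetic. In the sketch below the pipeline depths, frequencies, and cycle counts are placeholder values for illustration only, not the figures of Table I or Table II.

# Hedged sketch: estimating point-multiplication time and throughput per pipeline depth.
# All numeric values below are made-up placeholders; only the method is illustrated.

def estimate(fmax_mhz, cycles_per_pointmul):
    results = {}
    for depth, f_mhz in fmax_mhz.items():
        time_us = cycles_per_pointmul[depth] / f_mhz    # cycles / (cycles per microsecond)
        results[depth] = (time_us, 1e6 / time_us)       # (time in us, point mults per second)
    return results

if __name__ == "__main__":
    placeholder_fmax = {3: 120.0, 4: 150.0, 7: 200.0}   # MHz, hypothetical
    placeholder_cycles = {3: 1900, 4: 2000, 7: 2300}    # cycles, hypothetical
    for depth, (t, thr) in estimate(placeholder_fmax, placeholder_cycles).items():
        print(f"depth {depth}: {t:.2f} us per point multiplication, {thr:.0f}/s")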
It is clear that the relationship between pipeline depth and area-throughput performance is approximately linear (as we would expect from the previous equation), but the seven-stage pipeline offers superior performance. Therefore, this was the architecture implemented on both the Virtex-E and Virtex-4 devices, the results of which are detailed in Table II and summarized in Table III for comparison with the state of the art. It appears that none of the architectures in the comparison were floor planned, so for a fairer comparison the implementation of the proposed architecture was not floor planned either. It is important to note that, FPGA implementations being target specific, a detailed comparison of resource usage is not always straightforward, in particular when it is not clear whether the quoted figures are actual post-place-and-route implementation results, as in our case, or merely synthesis estimates.

The proposed architecture compares very favorably with the state of the art, being smaller and considerably faster than similar works. The implementation of the proposed architecture is several times faster than the first three, lower-resource table entries, which are based on digit-serial multiplication. It is interesting to note that the proposed architecture achieved a higher clock frequency than all three, demonstrating that the simplified routing resulting from the pipelined architecture does indeed improve the critical path.

Comparing against alternative high-speed architectures, the implementation of the proposed architecture performs a point multiplication in only 60.01% of the estimated performance time of the fastest alternative. The area resource of that alternative is not stated for the field, only the performance time, but the resource usage (slices) of the proposed architecture is significantly less than the other alternative high-speed architectures. The improvements over the implementation from Jarvinen et al. are due once more to the reduced area requirements, but also to the reduced latency resulting from pipelining and the complex instructions. The further, seventh-stage pipelining has increased the clock frequency of the architecture without prohibitively increasing the latency, which leads to the fastest point multiplication time reported in the literature. Hence, the use of an application-specific instruction set in conjunction with pipelining has better exploited operation parallelism compared with duplicating arithmetic circuits as proposed elsewhere.

Comparing against our recently reported pipelined ASIP architecture, a 10.12% improvement in point multiplication time was attained, while the resource usage was increased by only around 2%. The improvement comes as a result of the new combined algorithm to perform DBL and ADD, in which fewer instructions are required, which reduces the latency, and of the increased pipeline depth, which increases the clock frequency.
VI. CONCLUSION
A high-performance ECC processor has been implemented using FPGA technology. A combined algorithm to perform DBL and ADD was developed based on the Lopez-Dahab point multiplication algorithm. The data path was pipelined, allowing operation-level parallelism to be exploited and reducing the point multiplication time. Consequently, an implementation with a four-stage pipeline achieved a fast point multiplication on a Xilinx Virtex-4 device, making it the fastest implementation. This work has confirmed the suitability of a pipelined data path together with an efficient Galois field multiplier over GF(2^163); the resulting ECC implementation over GF(2^163) provides strong security with a smaller key size.

REFERENCES
[1] J. Lutz and A. Hasan, "High performance FPGA based elliptic curve cryptographic co-processor," in Proc. Int. Conf. Inf. Technol.: Coding Comput. (ITCC), 2004, p. 486.
[2] F. Rodriguez-Henriquez, N. A. Saqib, and A. Diaz-Perez, "A fast parallel implementation of elliptic curve point multiplication over GF(2m)," Microprocessors Microsyst., vol. 28, pp. 329-339, 2004.
[3] N. Mentens, S. B. Ors, and B. Preneel, "An FPGA implementation of an elliptic curve processor over GF(2m)," presented at the 14th ACM Great Lakes Symp. VLSI, Boston, MA, 2004.
[4] J. Lopez and R. Dahab, "Fast multiplication on elliptic curves over GF(2m) without precomputation," presented at the Workshop on Cryptographic Hardware and Embedded Systems (CHES), Worcester, MA, 1999.
[5] G. Seroussi and N. P. Smart, Elliptic Curves in Cryptography. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[6] C. H. Kim and C. P. Hong, "High-speed division architecture for GF(2m)," Electron. Lett., vol. 38, pp. 835-836, 2002.
[7] W. Chelton and M. Benaissa, "High-speed pipelined ECC processor over GF(2m)," presented at the IEEE Workshop Signal Process. Syst. (SiPS), Banff, Canada, 2006.
[8] G. Seroussi and N. P. Smart, Elliptic Curves in Cryptography. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[9] P. L. Montgomery, "Speeding the Pollard and elliptic curve methods of factorization," 1987.
[10] Z. Yan and D. V. Sarwate, "New systolic architectures for inversion and division in GF(2m)," IEEE Trans. Comput., vol. 52, no. 11, pp. 1514-1519, Nov. 2003.
[11] C. H. Kim and C. P. Hong, "High-speed division architecture for GF(2m)," Electron. Lett., vol. 38, pp. 835-836, 2002.
[12] T. Itoh and S. Tsujii, "A fast algorithm for computing multiplicative inverses in GF(2m) using normal bases," Inf. Computation, vol. 78, pp. 171-177, 1988.
[13] P. A. Scott, S. E. Tavares, and L. E. Peppard, "A fast VLSI multiplier for GF(2m)," IEEE J. Sel. Areas Commun., vol. 4, no. 1.
[14] H. Wu, "Bit-parallel finite field multiplier and squarer using polynomial basis," IEEE Trans. Comput., vol. 51, no. 7, pp. 750-758, Jul. 2002.
[15] A. Reyhani-Masoleh and M. A. Hasan, "Low complexity bit parallel architectures for polynomial basis multiplication over GF(2m)," IEEE Trans. Comput., vol. 53, no. 8, pp. 945-959, Aug. 2004.
[16] L. Song and K. K. Parhi, "Low-energy digit-serial/parallel finite field multipliers," J. VLSI Signal Process. Syst., vol. 19, pp. 149-166, 1998.
[17] M. C. Mekhallalati, A. S. Ashur, and M. K. Ibrahim, "Novel radix finite field multiplier for GF(2m)," J. VLSI Signal Process., vol. 15, pp. 233-245, 1997.
[18] P. K. Mishra, "Pipelined computation of scalar multiplication in elliptic curve cryptosystems," presented at CHES, Cambridge, MA, 2004.
