AlgorithmStatus2005FEB

Freescale Semiconductor Internal Use Only. Freescale™ and the Freescale logo are trademarks of Freescale Semiconductor, Inc.
All other product or service names are the property of their respective owners. © Freescale Semiconductor, Inc. 2004
PowerPC Algorithm
Development Update
Bo Lin, CPD/NCSG
FEB 2005

What’s New?
• Vectorised RSA and modular exponentiation
• AltiVec and Scalar CRC32 (IEEE 802.3 FCS) for
wireless LAN
• Other vendors’ information on AES128
• Re-arranged presentation
  - Code available
  - Code in planning

Theme
Key Messages:
• PowerPC (core and AltiVec) is very efficient, more so
than is often expected.
• A software solution can deliver surprisingly good
results.
• The key is to develop algorithms best suited to
PowerPC.
2005 focus:
• Vectorised RSA and modular exponentiation
• Networking support – “dictionary”

The Impact of Innovative Algorithm Design
By properly designing algorithms for PowerPC,
performance can be dramatically increased.
A good example is EEMBC’s Convolutional
Encoder (see www.eembc.com), where the AltiVec-
enabled algorithm outperforms the “out-of-box”
version by a factor of 600 to 1000.
Another example is that many popular crypto
algorithms execute very efficiently on
PowerPC.

Benchmark information
All figures were measured on silicon
Small variances are expected on different test platforms
Figures on the 8641(D) are expected to be better, since the AltiVec
improvements on that device provide “out-of-order” execution
Devices & Algorithms
• G4 & AES, DES, 3DES, Kasumi, tdmCRC16, CRC32, RSA
  - MPC7457 @ 1000MHz

What is benchmarked?
• The cycle count of a function call. For example, we
measure how many cycles are needed for G4 or
other PowerPC processor to complete a function call
such as AES128(data, cipher, key).
• All CPU resources are used, following the same
benchmarking convention as other vendors.
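The figures on the following slides are related by simple arithmetic: a cycle count per data block gives bits/cycle, and multiplying by the clock rate gives Mbps. The sketch below shows that conversion, using the AES128 encryption figure (336 cycles per 128-bit block) and the vRSA signing figure (3,720,000 cycles) on a 1000 MHz G4 as inputs. The function names are illustrative only and are not part of any Freescale code.

```python
def bits_per_cycle(block_bits, cycles):
    """Core and compiler efficiency: bits processed per CPU cycle."""
    return block_bits / cycles


def throughput_mbps(block_bits, cycles, clock_mhz):
    """Absolute throughput in Mbit/s at the given clock rate in MHz."""
    # cycles per second = clock_mhz * 1e6 and 1 Mbit = 1e6 bits,
    # so the two factors of 1e6 cancel.
    return bits_per_cycle(block_bits, cycles) * clock_mhz


def latency_ms(cycles, clock_mhz):
    """Time for one function call in milliseconds."""
    return cycles / (clock_mhz * 1000.0)


if __name__ == "__main__":
    # AES128 encryption: 336 cycles per 128-bit block at 1000 MHz.
    print("AES128 bits/cycle:", round(bits_per_cycle(128, 336), 3))
    print("AES128 Mbps:", round(throughput_mbps(128, 336, 1000), 2))
    # vRSA signing: 3,720,000 cycles at 1000 MHz.
    print("vRSA ms per signing:", round(latency_ms(3_720_000, 1000), 2))
```

The same helpers apply to any cycle count in this deck, provided the block size and clock rate are taken from the matching slide.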

POPI Guideline
Please share figures with customers
but not the whole set of slides.

Benchmark information on Encryption (Public-key Cryptos)

vRSA Performance figures on G4
vRSA
• Data block length: 1024 bits
• Key length: 1024 bits
• IEEE P1363 test vector
Relative performance (cycle count)
• Signing: 3,720,000 cycles
Absolute performance on Test Setup
• 3.72 ms per signing
Other info: CRT, arbitrary length
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C
Compiler: gcc 3.3 on Linux
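The slide lists CRT (Chinese Remainder Theorem) under "Other info". As an illustration of what CRT signing means, here is a minimal sketch with toy parameters. It is not the vectorised vRSA code: the primes, exponents and function name are assumptions chosen for readability, there is no padding, and a key this small offers no security. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
def rsa_sign_crt(m, p, q, d):
    """Compute m**d mod (p*q) using two half-size exponentiations (CRT)."""
    dp = d % (p - 1)          # private exponent reduced for the p half
    dq = d % (q - 1)          # private exponent reduced for the q half
    q_inv = pow(q, -1, p)     # q^-1 mod p
    s1 = pow(m, dp, p)        # signature residue mod p
    s2 = pow(m, dq, q)        # signature residue mod q
    h = (q_inv * (s1 - s2)) % p
    return s2 + h * q         # recombined signature (Garner's formula)


if __name__ == "__main__":
    # Toy key, assumed for illustration: n = 61 * 53 = 3233, e = 17, d = 2753.
    p, q, e, d = 61, 53, 17, 2753
    message = 65
    signature = rsa_sign_crt(message, p, q, d)
    # Verification: signature**e mod n recovers the message.
    print(pow(signature, e, p * q) == message)
```

CRT replaces one full-width exponentiation with two half-width ones, which is typically why a signing operation costs fewer cycles than a general modular exponentiation of the same modulus size.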

1024-bit modular exponentiation on G4
Y = X^E mod N
• All parameters are 1024 bits long
• Modular exp: 12,850,000 cycles
• 12.85 ms per mod exp
Other info
• E is usually 160 bits or 256 bits
• Arbitrary length
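For reference, Y = X^E mod N can be computed with the textbook left-to-right square-and-multiply method sketched below. This is a generic scalar illustration and not the vectorised routine benchmarked on this slide; the function name is an assumption.

```python
def mod_exp(x, e, n):
    """Return x**e mod n by scanning the exponent bits from the top."""
    if n == 1:
        return 0
    y = 1
    x %= n
    for bit in bin(e)[2:]:        # most significant bit first
        y = (y * y) % n           # square for every exponent bit
        if bit == "1":
            y = (y * x) % n       # multiply only when the bit is set
    return y


if __name__ == "__main__":
    # Cross-check against Python's built-in three-argument pow().
    x, e, n = 4, 13, 497
    print(mod_exp(x, e, n) == pow(x, e, n))
```

The loop performs one modular squaring per exponent bit plus one modular multiplication per set bit, so a shorter exponent E (for example 160 bits rather than 1024) reduces the work roughly in proportion.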
Test Setup:

Benchmark information on Encryption (Conventional Cryptos)

AES128 Performance figures on G4
AES128
• Encryption: 336 cycles per 128 bit data block
• → 0.381 bits/cycle → 2.6 cycles/bit
• Decryption: 335 cycles per 128 bit data block
• 380.9 Mbps for encryption
Test Setup:

AES128 Benchmarked by Others (reference only)
http://www.tcs.hut.fi/~helger/aes/rijndael.html
• Relative performance (cycle count) on 7457 @ 1.25 GHz with gcc
  - Encryption: 385 cycles per 128 bit data block
  - Decryption: 391 cycles per 128 bit data block
• Relative performance (cycle count) on Pentium M @ 1.33 GHz with gcc
  - Decryption: 376 cycles per 128 bit data block
http://www.eskimo.com/~weidai/benchmarks.html
• Relative performance (cycle count) on Pentium 4 @ 2.1GHz
with Microsoft Visual C++ .NET 2003 (whole program optimization, optimize for
speed, P4 code generation)

AES192 Performance Figures on G4
AES192
Test Setup:

AES256 Performance Figures on G4
AES256
Test Setup:

Kasumi Performance Figures on G4
Kasumi
• Kasumi operation: 427 cycles per 64 bit data block
• 149.9 Mbps
Test Setup:

DES Performance Figures on G4
DES
• DES function: 423 cycles per 64 bit data block
Test Setup:

3DES Performance Figures on G4
3DES
• 3DES function: 1214 cycles per 64 bit data block
Test Setup:

Absolute Performance (Mbps)
[Chart: tested performance in Mbps (axis 0 to 400) on G4/1000 for AES128, AES192, AES256, Kasumi, DES and 3DES]

Core & Compiler Efficiency (bits/cycle)
[Chart: core & compiler efficiency in bits/cycle (axis 0 to 0.4) on G4/1000 for AES128, AES192, AES256, Kasumi, DES and 3DES]

Benchmark Information on CRC

G4 Performance for CRC32 (IEEE 802.3 FCS)
Frame size  AltiVec-enabled       Scalar
(bytes)     cycles  bits/cycle    cycles  bits/cycle
60          281     1.71          250     1.92
62          303     1.64          277     1.79
64          247     2.07          264     1.94
71          287     1.98          299     1.90
80          277     2.31          320     2.00
89          320     2.23          352     2.02
96          307     2.50          367     2.09
105         350     2.40          408     2.06
112         337     2.66          432     2.07
128         367     2.79          488     2.10
161         429     3.00          604     2.13
192         487     3.15          712     2.16
225         549     3.28          828     2.17
256         607     3.37          936     2.19
384         847     3.63          1384    2.22
449         970     3.70          1613    2.23
512         1088    3.76          1832    2.24
768         1568    3.92          2728    2.25
895         1864    3.84          3183    2.25
1020        2081    3.92          3610    2.26
1024        2048    4.00          3624    2.26
1514        3004    4.03          5354    2.26
1536        3008    4.09          5417    2.27
2048        3968    4.13          7208    2.27
4096        7808    4.20          14377   2.28
* Frame size is in bytes
* On the original slide, the coloured frame sizes are those
within the specification of IEEE 802.3
Absolute asymptotic performance
on Test Setup:
AltiVec: 4200 Mbps
Scalar: 2280 Mbps
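As a reference point for what is being measured, the IEEE 802.3 FCS is the standard reflected CRC-32 (polynomial 0x04C11DB7, bit-reversed form 0xEDB88320, initial value and final XOR of 0xFFFFFFFF). The sketch below is a plain byte-at-a-time table-driven version, cross-checked against `zlib.crc32`. It illustrates the function only and says nothing about how the AltiVec or scalar routines in the table are implemented.

```python
import zlib


def make_crc32_table():
    """Build the 256-entry lookup table for the reflected CRC-32."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            # Shift right; apply the reflected polynomial when a 1 falls out.
            crc = (crc >> 1) ^ 0xEDB88320 if crc & 1 else crc >> 1
        table.append(crc)
    return table


CRC32_TABLE = make_crc32_table()


def crc32_fcs(frame):
    """Return the CRC-32 of a bytes-like frame, one table lookup per byte."""
    crc = 0xFFFFFFFF
    for b in frame:
        crc = CRC32_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    frame = bytes(range(60))  # a 60-byte frame, the smallest size in the table
    print(crc32_fcs(frame) == zlib.crc32(frame))
```

A byte-at-a-time loop has a serial dependency on `crc`, which is one reason per-frame overhead weighs more heavily on short frames than on long ones.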
Test Setup:

G4 Performance figures for multi-channel CRC16
tdmCRC16
• 16.35 Gbps when processing 16
channels simultaneously, AltiVec
enabled
• 16.35 bits/cycle → 0.061 cycles/bit
BL: This function reads 16 equal-length
packets arranged in columns and
calculates all 16 CRC16 values
simultaneously. One potential
application is a CRC8 variant to
calculate the HECs of ATM cell
headers. Even higher performance
is expected.
Other test case benchmark figures
are available upon request
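The column arrangement described above can be pictured as follows: byte i of every channel is stored side by side, so one pass over a row advances all 16 CRCs by one byte. The sketch models that lockstep update in scalar Python. The CRC-16 variant (polynomial 0x1021, initial value 0, as in CRC-16/XMODEM), the helper names and the exact data layout are assumptions for illustration; the slide does not state which polynomial tdmCRC16 uses.

```python
import binascii

CHANNELS = 16
POLY = 0x1021  # assumed polynomial (CRC-16/CCITT family)


def crc16_step(crc, byte):
    """Advance one 16-bit CRC by one input byte, MSB first."""
    crc ^= byte << 8
    for _ in range(8):
        if crc & 0x8000:
            crc = ((crc << 1) ^ POLY) & 0xFFFF
        else:
            crc = (crc << 1) & 0xFFFF
    return crc


def interleave(packets):
    """Arrange equal-length packets in columns: row i holds byte i of each."""
    length = len(packets[0])
    if any(len(p) != length for p in packets):
        raise ValueError("all packets must have the same length")
    return bytes(p[i] for i in range(length) for p in packets)


def tdm_crc16(columns, channels=CHANNELS):
    """Compute one CRC per channel, updating all channels row by row."""
    crcs = [0] * channels
    for start in range(0, len(columns), channels):
        row = columns[start:start + channels]
        # On AltiVec this row update would be a handful of vector operations.
        crcs = [crc16_step(crc, b) for crc, b in zip(crcs, row)]
    return crcs


if __name__ == "__main__":
    packets = [bytes((c * 7 + i) & 0xFF for i in range(32)) for c in range(CHANNELS)]
    expected = [binascii.crc_hqx(p, 0) for p in packets]
    print(tdm_crc16(interleave(packets)) == expected)
```

Because the 16 channels are independent, the per-byte dependency that limits a single CRC does not limit the batch, which is consistent with the much higher aggregate bits/cycle figure quoted above.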
Test Setup:

Absolute Asymptotic Performance (Mbps)
[Chart: tested performance in Mbps (axis 0 to 18000) on G4/1000 for tdmCRC16, vCRC32 and sCRC32]

Core & Compiler Efficiency (bits/cycle)
[Chart: core & compiler efficiency in bits/cycle (axis 0 to 18) on G4/1000 for tdmCRC16, vCRC32 and sCRC32]

Other Information on Benchmark, Codes, etc

Performance Comparison with scalar version routines:
checksum()
[Chart: checksum() throughput in MB/second (axis 0 to 2500) against bytes checked (32 to 7936). Series: AltiVec, Linux assembly, C]
Test setup: 7445 1GHz, MPC107 with 133MHz bus
36.5 bits/cycle
More improvement
possible

Code for benchmarking and evaluation
1. Encryption
   1. DES/3DES
   2. AES128/192/256
   3. Kasumi
   4. RC5
   5. RSA/modular exp
2. CRCs
   1. CRC32/24/16/12/8
3. EEMBC (www.eembc.com)
   1. Telecom suite
   2. Networking suite
   3. Consumer suite
4. 3GPP TS25.212 (baseband symbol rate processing)
   1. Report available upon request

AltiVec/e600 offering
1. Code for Web download – ready to use
• Fully tested code and API complying with, say, ANSI C or Linux
• http://www.freescale.com/altivec
2. Prototype code – customer evaluation & integration
• Developed, benchmarked, and verified against a set of test vectors.
• For example, the AES code can reproduce the test vectors in FIPS 197
• IEEE FCS/CRC32 generates correct results against IEEE 802.11
• 1024-bit RSA reproduces the IEEE P1363 test vectors
• Some restrictions may apply.
• For example, AltiVec-enabled code may require the data buffer to be
quadword aligned.
• Documentation may not be complete
3. Code in pipeline – under development and planned
• Code committed to be in “prototype” state
• Customer input welcome

What is in the AltiVec-enabled libc library?
*Not ANSI C but present in many implementations (e.g.
VxWorks)
strlen
strcmp
strcpy
strcmp
strncpy
memcpy
memmove
memcmp
memchr
bcopy*
bzero*

Other networking functions from Linux that can be done in
AltiVec
__copy_tofrom_user
csum_partial
csum_partial_copy_generic
page_copy
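For readers unfamiliar with the checksum routines listed above, the sketch below shows the 16-bit ones' complement Internet checksum (RFC 1071) that such routines accumulate: a running sum that can be carried across buffers, then folded to 16 bits and inverted. The function names and the Python form are illustrative assumptions and do not reproduce the Linux kernel signatures or the AltiVec implementation.

```python
def partial_sum(data, initial=0):
    """Accumulate big-endian 16-bit words; carries are left unfolded."""
    total = initial
    if len(data) % 2:
        data = bytes(data) + b"\x00"   # pad an odd trailing byte with zero
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    return total


def fold_checksum(total):
    """Fold carries back into 16 bits and return the ones' complement."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    # Sample bytes from the worked example in RFC 1071.
    data = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(fold_checksum(partial_sum(data))))
```

Since the sum is associative, a buffer can be split into independent lanes and the lane sums added at the end, which is the property that makes this routine a natural fit for a vector unit.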

How are the AltiVec-enabled libc libraries used?
Freescale provides an archive (binary file) that can be
linked with existing object files.
Dhrystone example:
ld dhry21a.o dhry21b.o c:/sw/libmoto/libmotovec.a c:/gcc/lib/gcc-lib/powerpc-eabisim/2.95.2/libgcc.a -( -lsim -lc ) -o gccBM.elf
Put libmotovec.a on the linker command line before the
compiler’s libc library
The binary library is EABI compliant, i.e. independent of
the compiler used.
• Proven with gcc, Green Hills, Diab, Metaware, Metrowerks

Planned Code
(customer input welcome)

s/w solution table lookup (“Dictionary”) performance estimate
• Main feature
• One underlying data structure to support mixed instructions consisting of
lookup, LPM, range search, insert, delete
• Key technology
• Planar subdivided dictionaries
• Linear hashing ax mod M under 100 cycles
• For x with variable length (say, 64 or 128 bytes)
  - 64 bytes = 4 vectors
  - 4 vec_msum()’s + 4 vec_sums()’s + 3 mul()’s + 3 add()’s + 1 div() ≈ 20 cycles
• AltiVec enabled
• LPM and range search under 1000 cycles
• The above estimate is based on others’ implementations on Pentium M
• No restriction on key length, i.e. works for IPv4, IPv6, or very long keys.
  - Performance is related to # of entries
• On-the-fly table update
• Benchmark information scheduled for July 2005
  - Co-operation with customers?
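One reading of "linear hashing ax mod M" for a multi-byte key is a dot product between the key bytes and a fixed coefficient vector, reduced modulo M, which matches the multiply-sum operations counted above. The sketch below illustrates that reading only. The modulus, the coefficient generation and the function names are assumptions, and the planned code may differ.

```python
import random

M = 65521  # assumed modulus: the largest prime below 2**16


def make_coefficients(length, seed=2005):
    """Generate a fixed, reproducible coefficient vector a (illustrative)."""
    rng = random.Random(seed)
    return [rng.randrange(1, M) for _ in range(length)]


def linear_hash(key, coeffs, modulus=M):
    """Return (a . x) mod M for a key x of up to len(coeffs) bytes."""
    if len(key) > len(coeffs):
        raise ValueError("key is longer than the coefficient vector")
    # The multiply-accumulate below is the part a vector unit can do
    # 16 bytes at a time; the single reduction corresponds to the final div.
    acc = sum(a * x for a, x in zip(coeffs, key))
    return acc % modulus


if __name__ == "__main__":
    coeffs = make_coefficients(64)          # 64-byte keys = 4 vectors of 16 bytes
    ipv6_like_key = bytes(range(16))        # 16-byte key, e.g. an IPv6 address
    print(0 <= linear_hash(ipv6_like_key, coeffs) < M)
```

Keys shorter than the coefficient vector simply use a prefix of it, so the same table of coefficients can serve IPv4, IPv6 and longer keys.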

