Freescale Semiconductor Internal Use Only. Freescale™ and the Freescale logo are trademarks of Freescale Semiconductor, Inc.
All other product or service names are the property of their respective owners. © Freescale Semiconductor, Inc. 2004
PowerPC Algorithm
Development Update
Bo Lin, CPD/NCSG
FEB 2005
Slide 2
What’s New?
• Vectorised RSA and modular exponentiation
• AltiVec and Scalar CRC32 (IEEE 802.3 FCS) for
wireless LAN
• Other vendors’ information on AES128
• Re-arranged presentation
  • Code available
  • Code in planning
Slide 3
Theme
Key Messages:
• PowerPC (core and AltiVec) is very efficient –
beyond imagination.
• Software solutions can provide surprisingly good
results.
• The key is to develop algorithms best suited to
PowerPC.
2005 focus:
• Vectorised RSA and modular exponentiation
• Networking support – “dictionary”
Slide 4
The Impact of Innovative Algorithm Design
By properly designing algorithms for PowerPC,
performance can be dramatically increased.
A good example is EEMBC’s Convolutional
Encoder (see www.eembc.com), where the AltiVec-
enabled algorithm outperforms the “out-of-box”
version by a factor of 600 to 1000.
Another example is that many popular crypto
algorithms are very efficiently executed on
PowerPC.
Slide 5
Benchmark information
All figures were measured on silicon
Small variances are expected on different test platforms
Figures on the 8641(D) are expected to be better, since the AltiVec
improvements on that device provide “out-of-order” execution
Devices & Algorithms
• G4 & AES, DES, 3DES, Kasumi, tdmCRC16, CRC32, RSA
  • MPC7457 @ 1000MHz
Slide 6
What is benchmarked?
• The cycle count of a function call. For example, we
measure how many cycles are needed for G4 or
other PowerPC processor to complete a function call
such as AES128(data, cipher, key).
• All CPU resources are used, following the same
benchmarking convention as other vendors.
Slide 7
POPI Guideline
Please share figures with customers
but not the whole set of slides.
Benchmark information on Encryption (Public key Cryptos)
Slide 9
vRSA Performance figures on G4
vRSA
• Data block length: 1024 bits
• Key length: 1024 bits
• IEEE P1363 test vector
Relative performance (cycle count)
• Signing: 3,720,000 cycles
Absolute performance on Test Setup
• 3.72 ms per signing
Other info: CRT, arbitrary length
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
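The “CRT” note above refers to the Chinese Remainder Theorem speed-up for RSA private-key operations: the signature is computed with two half-size exponentiations and then recombined. The following is a minimal Python sketch of that idea only. It assumes toy primes (61 and 53), uses the hypothetical helper names `make_crt_key` and `sign_crt`, omits padding, and is not the vectorised vRSA code that was benchmarked.

```python
# Toy RSA-CRT signing sketch. The tiny primes are for illustration only;
# the benchmark above uses 1024-bit keys.

def make_crt_key(p, q, e):
    """Build the CRT form of an RSA private key from primes p, q and public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)
    return {
        "n": n, "e": e, "d": d, "p": p, "q": q,
        "dp": d % (p - 1),       # exponent used modulo p
        "dq": d % (q - 1),       # exponent used modulo q
        "qinv": pow(q, -1, p),   # q^-1 mod p, used to recombine the halves
    }

def sign_crt(m, key):
    """Compute m^d mod n with two half-size exponentiations (Garner recombination)."""
    s1 = pow(m, key["dp"], key["p"])
    s2 = pow(m, key["dq"], key["q"])
    h = (key["qinv"] * (s1 - s2)) % key["p"]
    return s2 + h * key["q"]

key = make_crt_key(61, 53, 17)
signature = sign_crt(65, key)
```

Verifying the result is a single public-key operation: `pow(signature, key["e"], key["n"])` returns the original message value.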
Slide 10
1024-bit modular exponentiation on G4
Y = X^E mod N
• all parameters are of 1024 bits
Relative performance (cycle count)
• Modular exp: 12,850,000 cycles
Absolute performance on Test Setup
• 12.85 ms per mod exp
Other info
• E usually is 160 bits or 256 bits
• Arbitrary length
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
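One common way to compute a modular exponentiation such as Y = X^E mod N is left-to-right square-and-multiply. The Python sketch below illustrates that textbook method under that assumption; it is not the benchmarked routine, and `mod_exp` and `cycles_to_ms` are hypothetical names. The second helper shows how a cycle count converts to wall-clock time at a given clock frequency.

```python
def mod_exp(x, e, n):
    """Left-to-right binary (square-and-multiply) computation of x**e mod n."""
    result = 1 % n
    x %= n
    for bit in bin(e)[2:]:                 # scan exponent bits, most significant first
        result = (result * result) % n     # square for every bit
        if bit == "1":
            result = (result * x) % n      # multiply only when the bit is set
    return result

def cycles_to_ms(cycles, clock_hz):
    """Convert a cycle count to milliseconds at the given clock frequency."""
    return cycles / clock_hz * 1e3
```

With the figures from this slide, `cycles_to_ms(12_850_000, 1e9)` gives about 12.85 ms. The loop performs one squaring per exponent bit, so the cost grows with the bit length of E.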
Benchmark information on Encryption (Conventional Cryptos)
Slide 12
AES128 Performance figures on G4
AES128
• Data block length: 128 bits
• Key length: 128 bits
Relative performance (cycle count)
• Encryption: 336 cycles per 128 bit data block
• → 0.381 bits/cycle → 2.6 cycles/bit
• Decryption: 335 cycles per 128 bit data block
• → 0.382 bits/cycle → 2.6 cycles/bit
Absolute performance on Test Setup
• 380.9 Mbps for encryption
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
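The derived figures on these slides follow directly from the measured cycle count, the block size, and the clock frequency. A small Python sketch of that arithmetic, with the hypothetical helper names `efficiency` and `throughput_mbps`:

```python
def efficiency(cycles_per_block, block_bits):
    """Return (bits per cycle, cycles per bit) for one processed block."""
    bits_per_cycle = block_bits / cycles_per_block
    cycles_per_bit = cycles_per_block / block_bits
    return bits_per_cycle, cycles_per_bit

def throughput_mbps(cycles_per_block, block_bits, clock_mhz):
    """Throughput in Mbps: bits per cycle multiplied by millions of cycles per second."""
    return block_bits / cycles_per_block * clock_mhz

# AES128 encryption on the test setup: 336 cycles per 128-bit block at 1000 MHz.
bits_per_cycle, cycles_per_bit = efficiency(336, 128)
mbps = throughput_mbps(336, 128, 1000)
```

Because the test clock is 1000 MHz, the Mbps figure is simply 1000 times the bits/cycle figure; on a device with a different clock, only `clock_mhz` changes.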
Slide 13
AES128 Benchmarked by Others (reference only)
http://www.tcs.hut.fi/~helger/aes/rijndael.html
• Relative performance (cycle count) on 7457 @ 1.25 GHz with gcc
  • Encryption: 385 cycles per 128 bit data block
  • Decryption: 391 cycles per 128 bit data block
• Relative performance (cycle count) on Pentium M @ 1.33 GHz with gcc
  • Encryption: 348 cycles per 128 bit data block
  • Decryption: 376 cycles per 128 bit data block
http://www.eskimo.com/~weidai/benchmarks.html
• Relative performance (cycle count) on Pentium 4 @ 2.1GHz
with Microsoft Visual C++ .NET 2003 (whole program optimization, optimize for
speed, P4 code generation)
  • Encryption: 542 cycles per 128 bit data block
Slide 14
AES192 Performance Figures on G4
AES192
• Data block length: 128 bits
• Key length: 192 bits
Relative performance (cycle count)
• Encryption: 390 cycles per 128 bit data block
• → 0.328 bits/cycle → 3.0 cycles/bit
• Decryption: 387 cycles per 128 bit data block
• → 0.331 bits/cycle → 3.0 cycles/bit
Absolute performance on Test Setup
• 328.2 Mbps for encryption
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
Slide 15
AES256 Performance Figures on G4
AES256
• Data block length: 128 bits
• Key length: 256 bits
Relative performance (cycle count)
• Encryption: 444 cycles per 128 bit data block
• → 0.288 bits/cycle → 3.5 cycles/bit
• Decryption: 438 cycles per 128 bit data block
• → 0.292 bits/cycle → 3.4 cycles/bit
Absolute performance on Test Setup
• 288.2 Mbps for encryption
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
Slide 16
Kasumi Performance Figures on G4
Kasumi
• Data block length: 64 bits
• Key length: 128 bits
Relative performance (cycle count)
• Kasumi operation: 427 cycles per 64 bit data block
• → 0.150 bits/cycle → 6.67 cycles/bit
Absolute performance on Test Setup
• 149.9 Mbps
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
Slide 17
DES Performance Figures on G4
DES
• Data block length: 64 bits
• Key length: 56 bits
Relative performance (cycle count)
• DES function: 423 cycles per 64 bit data block
• → 0.151 bits/cycle → 6.61 cycles/bit
Absolute performance on Test Setup
• 151.1 Mbps for encryption
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
Slide 18
3DES Performance Figures on G4
3DES
• Data block length: 64 bits
• Key length: 112 bits
Relative performance (cycle count)
• 3DES function: 1214 cycles per 64 bit data block
• → 0.053 bits/cycle → 18.9 cycles/bit
Absolute performance on Test Setup
• 52.9 Mbps for encryption
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
Slide 19
Absolute Performance (Mbps)
[Bar chart: tested performance (Mbps) on G4/1000 for AES128, AES192, AES256, Kasumi, DES and 3DES; vertical axis from 0 to 400 Mbps]
Slide 20
Core & Compiler Efficiency (bits/cycle)
[Bar chart: core & compiler efficiency (bits/cycle) on G4/1000 for AES128, AES192, AES256, Kasumi, DES and 3DES; vertical axis from 0 to 0.4 bits/cycle]
Benchmark Information on CRC
Slide 22
G4 Performance for CRC32 (IEEE 802.3 FCS)
Frame Size   AltiVec-enabled         Scalar
(bytes)      cycles   bits/cycle     cycles   bits/cycle
60           281      1.71           250      1.92
62           303      1.64           277      1.79
64           247      2.07           264      1.94
71           287      1.98           299      1.90
80           277      2.31           320      2.00
89           320      2.23           352      2.02
96           307      2.50           367      2.09
105          350      2.40           408      2.06
112          337      2.66           432      2.07
128          367      2.79           488      2.10
161          429      3.00           604      2.13
192          487      3.15           712      2.16
225          549      3.28           828      2.17
256          607      3.37           936      2.19
384          847      3.63           1384     2.22
449          970      3.70           1613     2.23
512          1088     3.76           1832     2.24
768          1568     3.92           2728     2.25
895          1864     3.84           3183     2.25
1020         2081     3.92           3610     2.26
1024         2048     4.00           3624     2.26
1514         3004     4.03           5354     2.26
1536         3008     4.09           5417     2.27
2048         3968     4.13           7208     2.27
4096         7808     4.20           14377    2.28
* Frame Size is in bytes
* The coloured Frame Sizes are within the specification of IEEE 802.3
Absolute Asymptotic Performance
on Test Setup:
AltiVec: 4200 Mbps
Scalar: 2280 Mbps
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
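For reference, the IEEE 802.3 FCS is the standard reflected CRC32 with polynomial 0xEDB88320, an all-ones initial value and a final inversion. The bit-at-a-time Python sketch below shows the function being computed; it is a plain reference form, not the AltiVec or scalar routine measured above, and `crc32_bitwise` and `bits_per_cycle` are hypothetical names. The second helper reproduces the bits/cycle column of the table.

```python
import zlib

def crc32_bitwise(data, crc=0):
    """Reference CRC32 (IEEE 802.3 FCS), processing one bit per inner iteration."""
    crc ^= 0xFFFFFFFF                      # initial value: all ones
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Reflected form: shift right, apply the polynomial when a 1 falls out.
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF                # final inversion

def bits_per_cycle(frame_bytes, cycles):
    """Efficiency figure used in the table: frame bits divided by cycle count."""
    return frame_bytes * 8 / cycles

# Cross-check against the CRC32 implementation in Python's standard library.
frame = bytes(range(60))
matches_zlib = crc32_bitwise(frame) == zlib.crc32(frame)
```

For the 1514-byte row, `bits_per_cycle(1514, 3004)` gives about 4.03, which matches the AltiVec-enabled column.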
Slide 23
G4 Performance figures for multi-channel CRC16
tdmCRC16
• 16.35 Gbps for processing 16
channels simultaneously, AltiVec
enabled
• 16.35 bits/cycle → 0.061 cycles/bit
BL: This function reads 16 equal-length
packets arranged in columns and
calculates all 16 CRC16 values
simultaneously. One potential
application is a CRC8 version that
calculates the HEC of ATM cell
headers; even higher performance
is expected there.
Other test case benchmark figures
are available upon request
Test Setup:
CPU: 7457 without L3 @1GHz
Language: ANSI C Compiler: gcc 3.3 on Linux
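The column-wise, multi-channel idea described above can be sketched in scalar Python: one CRC state is kept per channel, and every step consumes one byte from each of the equal-length packets. This is an illustration of the data flow only, not the AltiVec implementation. The slide does not name the CRC16 polynomial, so the CCITT polynomial 0x1021 with initial value 0xFFFF is an assumption, and `crc16_update`, `crc16` and `tdm_crc16` are hypothetical names.

```python
POLY = 0x1021  # assumed CRC-16/CCITT polynomial; the slide does not name one

def crc16_update(crc, byte):
    """Fold one byte into a 16-bit CRC state, most significant bit first."""
    crc ^= byte << 8
    for _ in range(8):
        if crc & 0x8000:
            crc = ((crc << 1) ^ POLY) & 0xFFFF
        else:
            crc = (crc << 1) & 0xFFFF
    return crc

def crc16(packet, init=0xFFFF):
    """Single-channel reference: process one packet from start to end."""
    crc = init
    for byte in packet:
        crc = crc16_update(crc, byte)
    return crc

def tdm_crc16(packets, init=0xFFFF):
    """Advance one CRC state per channel in lockstep, one byte 'row' at a time."""
    length = len(packets[0])
    if any(len(p) != length for p in packets):
        raise ValueError("all packets must have equal length")
    states = [init] * len(packets)
    for row in zip(*packets):              # row holds byte i of every channel
        states = [crc16_update(s, b) for s, b in zip(states, row)]
    return states

# Sixteen equal-length packets, one per channel.
channels = [bytes((ch * 7 + i) & 0xFF for i in range(48)) for ch in range(16)]
results = tdm_crc16(channels)
```

The inner list comprehension is the part a vector unit would execute as a single operation across all 16 lanes; in this sketch it is an ordinary loop.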
Slide 24
Absolute Asymptotic Performance (Mbps)
[Bar chart: tested performance (Mbps) on G4/1000 for tdmCRC16, vCRC32 and sCRC32; vertical axis from 0 to 18000 Mbps]
Slide 25
Core & Compiler Efficiency (bits/cycle)
[Bar chart: core & compiler efficiency (bits/cycle) on G4/1000 for tdmCRC16, vCRC32 and sCRC32; vertical axis from 0 to 18 bits/cycle]
Other Information on Benchmark, Codes, etc
Slide 27
Performance Comparison with scalar version routines: checksum()
[Line chart: throughput (MB/second, 0 to 2500) against bytes checked (32 to 7936) for three implementations: AltiVec, Linux assembly, and C]
Test setup: 7445 1GHz, MPC107 with 133MHz bus
36.5 bits/cycle
More improvement possible
Slide 28
Code for benchmarking and evaluation
1. Encryption
   1. DES/3DES
   2. AES128/192/256
   3. Kasumi
   4. RC5
   5. RSA/modular exp
2. CRC’s
   1. CRC32/24/16/12/8
3. EEMBC www.eembc.com
   1. Telecom suite
   2. Networking suite
   3. Consumer suite
4. 3GPP TS25.212 (baseband symbol rate processing)
   1. Report available upon request
Slide 29
AltiVec/e600 offering
1. Code for Web download – ready to use
• Fully tested code and API complying with, say, ANSI C or Linux
• http://www.freescale.com/altivec
2. Prototype code – customer evaluation & integration
• Developed, benchmarked, and verified against a set of test vectors.
  • For example, the AES code reproduces the test vectors in FIPS 197
  • IEEE FCS/CRC32 generates correct results against IEEE 802.11
  • 1024-bit RSA reproduces the IEEE P1363 test vectors
• Some restrictions may apply.
  • For example, AltiVec-enabled code may require data buffers to be quad-word aligned.
• Documentation may not be complete
3. Code in pipeline – under development and planned
• Code committed to be in “prototype” state
• Customer input welcome
Slide 30
What is in the AltiVec-enabled libc library?
*Not ANSI C but present in many implementations (e.g.
VxWorks)
strlen
strcmp
strcpy
strncpy
memcpy
memmove
memcmp
memchr
bcopy*
bzero*
Slide 31
Other networking functions from Linux that can be done in
AltiVec
__copy_tofrom_user
csum_partial
csum_partial_copy_generic
page_copy
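The checksum routines in this list are built around the 16-bit one's-complement sum used by IP, TCP and UDP (RFC 1071). A minimal Python sketch of that computation follows, assuming big-endian 16-bit words. The helper names `partial_sum` and `fold_checksum` are hypothetical stand-ins chosen for this example and do not reproduce the kernel's interfaces.

```python
def partial_sum(data, total=0):
    """Accumulate 16-bit big-endian words into a running sum (carries kept for later)."""
    data = bytes(data)
    if len(data) % 2:
        data += b"\x00"                    # pad a trailing odd byte with zero
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    return total

def fold_checksum(total):
    """Fold carries back into 16 bits (end-around carry) and complement the result."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# Example bytes from RFC 1071.
example = bytes.fromhex("0001f203f4f5f6f7")
checksum = fold_checksum(partial_sum(example))
```

Because the sum is accumulated before folding, a buffer can be processed in several even-length chunks by passing the previous result as `total`. That independence between words is also what makes the operation a natural fit for vector instructions.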
Slide 32
How are the AltiVec-enabled libc libraries used?
Freescale provides an archive (binary file) that can be
linked with existing object files.
Dhrystone example:
ld dhry21a.o dhry21b.o c:/sw/libmoto/libmotovec.a
c:/gcc/lib/gcc-lib/powerpc-eabisim2.95.2libgcc.a -( -lsim -lc )
-o gccBM.elf
Put libmotovec.a on the linker command line before the
compiler’s libc library, so that the linker resolves these
functions from libmotovec.a first
The binary library is EABI compliant, i.e. independent of
the compiler used.
• Proven with gcc, Green Hills, Diab, Metaware, Metrowerks
Planned Code (customer input welcome)
Slide 34
s/w solution table lookup (“Dictionary”) performance estimate
• Main feature
• One underlying data structure to support mixed instructions consisting of
lookup, longest-prefix match (LPM), range search, insert, delete
• Key technology
• Planar subdivided dictionaries
• Linear hashing ax mod M under 100 cycles
• For x with variable length (say, 64 or 128 bytes)
  • 64 bytes = 4 vectors
  • 4 vec_msum()’s + 4 vec_sums() + 3 mul()’s + 3 add()’s + 1 div() ≈ 20 cycles
• AltiVec enabled
• LPM and range search under 1000 cycles
• The above estimate is based on others’ implementation on Pentium M
• No restriction on key length, i.e. works for IPv4, IPv6, or very long keys.
  • Performance is related to # of entries
• On-the-fly table update
• Benchmark information scheduled for July 2005
  • Co-operation with customers?
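The “linear hashing ax mod M” item can be read as a dot product between a coefficient vector a and the key words x, reduced modulo the table size M, which is the kind of multiply-and-sum work the vec_msum() count refers to. The Python sketch below illustrates that reading only. The table size 65521, the 16-bit word width, the seeded random coefficients, and the names `make_coefficients` and `linear_hash` are all assumptions made for this example and do not come from the slide.

```python
import random

M = 65521          # hypothetical table size (a prime); not given in the slide
WORD_BYTES = 2     # assumed 16-bit words

def make_coefficients(key_bytes, seed=1):
    """Pick one fixed multiplier per key word (seeded so the hash is repeatable)."""
    rng = random.Random(seed)
    return [rng.randrange(1, M) for _ in range(key_bytes // WORD_BYTES)]

def linear_hash(key, coeffs):
    """h(x) = (a . x) mod M over the 16-bit words of the key."""
    words = [int.from_bytes(key[i:i + WORD_BYTES], "big")
             for i in range(0, len(key), WORD_BYTES)]
    return sum(a * x for a, x in zip(coeffs, words)) % M

coeffs = make_coefficients(64)             # 64-byte key = 32 words = 4 x 128-bit vectors
bucket = linear_hash(bytes(range(64)), coeffs)
```

The multiply-and-accumulate over all words is the vectorisable part; only the final reduction modulo M needs a division. A longer key would need a correspondingly longer coefficient list.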

A2 a peep into the fastest servers for database middleware and enterprise j...Dr. Wilfred Lin (Ph.D.)
 
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph clusterCeph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Community
 
POLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewPOLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overview
Alexander Grudanov
 

Similar to AlgorithmStatus2005FEB (20)

Security a SPARC M7 CPU
Security a SPARC M7 CPUSecurity a SPARC M7 CPU
Security a SPARC M7 CPU
 
Oracle SPARC T7 a M7 servery
Oracle SPARC T7 a M7 serveryOracle SPARC T7 a M7 servery
Oracle SPARC T7 a M7 servery
 
Konsolidace Oracle DB na systémech s procesory M7
Konsolidace Oracle DB na systémech s procesory M7Konsolidace Oracle DB na systémech s procesory M7
Konsolidace Oracle DB na systémech s procesory M7
 
Pedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon InnovationPedal to the Metal: Accelerating Spark with Silicon Innovation
Pedal to the Metal: Accelerating Spark with Silicon Innovation
 
Crypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M ProcessorsCrypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M Processors
 
Why_Oracle_Hardware.ppt
Why_Oracle_Hardware.pptWhy_Oracle_Hardware.ppt
Why_Oracle_Hardware.ppt
 
Session 307 ravi pendekanti engineered systems
Session 307  ravi pendekanti engineered systemsSession 307  ravi pendekanti engineered systems
Session 307 ravi pendekanti engineered systems
 
Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7
 
Cерверы Depo storm 3400 на базе новейших процессоров intel xeon e5 2600v3 fin
Cерверы Depo storm 3400 на базе новейших процессоров intel xeon e5 2600v3 finCерверы Depo storm 3400 на базе новейших процессоров intel xeon e5 2600v3 fin
Cерверы Depo storm 3400 на базе новейших процессоров intel xeon e5 2600v3 fin
 
Ceph Day Seoul - Delivering Cost Effective, High Performance Ceph cluster
Ceph Day Seoul - Delivering Cost Effective, High Performance Ceph cluster Ceph Day Seoul - Delivering Cost Effective, High Performance Ceph cluster
Ceph Day Seoul - Delivering Cost Effective, High Performance Ceph cluster
 
Ceph Day KL - Delivering cost-effective, high performance Ceph cluster
Ceph Day KL - Delivering cost-effective, high performance Ceph clusterCeph Day KL - Delivering cost-effective, high performance Ceph cluster
Ceph Day KL - Delivering cost-effective, high performance Ceph cluster
 
Ceph Day Taipei - Delivering cost-effective, high performance, Ceph cluster
Ceph Day Taipei - Delivering cost-effective, high performance, Ceph cluster Ceph Day Taipei - Delivering cost-effective, high performance, Ceph cluster
Ceph Day Taipei - Delivering cost-effective, high performance, Ceph cluster
 
Sparc t4 systems customer presentation
Sparc t4 systems customer presentationSparc t4 systems customer presentation
Sparc t4 systems customer presentation
 
6° Sessione Oracle - CRUI: Oracle Database Appliance: Il potere dell’ingegner...
6° Sessione Oracle - CRUI: Oracle Database Appliance:Il potere dell’ingegner...6° Sessione Oracle - CRUI: Oracle Database Appliance:Il potere dell’ingegner...
6° Sessione Oracle - CRUI: Oracle Database Appliance: Il potere dell’ingegner...
 
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael GreeneUnleashing Data Intelligence with Intel and Apache Spark with Michael Greene
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene
 
“Quantum” Performance Effects: beyond the Core
“Quantum” Performance Effects: beyond the Core“Quantum” Performance Effects: beyond the Core
“Quantum” Performance Effects: beyond the Core
 
Ceph Day Beijing - Storage Modernization with Intel & Ceph
Ceph Day Beijing - Storage Modernization with Intel & Ceph Ceph Day Beijing - Storage Modernization with Intel & Ceph
Ceph Day Beijing - Storage Modernization with Intel & Ceph
 
A2 a peep into the fastest servers for database middleware and enterprise j...
A2   a peep into the fastest servers for database middleware and enterprise j...A2   a peep into the fastest servers for database middleware and enterprise j...
A2 a peep into the fastest servers for database middleware and enterprise j...
 
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph clusterCeph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
 
POLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overviewPOLYTEDA PowerDRC/LVS overview
POLYTEDA PowerDRC/LVS overview
 

AlgorithmStatus2005FEB

  • 1. Freescale Semiconductor Internal Use Only. Freescale™ and the Freescale logo are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners. © Freescale Semiconductor, Inc. 2004 PowerPC Algorithm Development Update Bo Lin, CPD/NCSG FEB 2005
  • 2. What’s New?
       • Vectorised RSA and modular exponentiation
       • AltiVec and scalar CRC32 (IEEE 802.3 FCS) for wireless LAN
       • Other vendors’ information on AES128
       • Re-arranged presentation
         - Code available
         - Code in planning
  • 3. Theme
       Key messages:
       • PowerPC (core and AltiVec) is very efficient, beyond imagination.
       • Software solutions can provide surprisingly good results.
       • The key is to develop algorithms best suited to PowerPC.
       2005 focus:
       • Vectorised RSA and modular exponentiation
       • Networking support: “dictionary”
  • 4. The Impact of Innovative Algorithm Design
       By properly designing algorithms for PowerPC, performance can be dramatically increased. A good example is EEMBC’s Convolutional Encoder (see www.eembc.com), where the AltiVec-enabled algorithm outperforms the “out-of-box” version by 600 to 1000 times. Another example is that many popular crypto algorithms execute very efficiently on PowerPC.
  • 5. Benchmark information
       • All figures are tested on silicon.
       • Small variances are expected on different test platforms.
       • The figures on the 8641(D) are expected to be better, since the AltiVec improvement on that device provides “out-of-order” execution.
       Devices & Algorithms
       • G4 & AES, DES, 3DES, Kasumi, tdmCRC16, CRC32, RSA
         - MPC7457 @ 1000 MHz
  • 6. What is benchmarked?
       • The cycle count of a function call. For example, we measure how many cycles a G4 or other PowerPC processor needs to complete a function call such as AES128(data, cipher, key).
       • All CPU resources are used, the same benchmarking convention that other vendors follow.
  • 7. POPI Guideline
       Please share figures with customers, but not the whole set of slides.
  • 8. Benchmark information on Encryption (Public key Cryptos)
  • 9. vRSA Performance Figures on G4
       vRSA
       • Data block length: 1024 bits
       • Key length: 1024 bits
       • IEEE P1363 test vector
       Relative performance (cycle count)
       • Signing: 3,720,000 cycles
       Absolute performance on Test Setup
       • 3.72 ms per signing
       Other info: CRT, arbitrary length
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
  • 10. 1024-bit modular exponentiation on G4
       Y = X^E mod N
       • All parameters are 1024 bits long
       Relative performance (cycle count)
       • Modular exp: 12,850,000 cycles
       Absolute performance on Test Setup
       • 12.85 ms per mod exp
       Other info
       • E is usually 160 bits or 256 bits
       • Arbitrary length
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
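The operation Y = X^E mod N benchmarked on slide 10 can be illustrated with the classic square-and-multiply loop. This is only a minimal sketch on 32-bit operands so that every product fits in 64 bits; the function name `modexp32` is hypothetical, and the benchmarked code works on 1024-bit multi-word values and is vectorised, neither of which is shown here.

```c
#include <stdint.h>

/* Hypothetical helper: computes (x^e) mod n by right-to-left
 * square-and-multiply. n must be nonzero. Operands are limited to
 * 32 bits so that each intermediate product fits in a uint64_t. */
uint32_t modexp32(uint32_t x, uint32_t e, uint32_t n)
{
    uint64_t result = 1u % n;   /* handles n == 1 */
    uint64_t base = x % n;

    while (e != 0) {
        if (e & 1u)
            result = (result * base) % n;  /* multiply step */
        base = (base * base) % n;          /* square step */
        e >>= 1;
    }
    return (uint32_t)result;
}
```

The loop runs once per bit of E, which is why the length of the exponent (160, 256 or 1024 bits) matters so much for the cycle count.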
  • 11. Benchmark information on Encryption (Conventional Cryptos)
  • 12. AES128 Performance Figures on G4
       AES128
       • Data block length: 128 bits
       • Key length: 128 bits
       Relative performance (cycle count)
       • Encryption: 336 cycles per 128-bit data block
       • → 0.381 bits/cycle → 2.6 cycles/bit
       • Decryption: 335 cycles per 128-bit data block
       • → 0.382 bits/cycle → 2.6 cycles/bit
       Absolute performance on Test Setup
       • 380.9 Mbps for encryption
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
  • 13. AES128 Benchmarked by Others (reference only)
       http://www.tcs.hut.fi/~helger/aes/rijndael.html
       • Relative performance (cycle count) on 7457 @ 1.25 GHz with gcc
         - Encryption: 385 cycles per 128-bit data block
         - Decryption: 391 cycles per 128-bit data block
       • Relative performance (cycle count) on Pentium M @ 1.33 GHz with gcc
         - Encryption: 348 cycles per 128-bit data block
         - Decryption: 376 cycles per 128-bit data block
       http://www.eskimo.com/~weidai/benchmarks.html
       • Relative performance (cycle count) on Pentium 4 @ 2.1 GHz with Microsoft Visual C++ .NET 2003 (whole program optimization, optimize for speed, P4 code generation)
         - Encryption: 542 cycles per 128-bit data block
  • 14. AES192 Performance Figures on G4
       AES192
       • Data block length: 128 bits
       • Key length: 192 bits
       Relative performance (cycle count)
       • Encryption: 390 cycles per 128-bit data block
       • → 0.328 bits/cycle → 3.0 cycles/bit
       • Decryption: 387 cycles per 128-bit data block
       • → 0.331 bits/cycle → 3.0 cycles/bit
       Absolute performance on Test Setup
       • 328.2 Mbps for encryption
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
  • 15. AES256 Performance Figures on G4
       AES256
       • Data block length: 128 bits
       • Key length: 256 bits
       Relative performance (cycle count)
       • Encryption: 444 cycles per 128-bit data block
       • → 0.288 bits/cycle → 3.5 cycles/bit
       • Decryption: 438 cycles per 128-bit data block
       • → 0.292 bits/cycle → 3.4 cycles/bit
       Absolute performance on Test Setup
       • 288.2 Mbps for encryption
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
  • 16. Kasumi Performance Figures on G4
       Kasumi
       • Data block length: 64 bits
       • Key length: 128 bits
       Relative performance (cycle count)
       • Kasumi operation: 427 cycles per 64-bit data block
       • → 0.150 bits/cycle → 6.67 cycles/bit
       Absolute performance on Test Setup
       • 149.9 Mbps
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
  • 17. DES Performance Figures on G4
       DES
       • Data block length: 64 bits
       • Key length: 56 bits
       Relative performance (cycle count)
       • DES function: 423 cycles per 64-bit data block
       • → 0.151 bits/cycle → 6.61 cycles/bit
       Absolute performance on Test Setup
       • 151.1 Mbps for encryption
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
  • 18. 3DES Performance Figures on G4
       3DES
       • Data block length: 64 bits
       • Key length: 112 bits
       Relative performance (cycle count)
       • 3DES function: 1214 cycles per 64-bit data block
       • → 0.053 bits/cycle → 18.9 cycles/bit
       Absolute performance on Test Setup
       • 52.9 Mbps for encryption
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
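The cipher slides quote three linked figures: cycles per block, bits/cycle, and Mbps at the test clock. The arithmetic connecting them can be sketched as below. The function names are hypothetical, and the conversion assumes, as the slides do, that throughput scales directly with the core clock (1000 MHz on the test setup).

```c
/* Hypothetical helpers relating the three figures on the cipher slides. */

/* Efficiency: bits processed per core cycle. */
double bits_per_cycle(unsigned block_bits, unsigned cycles_per_block)
{
    return (double)block_bits / (double)cycles_per_block;
}

/* Throughput in Mbps: bits/cycle multiplied by the clock in MHz. */
double throughput_mbps(unsigned block_bits, unsigned cycles_per_block,
                       double clock_mhz)
{
    return bits_per_cycle(block_bits, cycles_per_block) * clock_mhz;
}
```

For example, `throughput_mbps(128, 336, 1000.0)` reproduces the AES128 encryption figure of roughly 380.9 Mbps, and `throughput_mbps(64, 427, 1000.0)` the Kasumi figure of roughly 149.9 Mbps.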
  • 19. Absolute Performance (Mbps)
       [Bar chart: tested performance in Mbps on G4/1000 for AES128, AES192, AES256, Kasumi, DES and 3DES; axis from 0 to 400 Mbps]
  • 20. Core & Compiler Efficiency (bits/cycle)
       [Bar chart: core and compiler efficiency in bits/cycle on G4/1000 for AES128, AES192, AES256, Kasumi, DES and 3DES; axis from 0 to 0.4 bits/cycle]
  • 21. Benchmark Information on CRC
  • 22. G4 Performance for CRC32 (IEEE 802.3 FCS)

       Frame size   AltiVec-enabled         Scalar
       (bytes)      cycles   bits/cycle     cycles   bits/cycle
       60           281      1.71           250      1.92
       62           303      1.64           277      1.79
       64           247      2.07           264      1.94
       71           287      1.98           299      1.90
       80           277      2.31           320      2.00
       89           320      2.23           352      2.02
       96           307      2.50           367      2.09
       105          350      2.40           408      2.06
       112          337      2.66           432      2.07
       128          367      2.79           488      2.10
       161          429      3.00           604      2.13
       192          487      3.15           712      2.16
       225          549      3.28           828      2.17
       256          607      3.37           936      2.19
       384          847      3.63           1384     2.22
       449          970      3.70           1613     2.23
       512          1088     3.76           1832     2.24
       768          1568     3.92           2728     2.25
       895          1864     3.84           3183     2.25
       1020         2081     3.92           3610     2.26
       1024         2048     4.00           3624     2.26
       1514         3004     4.03           5354     2.26
       1536         3008     4.09           5417     2.27
       2048         3968     4.13           7208     2.27
       4096         7808     4.20           14377    2.28

       * Frame size is in bytes.
       * The frame sizes shown in colour on the original slide are within the IEEE 802.3 specification.
       Absolute asymptotic performance on Test Setup: AltiVec: 4200 Mbps; Scalar: 2280 Mbps
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
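For reference, the function being timed on slide 22 is the standard IEEE 802.3 CRC32. The following is a minimal bit-at-a-time sketch, assuming the usual reflected polynomial 0xEDB88320 with initial value and final XOR of 0xFFFFFFFF. It is far slower than either the scalar or the AltiVec routine benchmarked above and is not the code from the deck; the name `crc32_ieee` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: bitwise IEEE 802.3 CRC32
 * (reflected polynomial 0xEDB88320, init 0xFFFFFFFF, final XOR). */
uint32_t crc32_ieee(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int bit;

    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

A routine like this is mainly useful as a correctness oracle when checking an optimised implementation against known frames.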
  • 23. G4 Performance Figures for multi-channel CRC16
       tdmCRC16
       • 16.35 Gbps for processing 16 channels simultaneously, AltiVec enabled
       • 16.35 bits/cycle → 0.061 cycles/bit
       BL: This function reads 16 equal-length packets arranged in columns and calculates all 16 CRC16 values simultaneously. One potential application is CRC8, to calculate the HECs of ATM cell headers. Even higher performance is expected.
       Other test case benchmark figures are available upon request.
       Test Setup: CPU: 7457 without L3 @ 1 GHz; Language: ANSI C; Compiler: gcc 3.3 on Linux
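The column arrangement described for tdmCRC16 can be sketched in scalar C: byte k of channel c sits at offset k*16 + c, so one pass over the buffer advances all 16 CRCs in lockstep. The slide does not state which CRC16 polynomial or initial value is used; the sketch below assumes the CCITT polynomial 0x1021 with a zero initial value, and the name `tdm_crc16` is hypothetical. An AltiVec version would keep the 16 running CRCs in vector registers instead of an array.

```c
#include <stddef.h>
#include <stdint.h>

#define CHANNELS 16

/* Hypothetical scalar sketch: computes one CRC16 per channel over
 * column-interleaved data. Assumed parameters (not from the slides):
 * polynomial 0x1021, initial value 0, no reflection, no final XOR. */
void tdm_crc16(const unsigned char *interleaved, size_t packet_len,
               uint16_t crc_out[CHANNELS])
{
    size_t k;
    int c, bit;

    for (c = 0; c < CHANNELS; c++)
        crc_out[c] = 0;

    for (k = 0; k < packet_len; k++) {
        for (c = 0; c < CHANNELS; c++) {
            /* Byte k of channel c lives in row k, column c. */
            uint16_t crc = (uint16_t)(crc_out[c] ^
                           ((uint16_t)interleaved[k * CHANNELS + c] << 8));
            for (bit = 0; bit < 8; bit++) {
                if (crc & 0x8000u)
                    crc = (uint16_t)((crc << 1) ^ 0x1021u);
                else
                    crc = (uint16_t)(crc << 1);
            }
            crc_out[c] = crc;
        }
    }
}
```

Because every channel performs the same operations on its own column, the inner channel loop is the part that maps naturally onto 16 SIMD lanes.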
  • 24. Absolute Asymptotic Performance (Mbps)
       [Bar chart: tested performance in Mbps on G4/1000 for tdmCRC16, vCRC32 and sCRC32; axis from 0 to 18000 Mbps]
  • 25. Core & Compiler Efficiency (bits/cycle)
       [Bar chart: core and compiler efficiency in bits/cycle on G4/1000 for tdmCRC16, vCRC32 and sCRC32; axis from 0 to 18 bits/cycle]
  • 26. Other Information on Benchmark, Codes, etc
  • 27. Performance Comparison with scalar version routines: checksum()
       [Line chart: MB/second (0 to 2500) versus bytes checked (32 to 7936) for three implementations: AltiVec, Linux assembly, and C]
       Test setup: 7445 @ 1 GHz, MPC107 with 133 MHz bus
       36.5 bits/cycle
       More improvement possible
  • 28. Code for benchmarking and evaluation
       1. Encryption
          1. DES/3DES
          2. AES128/192/256
          3. Kasumi
          4. RC5
          5. RSA/modular exp
       2. CRCs
          1. CRC32/24/16/12/8
       3. EEMBC (www.eembc.com)
          1. Telecom suite
          2. Networking suite
          3. Consumer suite
       4. 3GPP TS25.212 (baseband symbol rate processing)
          1. Report available upon request
  • 29. AltiVec/e600 offering
       1. Code for Web download: ready to use
          • Fully tested code and API complying with, say, ANSI C or Linux
          • http://www.freescale.com/altivec
       2. Prototype code: customer evaluation & integration
          • Developed, benchmarked, and verified against a set of test vectors.
          • For example, the AES code can reproduce the test vectors in FIPS 197
          • IEEE FCS/CRC32 generates correct results against IEEE 802.11
          • 1024-bit RSA reproduces the IEEE P1363 test vectors
          • Some restrictions may apply.
          • For example, AltiVec-enabled code may require the data buffer to be quadword aligned.
          • Documentation may not be complete
       3. Code in pipeline: under development and planned
          • Code committed to be in “prototype” state
          • Customer input welcome
  • 30. What is in the AltiVec-enabled libc library?
       strlen, strcmp, strcpy, strcmp, strncpy, memcpy, memmove, memcmp, memchr, bcopy*, bzero*
       * Not ANSI C, but present in many implementations (e.g. VxWorks)
  • 31. Other networking functions from Linux that can be done in AltiVec
       • __copy_tofrom_user
       • csum_partial
       • csum_partial_copy_generic
       • page_copy
  • 32. How are the AltiVec-enabled libc libraries used?
       Freescale provides an archive (binary file) that can be linked with existing object files.
       Dhrystone example:
           ld dhry21a.o dhry21b.o c:/sw/libmoto/libmotovec.a c:/gcc/lib/gcc-lib/powerpc-eabisim2.95.2libgcc.a -( -lsim -lc ) -o gccBM.elf
       Put libmotovec.a on the linker command line before the compiler’s libc library.
       The binary library is EABI compliant, i.e. independent of the compiler used.
       • Proven with gcc, Green Hills, Diab, Metaware, Metrowerks
  • 33. Planned Code (customer input welcome)
  • 34. S/W solution table lookup (“Dictionary”) performance estimate
       • Main feature
       • One underlying data structure to support mixed instructions consisting of lookup, LPM, range search, insert, delete
       • Key technology
       • Planar subdivided dictionaries
       • Linear hashing ax mod M under 100 cycles
       • For x with variable length (say, 64 or 128 bytes)
         - 64 bytes = 4 vectors
         - 4 vec_msum()’s + 4 vec_sums() + 3 mul()’s + 3 add()’s + 1 div() ≈ 20 cycles
       • AltiVec enabled
       • LPM and range search under 1000 cycles
       • The above estimate is based on others’ implementation on Pentium M
       • No restriction on key length, i.e. works for IPv4, IPv6, or very long keys.
         - Performance is related to the number of entries
       • On-the-fly table update
       • Benchmark information scheduled for July 2005
         - Co-operation with customers?
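The “ax mod M” hashing step on slide 34 can be read as a dot product of the key with a vector of multipliers, reduced modulo the table size; this matches the mention of vec_msum(), which is a multiply-sum operation. The scalar sketch below follows that reading, which is an assumption rather than something the slide states. The function name `linear_hash`, the byte-wise treatment of the key, and the 32-bit multipliers are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar sketch of h(x) = (a . x) mod M, where the key x
 * is treated as a sequence of bytes and a[] holds one multiplier per
 * byte. m must be nonzero. The 64-bit accumulator cannot overflow for
 * the key lengths discussed (64 or 128 bytes). */
uint32_t linear_hash(const unsigned char *key, size_t len,
                     const uint32_t *a, uint32_t m)
{
    uint64_t acc = 0;
    size_t i;

    for (i = 0; i < len; i++)
        acc += (uint64_t)a[i] * key[i];   /* multiply-accumulate */

    return (uint32_t)(acc % m);           /* single final reduction */
}
```

Deferring the modulo to a single final step mirrors the operation count on the slide, where one div() follows the multiply-sum work.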
  • 35. Freescale Semiconductor Confidential Proprietary. Freescale™ and the Freescale logo are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners. © Freescale Semiconductor, Inc. 2004

Editor's Notes

  1. Unfortunately libmotovec is currently a library without many “books.” However, we believe that the functions that we do have are important for applications that do a lot of data movement through the processor’s register files. The wider width and improved bandwidth to cache of the AltiVec register set allow functions like memcpy or memset in AltiVec to be faster than the same function using the general purpose register file.
  2. Some data movement functions are very similar to memcpy but go by a different name, such as __copy_tofrom_user, which is found in the Linux kernel and copies data from kernel space to user space and vice versa. Our AltiVec-enabled memcpy can easily be modified to speed up these functions as well. More importantly, if there is other work to be done on the data being moved, such as calculating a checksum, that work can be largely hidden under the memory latency of memcpy. Many applications already exploit this and provide functions like the Linux checksum calculation or checksum-while-copying function. Page_copy is a trivial example: it copies 4 KB pages of memory from one location to another.
  3. The library has proven very easy to use. A customer’s existing object modules can be linked with the AltiVec-enabled library by inserting the library on the linker command line ahead of the compiler’s libc library. The customer’s object modules then link to the symbols in the AltiVec library instead of the compiler-supplied functions.
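The checksum-while-copying idea in note 2 can be sketched in portable C: each 16-bit word is added to a running ones'-complement sum as it is copied, so the data is touched only once. This is a simplified illustration using the Internet checksum with big-endian word order; the name `copy_and_checksum` is hypothetical, and it is neither the Linux csum_partial_copy_generic routine nor the AltiVec implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: copies len bytes from src to dst and returns the
 * 16-bit Internet checksum (ones'-complement of the ones'-complement
 * sum of big-endian 16-bit words) of the copied data. */
uint16_t copy_and_checksum(unsigned char *dst, const unsigned char *src,
                           size_t len)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {                  /* odd trailing byte, zero padded */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    while (sum >> 16)               /* fold carries back into 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

On real hardware the additions cost little extra because the loop is bound by memory latency, which is the point the note makes.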