The webinar provided an overview of how the Intel Xeon E5-2600 processor and Solarflare network adapters can achieve the lowest latencies at the highest message rates. The agenda included details on the Intel Xeon E5-2600 platform features like integrated I/O and Data Direct I/O that reduce latency. Solarflare's adapters and OpenOnload software were presented as optimizing performance. It was emphasized that the combination of the Intel processors and Solarflare products can deliver the best performance through features like reduced jitter and increased message rates. The webinar concluded with a Q&A session.
3. AGENDA
• Intel
– Xeon® Processor E5-2600
– Platform I/O enhancements
• Solarflare
– 10GbE server adapters
– OpenOnload
• How to achieve the best performance
– Intel Xeon E5-2600 + Solarflare SFN6122F: winning combination
• Q&A
June 7, 2012 Slide 3
4. Intel® Xeon® Processor E5-2600 Product Family
The Heart of a Next-Generation Data Center
Leading Performance
Up to 80% performance boost over Intel® Xeon® processor 5600 series-based servers¹
Best combination of performance, power efficiency, and cost
Flexible & Efficient
Advanced features automate power consumption across the platform
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to intel.com/performance
¹ Performance comparison using best submitted/published 2-socket server results on the SPECfp*_rate_base2006 benchmark as of 6 March 2012. Configuration details in backup.
5. Intel® Xeon® Processor E5-2600 Product Family
Reduce Bottlenecks With Intel® Integrated I/O
Would you put a racecar engine in this… or this?
[Diagram: Intel® Integrated I/O. Xeon E5-2600 with cores 1 to 8, cache, and integrated PCI Express* 3.0]
* Other names and brands may be claimed as the property of others
6. Intel® Xeon® Processor E5-2600 Product Family
New Intel® Integrated I/O
• 1st server processor with Intel® Integrated I/O
• Reduces I/O latency by as much as 30%¹
• Improves I/O bandwidth by as much as 2x with PCI Express* 3.0 support²
¹ Source: Intel internal measurements of average time for an I/O device read to local system memory under idle conditions comparing Intel® Xeon® processor E5-2600 product family (230 ns) vs. Intel® Xeon® processor 5500 series (340 ns). See notes in backup for configuration details.
² Source: 8 GT/s and 128b/130b encoding in PCIe* 3.0 specification enables double the interconnect bandwidth over the PCIe* 2.0 specification (www.pcisig.com/news_room/November_18_2010_Press_Release/).
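The two headline figures on this slide can be checked with simple arithmetic. The sketch below uses only the numbers quoted in the footnotes (230 ns vs. 340 ns, and 8 GT/s with 128b/130b encoding). The PCIe 2.0 parameters (5 GT/s with 8b/10b encoding) are not stated in the slides and are filled in here as an assumption from the published PCIe 2.0 specification. Function names are illustrative.

```python
# Figures quoted in the slide footnotes.
LATENCY_E5_2600_NS = 230.0
LATENCY_5500_NS = 340.0


def latency_reduction(old_ns: float, new_ns: float) -> float:
    """Fractional reduction in latency when moving from old_ns to new_ns."""
    return (old_ns - new_ns) / old_ns


def lane_payload_gbps(gt_per_s: float, payload_bits: int, total_bits: int) -> float:
    """Usable Gb/s per lane per direction after line-encoding overhead."""
    return gt_per_s * payload_bits / total_bits


reduction = latency_reduction(LATENCY_5500_NS, LATENCY_E5_2600_NS)

# Assumption (not in the slides): PCIe 2.0 runs at 5 GT/s with 8b/10b encoding.
pcie2_gbps = lane_payload_gbps(5.0, 8, 10)
# From the footnote: PCIe 3.0 runs at 8 GT/s with 128b/130b encoding.
pcie3_gbps = lane_payload_gbps(8.0, 128, 130)
bandwidth_ratio = pcie3_gbps / pcie2_gbps

if __name__ == "__main__":
    print(f"I/O read latency reduction: {reduction:.1%}")
    print(f"PCIe 2.0 per-lane payload rate: {pcie2_gbps:.2f} Gb/s")
    print(f"PCIe 3.0 per-lane payload rate: {pcie3_gbps:.2f} Gb/s")
    print(f"PCIe 3.0 / PCIe 2.0: {bandwidth_ratio:.2f}x")
```

Under these assumptions the latency reduction works out to roughly 32% and the per-lane bandwidth ratio to just under 2x, which is consistent with the "as much as 30%" and "as much as 2x" wording on the slide.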
7. Intel® Xeon® Processor E5-2600 Product Family
New Intel® Data Direct I/O Technology
(Intel® DDIO)
• Can more than double I/O performance¹
• Send I/O directly to and from processor cache for all I/O traffic types
• Can allow system memory to remain in low power state
• Reduce latency by eliminating unneeded trips to memory
[Chart: transactions per second, Xeon 2600 Family vs. Xeon 5600 Series]
¹ Up to 2.3x I/O performance is 1S with a Xeon processor 5600 series vs. 1S Xeon Processor E5-2600 data for L2 forwarding test using 8x10GbE ports. See notes in backup for configuration details.
8. Intel® Xeon® Processor E5-2600 Product Family
The Heart of a Next-Generation Data Center
• Up to 80% performance boost vs. prior gen¹
• Dramatically reduce compute time with Intel® Advanced Vector Extensions
• Performance when you need it with Intel® Turbo Boost Technology 2.0
• Up to 8 cores
• Up to 20 MB cache
• Up to 4 channels DDR3 1600 MHz memory
• Integrated PCI Express* 3.0
• Up to 40 lanes per socket
• Intel® Integrated I/O with Intel® Data Direct I/O cuts latency² while adding capacity & bandwidth
¹ Performance comparison using best submitted/published 2-socket server results on the SPECfp*_rate_base2006 benchmark as of 6 March 2012.
² Source: Intel internal measurements of average time for an I/O device read to local system memory under idle conditions comparing Intel® Xeon® processor E5-2600 product family (230 ns) vs. Intel® Xeon® processor 5500 series (340 ns). See notes in backup for configuration details.
9. Introducing Solarflare
• Focused on high performance network solutions
– Server adapters and software
– Supporting mission critical applications
• Trading / Market Data
• HPC Storage
• Cloud / Virtualization
• Big Data
• Leader in the Financial Services
– Powering Tier1 global exchanges
– Many top commercial banks / trading firms
• Growing position in Media / HPC / Oil & Gas
• World class delivery
– Global OEM/VAR and distributors
– Direct 24x7 Global support
“Solarflare’s product, EnterpriseOnload is a robust, rigorously tested and fully supported solution that addresses our demanding support and service level requirements. In addition to providing the highest-performance, lowest-latency hardware, Solarflare’s unique and innovative application acceleration software can be used to deploy quickly without any need to re-write our applications.”
Andrew Bach, Senior Vice President of Network Services for NYSE Euronext
10. Solarflare Server Adapters
• Full range of products
– Common driver support
– Onload Server Adapter product line
• Delivers best latency performance
– Performant Server Adapter product line
• Optimized for Virtualization, Cloud, HPC, Grid
• High performance
– Rich set of stateless off-loads
• LRO, TSO, RSS, RFS
– Microarchitecture designed for low latency
– Cut Through State Machine Centric Data Path
• Highly scalable virtualized architecture
– 2048 virtual NIC instances
– SR-IOV
• Lowest power in the industry
– <2.5W/port SFP+
[Pictured adapters: Dual Port SFP+; Single Port SFP+; Dual Port 10GBASE-T; Single Port 10GBASE-T; Dual Port SFP+ Precision Time; Quad Port IBM Mezzanine Card; Dual Port Dell DCS Card; HP Blade Mezz Card]
11. Precision Time Adapters
• Adapters implement IEEE 1588 PTP to provide precision host clock synchronization
– Hardware time stamping of PTP packets
– Stratum 3 oscillator maintains high degree of precision
– Solarflare provided (and maintained) PTPd stack
– Open Platform (for 3rd party PTPd stack compatibility)
– Compatible with standard Solarflare drivers
• Two stage approach provides unmatched accuracy and stability
– Server clock synchronized to precision Stratum 3 adapter clock
– Adapter clock synchronized to server clock
– Maintains <+/- 200ns accuracy
[Image: SFN6322F]
• SFN6322F PTP server adapter
– Based on SFN6122F
• Same performance and latency characteristics
• Compatible with OpenOnload
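To make the role of hardware time stamping concrete, here is a minimal sketch of the offset and path-delay arithmetic used by the IEEE 1588 delay request-response mechanism. It is not taken from Solarflare's PTPd stack. The class and field names are hypothetical, and the formulas assume the network path is symmetric in both directions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PtpExchange:
    """Four timestamps from one Sync / Delay_Req exchange, in nanoseconds."""
    t1_ns: int  # master sends Sync (master clock)
    t2_ns: int  # slave receives Sync (slave clock)
    t3_ns: int  # slave sends Delay_Req (slave clock)
    t4_ns: int  # master receives Delay_Req (master clock)


def mean_path_delay_ns(x: PtpExchange) -> float:
    """One-way delay estimate, assuming a symmetric path."""
    return ((x.t2_ns - x.t1_ns) + (x.t4_ns - x.t3_ns)) / 2


def offset_from_master_ns(x: PtpExchange) -> float:
    """How far the slave clock is ahead of the master clock."""
    return ((x.t2_ns - x.t1_ns) - (x.t4_ns - x.t3_ns)) / 2


# Illustrative numbers only: a slave running 150 ns ahead over a 500 ns path.
example = PtpExchange(t1_ns=1_000_000, t2_ns=1_000_650,
                      t3_ns=1_002_000, t4_ns=1_002_350)

if __name__ == "__main__":
    print("mean path delay (ns):", mean_path_delay_ns(example))
    print("offset from master (ns):", offset_from_master_ns(example))
```

Any error in the four timestamps feeds directly into the computed offset, which is presumably why the slide stresses stamping PTP packets in hardware rather than in software.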
12. OpenOnload® Application Acceleration Software
• Application Acceleration
• TCP/IP, UDP and multicast acceleration
• Streamlines and reduces interrupts, context switches and data copies
• Reduces latency by 50%, increases message rates 3x or more
• Seamlessly integrates into existing infrastructure
• Binary compatible with industry standard APIs
• No software modifications are needed
• Standards-based solution uses TCP/IP and UDP
• No specialized protocols needed
• Compatible with existing Ethernet infrastructure
• Open source GPLv2 / LGPL
• Global 24x7 support available
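As an illustration of what "binary compatible with industry standard APIs" means in practice, the sketch below is an ordinary UDP echo round trip written against the standard sockets API, with nothing OpenOnload-specific in it. The function names are hypothetical, and the loopback address is used only to keep the example self-contained.

```python
import socket
import threading


def run_echo_once(server_sock: socket.socket) -> None:
    """Receive one datagram and send it straight back to the sender."""
    data, addr = server_sock.recvfrom(2048)
    server_sock.sendto(data, addr)


def udp_round_trip(payload: bytes, timeout_s: float = 2.0) -> bytes:
    """Send payload to a local echo thread over UDP and return the reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    server.settimeout(timeout_s)
    worker = threading.Thread(target=run_echo_once, args=(server,))
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(timeout_s)
    try:
        client.sendto(payload, server.getsockname())
        reply, _ = client.recvfrom(2048)
    finally:
        client.close()
        worker.join()
        server.close()
    return reply


if __name__ == "__main__":
    print(udp_round_trip(b"ping"))
```

Assuming OpenOnload is installed, an unmodified program like this would be started through its `onload` launcher (for example `onload python3 echo.py`, where `echo.py` is a hypothetical file name) rather than being rewritten. Traffic would presumably need to cross a Solarflare adapter, not loopback, to see any benefit.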
13. SFN6122F & Xeon E5-2600 Deliver Winning Combination
“Lowest latency at highest message rate”
• SFN6122F single-stream latency is superb over all message rates on Romley platforms, right up to the point of CPU core utilization
• Ultra-low jitter (sub-microsecond at the 99th percentile)
• Benefits from Intel® Data Direct I/O (DDIO) and chipset IO – memory bandwidth
• Message rate headroom – 20Mpps with 4x sfnt-streams
[Chart: sfnt-stream / openonload-201109-u2]
“Westmere” = 2x Xeon 5687 (3.6GHz)
“Romley” = 2x E5-2687W (3.1GHz) – DDR 1333
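Claims about jitter "at the 99th percentile" are easier to compare when the statistic is spelled out. The sketch below computes a nearest-rank percentile over a list of latency samples and reports the gap between the median and the 99th percentile. Reading "jitter" as that gap is an assumption made here, not a definition from the slides, and the sample values are invented for illustration rather than taken from any measurement.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def jitter_summary(latencies_us):
    """Median, 99th percentile, and their difference, all in microseconds."""
    median = percentile(latencies_us, 50)
    p99 = percentile(latencies_us, 99)
    return {"median_us": median, "p99_us": p99, "p99_minus_median_us": p99 - median}


# Invented samples: mostly 2.0 us, a few at 2.3 us, and one 9.0 us outlier.
illustrative_samples_us = [2.0] * 90 + [2.3] * 9 + [9.0]

if __name__ == "__main__":
    print(jitter_summary(illustrative_samples_us))
```

With these made-up samples the single 9.0 µs outlier sits above the 99th percentile, so the reported spread stays sub-microsecond. A percentile figure therefore says nothing about the worst case.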
17. What are the causes of latency jitter?
• Resource contention
– Threads fighting for access to CPU
– Threads fighting for access to critical sections
– Running out of memory!
– Fix this by dedicating resources to critical threads, including:
• Memory
• CPU cores
• Onload stacks
• Queuing delays
– If you’re keeping up with the incoming rate, latency is generally good
– If you fall behind, you get queuing delays
– Fix this by:
• Making each thread more efficient (hard)
• Going parallel / hardware assist (very hard)
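The "keeping up vs. falling behind" point can be seen in a toy model. The sketch below simulates a single thread draining a FIFO queue with perfectly regular arrivals and a fixed per-message processing time. Both are simplifying assumptions, and the function name and numbers are illustrative only.

```python
def queueing_delays(arrival_interval_us, service_time_us, n_messages):
    """Time each message waits before processing starts (single thread, FIFO)."""
    delays = []
    thread_free_at = 0.0
    for i in range(n_messages):
        arrival = i * arrival_interval_us
        start = max(arrival, thread_free_at)  # wait if the thread is still busy
        delays.append(start - arrival)
        thread_free_at = start + service_time_us
    return delays


if __name__ == "__main__":
    # Keeping up: 0.8 us of work per message, one message every 1.0 us.
    print(queueing_delays(1.0, 0.8, 5))
    # Falling behind: 1.25 us of work per message, one message every 1.0 us.
    print(queueing_delays(1.0, 1.25, 5))
```

In the first case no message ever waits. In the second, each message waits 0.25 µs longer than the one before it, so the delay grows without bound for as long as the overload lasts. Making the thread more efficient or spreading the load in parallel both attack this by bringing the service time back under the arrival interval.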
18. Moving to the new platform?
• Switching from SFN5xxx to SFN6xxx, or from Westmere to Romley?
– To first order, nothing changes
• Same methodology for Onload tuning
– But be aware of PCIe slot affinitisation
• Westmere 2-processor machines: shared IOH, symmetric performance
• Romley 2-processor machines: asymmetric performance
[Diagram: Westmere 2xCPU (sockets S1 and S2, IOH, NICs N1 and N2) vs. Romley 2xCPU (sockets S1 and S2, NICs N1 and N2)]
19. Additional Romley Tuning
• Check that the NIC is plugged into a PCIe slot which is NUMA local to the application threads which are processing data from that NIC
• If using interrupts, check that interrupts are directed to a core on the same NUMA node
• If running RT, ensure soft-irq threads are pinned to the same core as the interrupts (start with nothing pinned!)
[Diagram: Westmere 2xCPU (sockets S1 and S2, IOH, NICs N1 and N2) vs. Romley 2xCPU (sockets S1 and S2, NICs N1 and N2)]
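One way to carry out the first check on Linux is to read the NUMA locality that the kernel exposes for the NIC's PCI device under sysfs. The sketch below assumes a Linux host. The interface name `eth0` and the helper names are placeholders, and the `sysfs_root` parameter exists only so the functions can be exercised against a fake directory tree.

```python
from pathlib import Path


def parse_cpulist(text):
    """Turn a kernel CPU list such as '0-7,16-23' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def nic_locality(iface, sysfs_root="/sys"):
    """Return (numa_node, local_cpus) for a network interface's PCI device.

    The kernel reports numa_node as -1 when it has no locality information.
    """
    device = Path(sysfs_root) / "class" / "net" / iface / "device"
    node = int((device / "numa_node").read_text().strip())
    cpus = parse_cpulist((device / "local_cpulist").read_text())
    return node, cpus


def is_numa_local(app_cpus, nic_cpus):
    """True if every CPU the application threads run on is local to the NIC."""
    return set(app_cpus) <= set(nic_cpus)


if __name__ == "__main__":
    node, cpus = nic_locality("eth0")  # placeholder interface name
    print(f"NIC is on NUMA node {node}; local CPUs: {sorted(cpus)}")
```

Comparing the returned CPU set with the cores the application threads are pinned to gives a quick yes/no answer before any latency measurements are taken.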
20. How to achieve the best performance - Intel
Maximizing Performance involves “System Level” optimizations
• OEM BIOS Settings: SMI, HyperThreading, C-States: all off
– Experiment with EIST & Turbo On/Off
• On the application: Maximize your resources by…
1. Pin threads, interrupts, and processes to individual cores using CPU_ID
2. Place threads for “communication” functions on adjacent cores
3. Use PCM to determine L3 Cache Misses & Keep data in L3 Cache
http://software.intel.com/file/41604
4. Compile w/Performance Settings, Use PGO, Evaluate IPP / SSE 4.2 Strings
http://software.intel.com/en-us/articles/using-avx-without-writing-avx-code/
• Determine how many cores your trading strategy requires
1. Can it run on 8 cores? If so, match up CPU+NIC per strategy
https://access.redhat.com/knowledge/solutions/53031
Enlist Solarflare and Intel for help. We are eager to engage.
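Pinning to an individual core can be sketched with the Linux scheduler affinity calls that Python exposes in its `os` module. This is a minimal, Linux-only illustration with a hypothetical helper name. It covers only the calling thread, not interrupts, which on Linux are steered separately (for example through the per-IRQ affinity files under `/proc/irq`).

```python
import os


def pin_to_core(cpu_id):
    """Restrict the caller to a single CPU core and return the resulting mask.

    Linux only: os.sched_setaffinity is not available on every platform.
    Passing 0 as the first argument means "the caller".
    """
    allowed = os.sched_getaffinity(0)
    if cpu_id not in allowed:
        raise ValueError(f"CPU {cpu_id} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu_id})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    before = os.sched_getaffinity(0)
    target = min(before)  # illustrative choice: lowest-numbered allowed core
    print("allowed before:", sorted(before))
    print("allowed after: ", sorted(pin_to_core(target)))
```

In a real deployment the target core would be chosen from the cores that are NUMA local to the NIC, and the same affinity would normally be set from the application's own language or launcher, not from Python.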
21. Join The Conversation & Find Support
• Find support from Intel & Others @finteligent
• Debate critical industry questions
• Interact with your peers across the globe.