What's new in the x86 platform for High Performance, by Jefferson de A Silva


  1. What's new in the x86 platform for High Performance. Jefferson de A Silva, Systems Management & Product Specialist, [email_address]
  2. Why high-performance systems?
  3. The technology trend – every year we get faster (now via more processors): the breakdown in frequency scaling
     - Around the beginning of 2003, processor frequency scaling hit its limit
     - Had the past trajectory continued, we would be above 10 GHz today!
     - Historically, higher frequency improved single-threaded performance
     - Multi-core only improves software performance when the number of executing threads can be increased
     [Chart: Intel CPU Trends (sources: Intel, Wikipedia, K. Olukotun), spanning the 386, Pentium, Xeon Paxville, and Montecito]
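The last bullet can be made concrete with Amdahl's law, a sketch not taken from the slides (the 50% parallel fraction and 8-core count below are purely illustrative): extra cores only speed up the fraction of a program that can actually run in parallel threads.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can use extra cores."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

# A program that is only 50% parallelizable gains little from 8 cores...
print(round(amdahl_speedup(0.5, 8), 2))   # 1.78
# ...while a fully threaded one scales with the core count.
print(round(amdahl_speedup(1.0, 8), 2))   # 8.0
```

This is exactly the slide's point: without more runnable threads, the serial fraction caps the benefit of multi-core no matter how many cores are added.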
  4. Capacity Planning Aspects in x86 Environments (agenda)
     - Network Subsystem Performance Update
       - TOE, IOAT technology overview
       - TOE and IOAT Ethernet throughput
     - Storage Subsystem Performance Update
       - 2.5” vs. 3.5” disk effects on performance
     - Memory Subsystem Performance Update
       - Memory operation fundamentals
         - Latency vs. bandwidth
       - DDR2 & FBDIMM memory performance
     - CPU Technology & Performance Update
       - Snoop filter performance overview
       - Multi-core processor performance update
       - New processor architecture changes and performance
         - AMD® Opteron® Next Generation
         - Intel Core® (Xeon® 5100 Woodcrest)
         - Intel Tulsa
       - Clovertown performance update
     - X Architecture Overview
     - Performance per Watt
       - What is performance/watt?
       - How does the Xeon 5100 Series (Woodcrest) perform?
     - Product Positioning
       - How to position Xeon vs. Opteron products?
  5. Capacity Planning Aspects in x86 Environments
     - TOE and IOAT (TCP/IP Offload, I/O Acceleration Technology)
     - 2.5” vs. 3.5” disks
     - SDRAM, DDR, DDR2, and FBD
     - CPU (multi-core, new architectures)
     - VT (on-chip and in software)
     - Power consumption
  6. Capacity Planning Aspects in x86 Environments – Network Architecture with a Standard NIC
     - Potential bottlenecks:
       - Interrupt processing and multiple memory accesses by the CPU
       - TCP protocol processing
       - CPU memory copies
     [Diagram: data path from the LAN through the chipset, memory, and CPUs]
  7. Capacity Planning Aspects in x86 Environments – TCP/IP Offload (TOE)
     - Benefits:
       - Less code processing by the CPU
       - Fewer CPU data copies
     [Diagram: TOE-enabled NIC handling TCP processing before data crosses the chipset, memory, and CPUs]
  8. Capacity Planning Aspects in x86 Environments – Intel I/O Acceleration Technology (IOAT)
     - Benefits:
       - Few, if any, data copies by the CPU
       - The first version only helps receive performance, since copies are offloaded only for frames moving from TCP/IP space to application space
     [Diagram: IOAT data path from the LAN through the chipset and memory to the CPUs]
  9. Capacity Planning Aspects in x86 Environments – And We Still Have the Capacity vs. Speed Trade-off
     [Diagram: three parallel-bus memory controllers at 400 MHz, 533 MHz, and 667 MHz; as bus speed rises, fewer DIMMs can be attached per channel. Not representative of any particular system; intended to illustrate speed and DIMM-count limitations]
  10. FBDIMM Solves This Problem with a Serial Memory Bus and an On-DIMM Advanced Memory Buffer (AMB)
     [Diagram: memory controller with separate serial address and data buses chained through the AMB on each DIMM; the DRAM technology itself is still DDR2]
  11. The FBDIMM Serial Bus Adds Latency Due to Hops
     [Diagram: the address travels out and the data travels back through the AMB of every DIMM between the memory controller and the target]
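The hop penalty can be sketched with a toy model (not from the deck; the base latency and per-AMB hop cost below are assumed round numbers, not measurements): each DIMM sitting between the controller and the target adds one AMB pass-through on the outbound address path and one on the return data path.

```python
def fbdimm_read_latency_ns(dimm_index, base_ns=55.0, hop_per_amb_ns=3.0):
    """Toy estimate: base DRAM access plus two AMB traversals per intervening hop."""
    return base_ns + 2 * dimm_index * hop_per_amb_ns

# Latency grows with the DIMM's position on the serial chain.
for i in range(4):
    print(f"DIMM {i}: {fbdimm_read_latency_ns(i):.0f} ns")
```

The model only illustrates the shape of the effect the slide describes: unloaded latency rises linearly with chain position, which is the price paid for the capacity the serial bus buys.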
  12. Additional Memory Channels = Greater Capacity and Greater Throughput, Which Offsets the Additional Latency Under Load
     [Diagram: a DDR2 memory controller with fewer channels (less memory bandwidth) vs. an FBD memory controller with more channels and more DIMMs (greater memory bandwidth)]
  13. Measured DDR2 vs. FBD Memory Throughput
     [Chart: FBD shows a 39% increase in one measured configuration and a 2.8x increase in another]
  14. CPU Bottleneck Performance Fundamentals – Potential Processor Bottlenecks
     - Core intensive – the processor executes instructions as fast as the CPU core can process them
     - Latency intensive – the processor executes instructions as fast as memory latency allows
     - Bandwidth intensive – the processor executes instructions as fast as memory bandwidth allows
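One way to make these categories operational is a roofline-style check, a framing borrowed from later performance literature rather than from the slides (the peak compute and bandwidth figures below are placeholders): compare a kernel's arithmetic intensity against the machine's two ceilings.

```python
def classify_kernel(flops_per_byte, peak_gflops, peak_bw_gbs):
    """Roofline-style check: which ceiling caps a kernel's throughput?"""
    bw_limited_gflops = flops_per_byte * peak_bw_gbs
    if bw_limited_gflops < peak_gflops:
        return "bandwidth intensive"
    return "core intensive"

# Streaming a large array (few flops per byte moved) hits the memory-bandwidth roof:
print(classify_kernel(0.25, peak_gflops=20.0, peak_bw_gbs=10.0))  # bandwidth intensive
# A dense compute kernel (many flops per byte) hits the core roof:
print(classify_kernel(8.0, peak_gflops=20.0, peak_bw_gbs=10.0))   # core intensive
```

The latency-intensive case falls outside this simple two-ceiling model: it shows up when neither roof is reached because accesses are dependent and irregular, so the pipeline stalls waiting on individual loads.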
  15. Xeon vs. Opteron Performance Fundamentals – Dual-Core System Design
     - Core intensive – Woodcrest, Clovertown, and Tulsa win by as much as 20+%; X3 Xeon wins
     - Bandwidth intensive – Woodcrest and Opteron are about the same
     - Latency intensive – Opteron wins by as much as 2x
  16. Xeon Coherency Protocol – CPU Snoop Request
     - On a cache miss, the memory controller reads the data and snoops all other processor caches
     [Diagram: four processors on memory bridges, the memory controller, PCI bridges, and an I/O controller (USB, IDE, SATA, etc.)]
  17. Xeon Coherency Protocol – CPU Snoop Response
     - Only once the snoop responses return can the processor operate on the data!
     [Diagram: same four-processor topology as the previous slide]
  18. Xeon Coherency Protocol – DMA Snoop Request
     - On a DMA read, the memory controller reads the data and snoops all processor caches
     [Diagram: same four-processor topology, with the request originating at the I/O controller]
  19. Xeon Coherency Protocol – DMA Snoop Response
     - Only once the snoop responses are returned can memory be accessed!
     [Diagram: same four-processor topology]
  20. AMD Architecture – Local Memory Access
     - On a processor cache miss, a local memory read happens fast – this low latency is well publicized
     [Diagram: four AMD Opteron CPUs linked by coherent HyperTransport at 6.4 GB/s, with PCI-X bridges (100/133 MHz) attached via HyperTransport]
  21. AMD Architecture – Local Memory Access (continued)
     - The local memory read happens fast – this low latency is well publicized
     - But the processor cannot use the data until ALL snoops complete
     - In a 4-way system there are always two hops for snoops
     [Diagram: CPU 0 snooping CPUs 1, 2, and 3 over HyperTransport and collecting their snoop responses]
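The two-hop claim can be checked on a sketch of the 4-socket topology. The link layout below is an assumption (the common square arrangement, where each socket has coherent HyperTransport links to only two neighbors, so the diagonal peer is two hops away):

```python
from collections import deque

# Sockets 0-3 in a square; each has coherent HT links to two neighbors only.
HT_LINKS = {0: (1, 2), 1: (0, 3), 2: (0, 3), 3: (1, 2)}

def snoop_hops(src, dst):
    """Breadth-first search for the shortest HT path between two sockets."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for peer in HT_LINKS[node]:
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, dist + 1))

# Snoop completion is gated by the farthest peer:
print(max(snoop_hops(0, peer) for peer in (1, 2, 3)))  # 2
```

This is why the slide stresses that the famous low local-read latency is not the whole story: the read cannot retire until the two-hop diagonal snoop has answered.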
  22. AMD Architecture – Local Memory Access (continued)
     - Only now, with the read complete and all snoop responses in, can execution proceed!
     [Diagram: same four-Opteron topology]
  23. Processor Futures
     - Both AMD and Intel have significant processor architecture changes happening soon
     - AMD – next-generation processors:
       - Rev F (dual core)
       - Barcelona (quad core)
     - Intel – Core micro-architecture processors:
       - Woodcrest (dual core)
       - Clovertown (quad core)
     - Intel Xeon MP:
       - Tulsa (dual core)
     - Intel MP based on the Core micro-architecture:
       - Tigerton (quad core)
  24. Opteron Next-Gen Processors Add Faster DDR2 Memory
     [Diagram: Opterons with DDR2 memory controllers, moving from Rev E 266/333 MHz DDR1 to 400/533 MHz DDR2, Rev E 400 MHz DDR1 to 667 MHz DDR2, and Rev E 400 MHz DDR1 to 800 MHz DDR2]
  25. Opteron Dual-Core Design
     - Cache-to-cache data sharing is done through the crossbar switch
     [Diagram: AMD Opteron™ architecture – CPU0 and CPU1, each with a 1 MB L2 cache, behind a System Request Interface and crossbar switch connecting the memory controller and the HT0–HT2 links]
  26. AMD Opteron Quad-Core Design: Barcelona
     - The quad-core design adds a shared L3 cache
     [Diagram: AMD Opteron™ architecture – CPU0–CPU3, each with a private L2 cache, sharing an L3 cache behind the System Request Interface, crossbar switch, memory controller, and HT0–HT2 links]
  27. Xeon 5100 Series (Woodcrest) DP Architecture
     [Diagram omitted. Source: Intel public data]
  28. Wide Dynamic Execution
     - The Core microarchitecture executes 4 instructions per clock cycle, compared to 3 instructions per cycle for NetBurst
     From: http://www.intel.com/technology/architecture/coremicro/#anchor2
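The issue-width difference sets only a peak-IPC ceiling; realized IPC depends on the workload. At equal clock the ratio is what matters (the 3.0 GHz figure below is chosen purely for illustration, not quoted from the deck):

```python
def peak_instructions_per_sec(issue_width, clock_ghz):
    """Theoretical ceiling: instructions issued per cycle times cycles per second."""
    return issue_width * clock_ghz * 1e9

netburst = peak_instructions_per_sec(3, 3.0)   # NetBurst: 3-wide issue
core_ua = peak_instructions_per_sec(4, 3.0)    # Core microarchitecture: 4-wide issue
print(f"peak ratio: {core_ua / netburst:.2f}x")  # peak ratio: 1.33x
```

In other words, going from 3-wide to 4-wide buys at most a third more throughput per clock; the rest of the Core microarchitecture's gains come from keeping those issue slots fed.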
  29. Xeon vs. Core™ Dual-Core Design
     - In the Intel Xeon dual-core architecture, CPU0 and CPU1 each have a 2 MB L2 cache, and cache-to-cache data sharing goes through the bus interface (slow)
     - In the Intel Core™ architecture, CPU0 and CPU1 share a 4 MB cache, so cache-to-cache sharing happens inside the shared cache
     - In the Xeon 5100 Series (Woodcrest), the L2 cache is shared dynamically: one core can use all of it if it needs to, or it can be split equally
  30. Intel Core™ Quad-Core Design
     - Clovertown is basically two Woodcrest dice packaged together as a multi-chip module (MCM)
     - The MCM approach allows an easy transition and better yields than a monolithic die
     - MCMs must use the FSB interface for cache-to-cache communication, so MCM-to-MCM data sharing goes through the bus interface (slow)
     [Diagram: two 4 MB shared caches, each serving two cores (CPU0/CPU1 and CPU2/CPU3), joined over the front-side bus]
  31. Intel Caneland MP Platform
  32. X3 to X4 – Architectural Improvements
     - The quad-FSB architecture delivers increased memory bandwidth: 6.4 GB/s → 10.4 GB/s (1.6x), 21.3 GB/s → 42.6 GB/s (2x), and 10.6 GB/s → 34.1 GB/s (3x)
     - IBM bus technology provides optimal memory read and write bandwidth
     - Increased scalability-port frequency for higher scalable bandwidth
     - Lower loaded latency across the board
  33. 2-Socket Product Positioning Today – AMD Dual Core and Intel Quad Core
     [Positioning chart (axes: core processing vs. memory bandwidth/capacity) comparing INTEL Clovertown against AMD Next Gen (Rev F) across workloads: integer processing, HPC and BPC (core intensive and bandwidth intensive), web-serving, Java, database, collaboration, virtualization, file and print, EDA, video streaming, media encode/decode, large-memory-set workloads, data mining, SAP]
  34. 2-Socket Product Positioning in 3Q07 – AMD Quad Core and Intel Quad Core
     [Same positioning chart, with AMD represented by the Rev F quad core instead of the dual core, across the same workload list]
  35. Capacity Planning Aspects in x86 Environments – Thank you! [email_address]
