QPACE
QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)

Heiko J Schick – IBM Deutschland R&D GmbH
November 2010

Agenda

▪ Chapter 1: Overview
▪ Chapter 2: Application optimized supercomputers
▪ Chapter 3: QPACE
▪ Chapter 4: Review and Summary
▪ Chapter 5: Unforgettable Impressions ;-)

Chapter 1: Overview

Building Blocks of Matter

▪ QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
▪ Quarks are the constituents of matter, which interact strongly by exchanging gluons.
▪ Particular phenomena
  – Confinement
  – Asymptotic freedom (Nobel Prize 2004)
▪ Theory of strong interactions = Quantum Chromodynamics (QCD)

Chapter 1: Overview

Computing Resource Requests

▪ The lattice QCD community aims for O(1−3) sustained PFlop/s beyond 2010.
▪ Europe
  – “The computational requirements voiced by these European groups sum up to more than 1 sustained Petaflop/s by 2009.” [HPC in Europe Taskforce (HET), 2006]
▪ US (USQCD)
  – Hope for O(1) sustained PFlop/s in 2010-11. “A goal with very substantial scientific rewards.” [USQCD SciDAC-2 proposal, 2006]
▪ Similar requests from Japan.

Chapter 2: Application optimized supercomputers

Performance Critical Kernels

▪ The overall performance of lattice QCD simulations is dominated by a few kernels:
  – Linear algebra
    • Single-processor operations
    • Typically memory-bandwidth limited
  – Global reductions
    • Typically limited by network latency (d-dimensional torus network)
  – Sparse matrix-vector multiplication

Chapter 2: Application optimized supercomputers

Relevant Performance Signatures

▪ Arithmetic operations
  – Floating-point arithmetic with complex operands
  – Dominant operation: a × b + c
▪ Memory operations
  – High data re-use
  – Access pattern:
    • Random, small blocks (optimize for cache)
    • 3 streams, large blocks (vector-like architectures)
▪ Flow control
  – Simple / predictable
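
To make the operation count concrete, here is a minimal C sketch (not QPACE code) of the dominant operation a × b + c on complex operands, written out over real and imaginary parts. On SIMD units without native complex support (such as the SPEs), the re/im interleaving additionally costs shuffle instructions that do no arithmetic.

    #include <stdio.h>

    /* One complex multiply-add c + a*b, written out over real and imaginary
     * parts: 4 multiplications and 4 additions/subtractions = 8 FLOPs
     * (6 for the complex multiply, 2 for the complex add). */
    typedef struct { double re, im; } cplx;

    static cplx cmadd(cplx a, cplx b, cplx c)
    {
        cplx r;
        r.re = c.re + a.re * b.re - a.im * b.im;
        r.im = c.im + a.re * b.im + a.im * b.re;
        return r;
    }

    int main(void)
    {
        cplx a = {1.0, 2.0}, b = {3.0, -1.0}, c = {0.5, 0.5};
        cplx r = cmadd(a, b, c);
        printf("r = %g + %gi\n", r.re, r.im);   /* r = 5.5 + 5.5i */
        return 0;
    }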

Chapter 2: Application optimized supercomputers

Parallelization

▪ Parallelization strategy
  – Spatial domain decomposition: the simulation domain is partitioned into small 3-d sub-domains, and one sub-domain is assigned to each processor.
▪ Nearest-neighbour communication
  – 3-4 dimensional torus
▪ Homogeneous communication patterns
▪ Large bandwidth
▪ Access pattern (see the sketch below)
  – Medium-size messages = O(10) kBytes (large local problem size)
  – Small messages = O(0.1) kBytes (small local problem size)
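
A small, hypothetical C sketch of how the message sizes above arise: for a local l^4 sub-lattice, one nearest-neighbour face holds l^3 sites, and the payload per boundary site sets the message size. The 96 bytes per site (a projected half-spinor) is an assumption for illustration; the real payload depends on the solver.

    #include <stdio.h>

    /* Rough illustration (not the QPACE communication code): with spatial
     * domain decomposition, each node exchanges the boundary faces of its
     * local l^4 sub-lattice with its nearest neighbours. */
    static double face_message_kb(int l, int bytes_per_site)
    {
        long face_sites = (long)l * l * l;        /* one face of an l^4 block */
        return face_sites * (double)bytes_per_site / 1024.0;
    }

    int main(void)
    {
        printf("l = 4 : %.1f kB per face\n", face_message_kb(4, 96)); /* ~6 kB   */
        printf("l = 2 : %.2f kB per face\n", face_message_kb(2, 96)); /* ~0.8 kB */
        return 0;
    }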

Chapter 2: Application optimized supercomputers

Performance Signature: caxpy

▪ Multiply a vector X by a scalar, add to a vector Y, and store the result in Y.
▪ Task: y_i ← a · x_i + y_i, where a is a complex scalar and x_i, y_i are complex 3x4 matrices.
▪ Operations per i: 96 FLOPS
▪ Information transfer between storage and register file (front-end to processing device):
  – Load: x_i and y_i = 48 8-byte words
  – Store: y_i = 24 8-byte words
▪ Balance: 96 / 72 ≈ 1.3 FLOPS / word
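
A plain C sketch of the caxpy signature (not the SPE implementation), with the FLOP and data-traffic counts from above noted in the comments.

    #include <complex.h>
    #include <stdio.h>

    /* caxpy: y_i <- a*x_i + y_i, where each x_i, y_i is a complex 3x4
     * matrix (12 complex numbers).  Per i: 12 complex multiply-adds =
     * 12*8 = 96 FLOPs; traffic is 24 complex loads (48 doubles) and
     * 12 complex stores (24 doubles), i.e. 96/72 ~ 1.3 FLOPs per word. */
    static void caxpy(int n, double complex a,
                      double complex (*x)[12], double complex (*y)[12])
    {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < 12; k++)
                y[i][k] += a * x[i][k];
    }

    int main(void)
    {
        double complex x[2][12], y[2][12];
        for (int i = 0; i < 2; i++)
            for (int k = 0; k < 12; k++) { x[i][k] = 1.0 + 1.0*I; y[i][k] = k; }

        caxpy(2, 0.5*I, x, y);
        printf("y[0][3] = %g + %gi\n", creal(y[0][3]), cimag(y[0][3]));
        return 0;
    }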

Chapter 2: Application optimized supercomputers

Sustained Performance

▪ Bandwidth/throughput of a device: β_i
▪ Time needed to execute task i: T_i = I_i / β_i + λ_i, where I_i is the amount of processed data and λ_i the latency.
▪ Efficiency: ε = T_ideal / T_real
  – T_ideal = “ideal” execution time
  – T_real = “real” execution time
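
The model can be turned into a few lines of C. The device parameters below are made up for illustration and the sketch assumes no overlap between floating-point work and memory traffic.

    #include <stdio.h>

    /* Numerical sketch of the performance model above: each device i needs
     * T_i = I_i / beta_i + lambda_i to process its share I_i of a task;
     * the task finishes when the slowest device is done, and efficiency is
     * the ratio of ideal (FPU-only) to real time. */
    struct device { double beta; double lambda; };

    static double exec_time(struct device d, double work)
    {
        return work / d.beta + d.lambda;
    }

    int main(void)
    {
        struct device fpu = { 25.6e9, 0.0 };   /* FLOPs per second (assumed) */
        struct device mem = { 25.6e9, 1e-6 };  /* bytes per second (assumed) */

        double flops = 96e6, bytes = 72e6 * 8; /* e.g. 1e6 caxpy iterations  */
        double t_fpu = exec_time(fpu, flops);
        double t_mem = exec_time(mem, bytes);
        double t_real = t_fpu > t_mem ? t_fpu : t_mem;  /* no overlap assumed */

        printf("T_fpu = %.3g s, T_mem = %.3g s, efficiency = %.2f\n",
               t_fpu, t_mem, t_fpu / t_real);
        return 0;
    }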

Chapter 2: Application optimized supercomputers

Relevant Hardware Characteristics

▪ Floating-point unit throughput
  – Caveat: processor instruction set matching
    • No support for complex arithmetic (e.g. Cell/B.E.)
    • Additional shuffle operations needed
▪ Memory bandwidth
  – Multi-level memory hierarchy
    • External memory
    • Cache
    • Register file

Chapter 2: Application optimized supercomputers

Balanced Hardware

▪ Example caxpy:

  Processor        FPU throughput    Memory bandwidth   Balance
                   [FLOPS / cycle]   [words / cycle]    [FLOPS / word]
  apeNEXT          8                 2                  4
  QCDOC (MM)       2                 0.63               3.2
  QCDOC (LS)       2                 2                  1
  Xeon             2                 0.29               7
  GPU              128 x 2           17.3 (*)           14.8
  Cell/B.E. (MM)   8 x 4             1                  32
  Cell/B.E. (LS)   8 x 4             8 x 4              2
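
A back-of-the-envelope reading of the table: for a memory-bound kernel, the sustainable fraction of peak is bounded by the ratio of kernel balance to machine balance. The sketch below applies this to caxpy; it ignores latencies, caches and overlap, so it is only an upper-bound estimate.

    #include <stdio.h>

    /* Upper bound on the sustained fraction of peak for a memory-bound
     * kernel: min(1, kernel_balance / machine_balance), with balances in
     * FLOPs per 8-byte word as in the table above. */
    static double peak_fraction(double kernel_balance, double machine_balance)
    {
        double f = kernel_balance / machine_balance;
        return f > 1.0 ? 1.0 : f;
    }

    int main(void)
    {
        const double caxpy = 96.0 / 72.0;               /* ~1.3 FLOPs / word */
        printf("apeNEXT        : %4.1f%%\n", 100 * peak_fraction(caxpy, 4.0));
        printf("Cell/B.E. (MM) : %4.1f%%\n", 100 * peak_fraction(caxpy, 32.0));
        printf("Cell/B.E. (LS) : %4.1f%%\n", 100 * peak_fraction(caxpy, 2.0));
        return 0;
    }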

Chapter 2: Application optimized supercomputers

Cell/B.E. Architecture


Chapter 2: Application optimized supercomputers

Balanced Systems ?!?


Chapter 2: Application optimized supercomputers

… but are they Reliable, Available and Serviceable ?!?

Chapter 3: QPACE

Collaboration and Credits

▪ QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
▪ Academic partners
  – University Regensburg: S. Heybrock, D. Hierl, T. Maurer, N. Meyer, A. Nobile, A. Schaefer, S. Solbrig, T. Streuer, T. Wettig
  – University Wuppertal: Z. Fodor, A. Frommer, M. Huesken
  – University Ferrara: M. Pivanti, F. Schifano, R. Tripiccione
  – University Milano: H. Simma
  – DESY Zeuthen: D. Pleiter, K.-H. Sulanke, F. Winter
  – Research Lab Juelich: M. Drochner, N. Eicker, T. Lippert
▪ Industrial partner
  – IBM (DE, US, FR): H. Baier, H. Boettiger, A. Castellane, J.-F. Fauh, U. Fischer, G. Goldrian, C. Gomez, T. Huth, B. Krill, J. Lauritsen, J. McFadden, I. Ouda, M. Ries, H.J. Schick, J.-S. Vogt
▪ Main funding
  – DFG (SFB TR55), IBM
▪ Support by others
  – Eurotech (IT), Knuerr (DE), Xilinx (US)

Project Timetable

▪ 01/08       Official project start
▪ 06/08       Node card bring-up
▪ 10/08       Fully populated backplane
▪ 01/09       Hardware integration tests
▪ 02-03/09    Release to manufacturing
▪ 05/09       Integration of 1st rack
▪ 07/09       Deployment of 2 racks at JSC
▪ 08/09       Deployment of 4 racks at JSC and 4 racks at University Wuppertal complete

Production Chain

▪ Major steps
  – Pre-integration at University Regensburg
  – Integration at IBM / Boeblingen
  – Installation at FZ Juelich and University Wuppertal

Chapter 3: QPACE

Concept

▪ System
  – Node card with IBM® PowerXCell™ 8i processor and network processor (NWP)
    • Important feature: fast double-precision arithmetic
  – Commodity processors interconnected by a custom network
  – Custom system design
  – Liquid cooling system
▪ Rack parameters (see the check below)
  – 256 node cards
    • 26 TFLOPS peak (double precision)
    • 1 TB memory
  – O(35) kWatt power consumption
▪ Applications
  – Target sustained performance of 20-30%
  – Optimized for calculations in theoretical particle physics: simulation of Quantum Chromodynamics
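
A quick check of the rack numbers, assuming the commonly quoted PowerXCell 8i double-precision peak of 102.4 GFLOPS per node (8 SPEs × 4 FLOP/cycle × 3.2 GHz), which is not stated on the slide itself:

    % Worked check of the rack parameters above (per-node peak assumed)
    \begin{align*}
      P_{\text{node}} &= 8 \times 4\,\tfrac{\text{FLOP}}{\text{cycle}} \times 3.2\,\text{GHz} = 102.4\ \text{GFLOPS (DP)}\\
      P_{\text{rack}} &= 256 \times 102.4\ \text{GFLOPS} \approx 26\ \text{TFLOPS (DP)}\\
      M_{\text{rack}} &= 256 \times 4\ \text{GB} = 1\ \text{TB}
    \end{align*}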

Chapter 3: QPACE

Networks

▪ Torus network
  – Nearest-neighbor communication, 3-dimensional torus topology
  – Aggregate bandwidth of 6 GByte/s per node and direction
  – Remote DMA communication (local store to local store)
▪ Interrupt tree network
  – Evaluation of global conditions and synchronization
  – Global exceptions
  – 2 signals per direction
▪ Ethernet network
  – 1 Gigabit Ethernet link per node card to rack-level switches (switched network)
  – I/O to parallel file system (user input / output)
  – Linux network boot
  – Aim of O(10) GB/s bandwidth per rack

Chapter 3: QPACE

Rack components: root card (16 per rack), backplane (8 per rack), node card (256 per rack), power supply and power adapter card (24 per rack).

Chapter 3: QPACE

Node Card

▪ Components
  – IBM PowerXCell 8i processor, 3.2 GHz
  – 4 GByte DDR2 memory, 800 MHz, with ECC
  – Network processor (NWP): Xilinx Virtex-5 LX110T FPGA
  – Ethernet PHY
  – 6 x 1 GB/s external links using the PCI Express physical layer
  – Service processor (SP): Freescale MCF52211
  – Flash (firmware and FPGA configuration)
  – Power subsystem
  – Clocking
▪ Network processor interfaces
  – FlexIO interface to the PowerXCell 8i processor, 2 bytes wide with 3 GHz bit rate
  – Gigabit Ethernet
  – UART: firmware Linux console
  – UART: SP communication
  – SPI master (boot flash)
  – SPI slave (training and configuration)
  – GPIO

Chapter 3: QPACE

Node Card

Photo of the node card, showing the network processor (FPGA), network PHYs, PowerXCell 8i processor and memory.

Chapter 3: QPACE

Node Card

Block diagram: the PowerXCell 8i processor with four DDR2 800 MHz memory channels is attached via FlexIO (6 GB/s per direction) to the Virtex-5 LX110T FPGA (384 I/Os at 250 MHz used, 680 available). The FPGA drives the six 1 GB/s PHYs of the compute network, a Gigabit Ethernet PHY and the SPI flash, and connects via RS232, SPI and I2C to the Freescale MCF52211 service processor; the power subsystem and clocking complete the card.

Chapter 3: QPACE

Network Processor

Block diagram: six torus link PHY interfaces (x+, x-, …, z-), the FlexIO processor interface, Ethernet PHY and configuration interfaces, connected by the network logic (routing, arbitration, FIFOs, global signals, serial interfaces).

FPGA utilization: slices 92%, pins 86%, LUT-FF pairs 73%, flip-flops 55%, LUTs 53%, BRAM/FIFOs 35%. Per block (flip-flops / LUTs): processor interface 53% / 46%, torus 36% / 39%, SPI flash and Ethernet 4% / 2%.

Chapter 3: QPACE

Network Processor

Block diagram of the FPGA logic: the FlexIO/RocketIO, IOC (IOIF) and GBIF (slave/master) blocks were contributed by IBM; the network processor logic (switch / address decoder / FIFOs / bus controller handling receive and send requests towards the 6 x 1 GB/s links) was contributed by the academic partners.

Chapter 3: QPACE

Processor Bus Interface

▪ FlexIO interface
  – High-bandwidth interface between the IBM PowerXCell 8i processor and the Xilinx Virtex-5 FPGA
  – Implementation from Rambus Inc.
  – Optimized for intra-board environments
  – Uses RocketIO GTP transceiver features
  – Requires link training after power-on
    • Phase calibration (aligns the data for the optimal sampling point)
    • Parallel calibration (synchronizes the receive deserializer with the transmit serializer)
    • Levelization calibration (aligns all data lanes)
▪ Challenges
  – Speed, latency, bandwidth and timing (clock)
  – 3 GByte/s communication channel
  – 2-byte-wide link

Chapter 3: QPACE

Torus Network Physical Layer

▪ Physical layer
  – 10GbE PHYs @ 2.5 GHz → 1 GByte/s
▪ Eye diagram for a bad-case link
  – 3.125 GHz
  – 40 cm PCB, 50 cm cable
  – 1 PCB-PCB connector, 2 PCB-cable connectors
▪ Custom data link layer (see the sketch below)
  – Fixed-size messages
  – 128-byte payload + 4-byte header + 4-byte CRC → minimal protocol overhead
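
As a rough illustration of the fixed-size frame format, a hedged C sketch with the sizes quoted above; the header field layout and the CRC choice are invented for illustration and are not the actual QPACE link-layer definition.

    #include <stdint.h>
    #include <stdio.h>

    /* Fixed-size link-layer datagram: 4-byte header, 128-byte payload,
     * 4-byte CRC.  The split of fields inside the header is assumed. */
    struct torus_frame {
        uint32_t header;          /* e.g. virtual channel, sequence, address */
        uint8_t  payload[128];
        uint32_t crc;             /* CRC over header + payload (assumed)     */
    };

    int main(void)
    {
        printf("frame size: %zu bytes, overhead: %.1f%%\n",
               sizeof(struct torus_frame),
               100.0 * 8 / sizeof(struct torus_frame));
        return 0;
    }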

Torus Network Architecture

▪ 2-sided communication
  – Node A initiates send, node B initiates receive
  – Send and receive commands have to match
  – Multiple use of the same link via virtual channels
▪ Send / receive from / to local store or main memory
  – CPU → NWP
    • CPU moves data and control info to the NWP
    • Back-pressure controlled
  – NWP → NWP
    • Independent of the processor
    • Each datagram has to be acknowledged
  – NWP → CPU (see the sketch below)
    • CPU provides credits to the NWP
    • NWP writes data into the processor
    • Completion indicated by a notification
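
A toy C model of the credit-based NWP → CPU receive path described above (not the actual QPACE firmware): the CPU grants credits, the NWP consumes one credit per delivered 128-byte datagram, and a notification is raised when the expected data has arrived.

    #include <stdio.h>

    enum { CREDIT_BYTES = 128 };

    struct rx_channel {
        int credits;          /* receive buffers granted by the CPU        */
        int received;         /* datagrams delivered into processor memory */
        int expected;         /* datagrams that complete the receive       */
    };

    static void cpu_post_receive(struct rx_channel *ch, int datagrams)
    {
        ch->expected = datagrams;
        ch->credits += datagrams;          /* CPU provides credits to the NWP */
    }

    static int nwp_deliver(struct rx_channel *ch)
    {
        if (ch->credits == 0)
            return 0;                      /* back-pressure: no credit, stall */
        ch->credits--;
        ch->received++;                    /* NWP writes data into processor  */
        return ch->received == ch->expected;  /* completion -> notification   */
    }

    int main(void)
    {
        struct rx_channel ch = {0};
        cpu_post_receive(&ch, 4);          /* expect 4 x 128 B = 512 bytes    */

        for (int i = 0; i < 4; i++)
            if (nwp_deliver(&ch))
                printf("notification: %d bytes received\n",
                       ch.received * CREDIT_BYTES);
        return 0;
    }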

Chapter 3: QPACE

Torus Network Reconfiguration

▪ The torus network PHYs provide 2 interfaces
  – Used for network reconfiguration by selecting the primary or secondary interface
▪ Example
  – 1x8 or 2x4 node cards
▪ Partition sizes: (1,2,2N) x (1,2,4,8,16) x (1,2,4,8)
  – N ... number of racks connected via cables

Chapter 3: QPACE

Cooling

▪ Concept
  – Node card mounted in a housing = heat conductor
  – Housing connected to a liquid-cooled cold plate
  – Critical thermal interfaces
    • Processor – thermal box
    • Thermal box – cold plate
  – Dry connection between node card and cooling circuit
▪ Node card housing
  – The closed node card housing acts as a heat conductor.
  – The heat conductor is linked to a liquid-cooled “cold plate”.
  – A cold plate is placed between two rows of node cards.
▪ Simulation results for one cold plate
  – Ambient 12°C
  – Water 10 L/min
  – Load 4224 Watt (2112 Watt per side)

Chapter 3: QPACE

Power Efficiency

Chapter 4: Review and Summary

Project Review

▪ Hardware design
  – Almost all critical problems solved in time
  – Network processor implementation still a challenge
  – No serious problems due to wrong design decisions
▪ Hardware status
  – Manufacturing quality good: small bone pile, few defects during operation
▪ Time schedule
  – Essentially stayed within the planned schedule
  – Implementation of system / application software delayed

Chapter 4: Review and Summary

Summary

▪ QPACE is a new, scalable LQCD machine based on the PowerXCell 8i processor.
▪ Design highlights
  – FPGA directly attached to the processor
  – LQCD-optimized, low-latency torus network
  – Novel, cost-efficient liquid cooling system
  – High packaging density
  – Very power-efficient architecture
▪ O(20-30%) sustained performance for key LQCD kernels is reached / feasible
  → O(10-16) TFLOPS / rack (SP)

Chapter 5: Unforgettable Impressions ;-)

(Photo slides.)

Thank you very much for your attention.

Disclaimer

▪ IBM®, DB2®, MVS/ESA, AIX®, S/390®, AS/400®, OS/390®, OS/400®, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere®, Netfinity®, Tivoli®, Informix and Informix® Dynamic Server™, IBM, BladeCenter and POWER and others are trademarks of the IBM Corporation in the US and/or other countries.
▪ Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
▪ Other company, product, or service names may be trademarks or service marks of others. The information and materials are provided on an "as is" basis and are subject to change.

© 2009 IBM Corporation
