Heiko J Schick – IBM Deutschland R&D GmbH
November 2010




QPACE
QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)




                                                                   © 2009 IBM Corporation
Agenda



 Chapter 1: Overview


 Chapter 2: Application optimized supercomputers


 Chapter 3: QPACE


 Chapter 4: Review and Summary


 Chapter 5: Unforgettable Impressions ;-)




Chapter 1: Overview


Building Blocks of Matter




 QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


 Quarks are the constituents of matter; they interact strongly by exchanging gluons.


 Particular phenomena
     – Confinement
     – Asymptotic freedom   (Nobel Prize 2004)



 Theory of strong interactions = Quantum Chromodynamics (QCD)



Computing Resource Requests



 Lattice QCD community aims for O(1−3) PFlop/s sustained beyond 2010.


 Europe
   – “The computational requirements voiced by these European groups sum up to more than
     1 sustained Petaflop/s by 2009.” [HPC in Europe Taskforce (HET), 2006]


 US (USQCD)
   – Hope for O(1) PFlop/s sustained in 2010-11. “A goal with very substantial scientific
     rewards.” [USQCD SciDAC-2 proposal, 2006]


 Similar requests from Japan.




Chapter 2: Application optimized supercomputers


Performance Critical Kernels



 Overall performance of lattice QCD simulations dominated by a few kernels:

     – Linear algebra
         • Single processor operations
         • Typically memory bandwidth limited

     – Global reductions
        • Typically limited by network latency:
        • d-dimensional torus network:




     – Sparse matrix-vector multiplication






Relevant Performance Signatures



 Arithmetic operations
   – Floating-point arithmetic with complex operands
   – Dominant operation a × b + c


 Memory operations
   – High data re-use
   – Access pattern:
       • Random, small blocks (optimize for cache)
       • 3 streams, large blocks (vector-like architectures)


 Flow control
    – Simple / predictable






Parallelization



 Parallelization strategy
   – Spatial domain decomposition: the simulation domain is partitioned into small 3D
     sub-domains, one of which is assigned to each processor.


 Nearest neighbour communication
   – 3-4 dimensional torus


 Homogeneous communication patterns


 Large bandwidth


 Access pattern
   – Medium size messages = O(10) kBytes          (large local problem size)
   – Small messages = O(0.1) kBytes               (small local problem size)






Performance Signature: caxpy



 Multiply a vector x by a scalar, add it to a vector y, and store the result in y.


 Task: $y_i \leftarrow a\,x_i + y_i$

    where $a$ is a complex scalar
    and $x_i$, $y_i$ are complex 3x4 matrices

 Operations per i: $12 \times 8 = 96$ FLOPS


 Information transfer between storage and register file (front-end to processing device):

     – Load: $2 \times 12 \times 2 = 48$ 8-byte words

     – Store: $12 \times 2 = 24$ 8-byte words

 Balance: $96 / (48 + 24) = 1.3$ FLOPS / word

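The counts on the caxpy slide can be checked with a few lines of arithmetic. The sketch below assumes the update has the form y_i = a * x_i + y_i, that a complex 3x4 matrix holds 12 complex numbers stored as two 8-byte words each, that a complex multiply-add costs 8 real floating-point operations, and that the scalar stays in a register. The function name `caxpy_balance` is illustrative and not taken from the slides.

```python
# Sketch of the caxpy performance signature; all names are illustrative.
COMPLEX_ELEMENTS = 3 * 4        # entries in one complex 3x4 matrix
WORDS_PER_COMPLEX = 2           # real + imaginary part, one 8-byte word each
FLOPS_PER_COMPLEX_FMA = 8       # complex multiply (6) + complex add (2)


def caxpy_balance():
    """Return (flops, loaded words, stored words, flops per word) per site i."""
    flops = COMPLEX_ELEMENTS * FLOPS_PER_COMPLEX_FMA
    loads = 2 * COMPLEX_ELEMENTS * WORDS_PER_COMPLEX   # read x_i and y_i
    stores = COMPLEX_ELEMENTS * WORDS_PER_COMPLEX      # write y_i back
    return flops, loads, stores, flops / (loads + stores)


flops, loads, stores, balance = caxpy_balance()
print(f"{flops} FLOPS, {loads} words loaded, {stores} words stored")
print(f"balance = {balance:.2f} FLOPS / word")
```

Under these assumptions the counts reproduce the 96 FLOPS, 48 loaded words and 24 stored words quoted on the slide.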


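Sustained Performance



 Bandwidth/throughput of a device: $\beta_i$


 Time $T_i$ needed to execute task i: $T_i \approx I_i / \beta_i + \lambda_i$

    where $I_i$ is the amount of processed data
    and $\lambda_i$ is the latency

 Efficiency is the ratio of:


     – “Ideal” execution time


     – “Real” execution time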
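A minimal sketch of this kind of time model follows. It assumes that the time for a task is the amount of data divided by the device throughput plus a latency term, and that devices overlap perfectly so the slowest one sets the real execution time. Both are modelling assumptions rather than statements from the slide, and the `Device` class and function names are hypothetical. The example numbers reuse the caxpy counts (96 FLOPS, 72 words) and the Cell/B.E. (MM) row of the Balanced Hardware table.

```python
# Illustrative time model: T_i = I_i / beta_i + lambda_i for each device i.
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    bandwidth: float   # throughput beta_i, in units per cycle
    latency: float     # start-up latency lambda_i, in cycles


def task_time(device, amount):
    """Cycles needed to push `amount` units through `device`."""
    return amount / device.bandwidth + device.latency


def efficiency(ideal_time, real_time):
    """Ratio of ideal to real execution time (1.0 means no loss)."""
    return ideal_time / real_time


# Hypothetical example: caxpy on one site, latencies neglected.
fpu = Device("FPU", bandwidth=32.0, latency=0.0)
memory = Device("memory", bandwidth=1.0, latency=0.0)

t_fpu = task_time(fpu, 96)
t_mem = task_time(memory, 72)
# Assumption: the devices overlap perfectly, so the slowest one sets the time.
t_real = max(t_fpu, t_mem)
print(f"FPU {t_fpu} cycles, memory {t_mem} cycles, "
      f"efficiency {efficiency(t_fpu, t_real):.3f}")
```

With these inputs the memory transfer dominates, which is the situation the balance argument on the neighbouring slides is about.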






Relevant Hardware Characteristics



 Floating point unit throughput:

     – Caveat: Processor instruction set matching
        • No support for complex arithmetic (e.g. Cell/B.E.)
        • Additional shuffle operations needed.


 Memory bandwidth:

     – Multi-level memory hierarchy
        • External memory
        • Cache
        • Register file






Balanced Hardware



 Example caxpy:




 Processor            FPU throughput     Memory bandwidth     Balance
                      [FLOPS / cycle]    [words / cycle]      [FLOPS / word]

 apeNEXT                     8                  2                  4
 QCDOC (MM)                  2                0.63                3.2
 QCDOC (LS)                  2                  2                  1
 Xeon                        2                0.29                 7
 GPU                      128 x 2            17.3 (*)            14.8
 Cell/B.E. (MM)            8 x 4                1                 32
 Cell/B.E. (LS)            8 x 4              8 x 4                2


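One way to read the table is to compare each hardware balance with the caxpy balance of 96 / (48 + 24) FLOPS per word. The sketch below uses a roofline-style bound, which is an assumption added here and not something stated on the slides: a kernel can sustain at most min(1, kernel balance / hardware balance) of the peak floating-point rate. The hardware values are copied from the last column of the table; the function name `peak_fraction` is illustrative.

```python
# Roofline-style estimate (an assumption, not taken from the slides):
# a kernel with balance b_k on hardware with balance b_hw can sustain at most
# min(1, b_k / b_hw) of the peak floating-point rate.
CAXPY_BALANCE = 96 / (48 + 24)   # FLOPS per word, from the caxpy slide

# Hardware balance in FLOPS / word, copied from the table.
HARDWARE_BALANCE = {
    "apeNEXT": 4.0,
    "QCDOC (MM)": 3.2,
    "QCDOC (LS)": 1.0,
    "Xeon": 7.0,
    "GPU": 14.8,
    "Cell/B.E. (MM)": 32.0,
    "Cell/B.E. (LS)": 2.0,
}


def peak_fraction(kernel_balance, hardware_balance):
    """Upper bound on the fraction of peak FLOPS the kernel can reach."""
    return min(1.0, kernel_balance / hardware_balance)


for name, hw in HARDWARE_BALANCE.items():
    print(f"{name:15s} {peak_fraction(CAXPY_BALANCE, hw):6.1%}")
```

The bound ignores latency and instruction-set effects such as the extra shuffle operations mentioned earlier, so real efficiencies can be lower.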


Cell/B.E. Architecture






Balanced Systems ?!?






… but are they Reliable, Available and Serviceable ?!?




Chapter 3: QPACE


Collaboration and Credits



 QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)


 Academic Partners
     –   University Regensburg       S. Heybrock, D. Hierl, T. Maurer, N. Meyer, A. Nobile, A. Schaefer, S. Solbrig, T. Streuer, T. Wettig
     –   University Wuppertal        Z. Fodor, A. Frommer, M. Huesken
     –   University Ferrara          M. Pivanti, F. Schifano, R. Tripiccione
     –   University Milano           H. Simma
     –   DESY Zeuthen                D. Pleiter, K.-H. Sulanke, F. Winter
     –   Research Lab Juelich        M. Drochner, N. Eicker, T. Lippert

 Industrial Partner
     – IBM   (DE, US, FR)            H. Baier, H. Boettiger, A. Castellane, J.-F. Fauh, U. Fischer, G. Goldrian, C. Gomez, T. Huth, B. Krill,
                                     J. Lauritsen, J. McFadden, I. Ouda, M. Ries, H.J. Schick, J.-S. Vogt

 Main Funding
   – DFG (SFB TR55), IBM
 Support by Others
     – Eurotech (IT), Knuerr (DE), Xilinx (US)


Project Timetable



 01/08       Official project start
 06/08       Node card bring-up
 10/08       Fully populated backplane
 01/09       Hardware integration tests
 02-03/09    Release to manufacturing
 05/09       Integration of 1st rack
 07/09       Deployment of 2 racks at JSC
 08/09       Deployment of 4 racks at JSC and
              4 racks at University Wuppertal complete




Production Chain




Major steps
     – Pre-integration at University Regensburg
     – Integration at IBM / Boeblingen
     – Installation at FZ Juelich and University Wuppertal






Concept



 System
     – Node card with IBM® PowerXCell™ 8i processor and network processor (NWP)
         • Important feature: fast double-precision arithmetic
     – Commodity processors interconnected by a custom network
     – Custom system design
     – Liquid cooling system


 Rack parameters
     – 256 node cards
         • 26 TFLOPS peak (double precision)
         • 1 TB Memory
     – O(35) kWatt power consumption


 Applications
     – Target sustained performance of 20-30%
     – Optimized for calculations in theoretical particle physics:
       Simulation of Quantum Chromodynamics
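The rack parameters can be cross-checked against per-node figures given elsewhere in the deck. The sketch below assumes that the "8x4" FPU throughput in the Balanced Hardware table means 8 SPEs with 4 double-precision FLOPS per cycle each, and it uses the 3.2 GHz clock and 4 GB of memory listed for the node card. The function names are illustrative.

```python
# Rack-level arithmetic; per-node inputs are taken from other slides.
NODE_CARDS_PER_RACK = 256
CLOCK_GHZ = 3.2                 # PowerXCell 8i clock (node card slide)
DP_FLOPS_PER_CYCLE = 8 * 4      # "8x4" entry in the Balanced Hardware table
MEMORY_GB_PER_NODE = 4          # DDR2 per node card


def rack_peak_tflops():
    """Peak double-precision TFLOPS of one rack."""
    node_gflops = CLOCK_GHZ * DP_FLOPS_PER_CYCLE
    return NODE_CARDS_PER_RACK * node_gflops / 1000.0


def rack_memory_tb():
    """Total memory of one rack in TB (1 TB = 1024 GB here)."""
    return NODE_CARDS_PER_RACK * MEMORY_GB_PER_NODE / 1024.0


peak = rack_peak_tflops()
print(f"peak: {peak:.1f} TFLOPS, memory: {rack_memory_tb():.0f} TB")
for fraction in (0.20, 0.30):
    print(f"{fraction:.0%} sustained: {fraction * peak:.1f} TFLOPS")
```

Under these assumptions the peak comes out close to the 26 TFLOPS quoted for a rack, and the memory total matches the 1 TB figure.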






Networks



 Torus network
     – Nearest-neighbor communication, 3-dimensional torus topology
     – Aggregate bandwidth 6 GByte/s per node and direction
     – Remote DMA communication (local store to local store)



 Interrupt tree network
     – Evaluation of global conditions and synchronization
     – Global Exceptions
     – 2 signals per direction



 Ethernet network
     –   1 Gigabit Ethernet link per node card to rack-level switches (switched network)
     –   I/O to parallel file system (user input / output)
     –   Linux network boot
     –   Aim of O(10) GB/s bandwidth per rack








[Figure: rack with its components]
     – Root card (16 per rack)
     – Backplane (8 per rack)
     – Node card (256 per rack)
     – Power supply and power adapter card (24 per rack)




Node Card



 Components
     –   IBM PowerXCell 8i processor, 3.2 GHz
     –   4 Gigabyte DDR2 memory, 800 MHz, with ECC
     –   Network processor (NWP): Xilinx Virtex-5 LX110T FPGA
     –   Ethernet PHY
     –   6 x 1 GB/s external links using PCI Express physical layer
     –   Service processor (SP): Freescale MCF52211
     –   FLASH (firmware and FPGA configuration)
     –   Power subsystem
     –   Clocking


 Network Processor
     –   FLEXIO interface to PowerXCell 8i processor, 2 bytes wide with 3 GHz bit rate
     –   Gigabit Ethernet
     –   UART FW Linux console
     –   UART SP communication
     –   SPI Master (boot flash)
     –   SPI Slave for training and configuration
     –   GPIO




Node Card

[Photo of the node card. Labelled components: memory, PowerXCell 8i processor, network processor (FPGA), network PHYs.]






Node Card

[Block diagram of the node card. Recoverable labels:]
     – Memory: 4 x DDR2, 800 MHz
     – Processor: PowerXCell 8i, RW (debug)
     – FLEXIO: 2 x 6 GB/s between PowerXCell 8i and FPGA
     – FPGA Virtex-5: RS232, GigE PHY, 6 x 1 GB/s PHY to the compute network
     – FPGA I/O: 384 IO @ 250 MHz (4*8*2*6 = 384 IO; 680 available on the LX110T)
     – SP Freescale MCF52211: I2C, SPI, UART
     – Flash: SPI
     – Power subsystem, clocking






Network Processor

[Block diagram of the network processor: FlexIO interface; network logic (routing, arbitration, FIFOs, configuration); link interfaces with PHYs for x+, x-, ..., z-; Ethernet interface with PHY; global signals; serial interfaces; SPI flash.]

 FPGA resource utilization

     Slices            92 %
     PINs              86 %
     LUT-FF pairs      73 %
     Flip-Flops        55 %
     LUTs              53 %
     BRAM / FIFOs      35 %

 Utilization by function      Flip-Flops      LUTs

     Processor Interface         53 %         46 %
     Torus                       36 %         39 %
     Ethernet                     4 %          2 %





Network Processor

[Block diagram of the network processor logic, from top to bottom: FlexIO; RocketIO; IOC (IOIF); GBIF with slave path (receive requests) and master path (send requests); switch / address decoder / FIFOs / bus controller; 6 x 1 GB/s links.]

 IBM:
     • RocketIO Logic
     • IOC Logic
     • GBIF Logic

 Academic Partners:
     • Network Processor Logic






Processor Bus Interface



 FlexIO Interface
     –   High-bandwidth interface between IBM PowerXCell 8i processor and Xilinx Virtex-5 FPGA
     –   Implementation from Rambus Inc.
     –   Optimized for intra-board environments
     –   Uses RocketIO GTP transceiver features
     –   Requires link training after power-on
           • Phase calibration                 (aligns the data for optimal sampling point)
           • Parallel calibration              (synchronizes the receive deserializer with the transmit serializer)
           • Levelization calibration          (aligns all data lanes)



 Challenges
   – Speed, latency, bandwidth and timing (clock)
     – 3 GByte/s communication channel
     – 2-byte-wide link






Torus Network Physical Layer



 Physical layer
     – 10GbE @ 2.5 GHz → 1 GByte/s



 Eye diagram for bad case link
     – 3.125 GHz
     – 40 cm PCB, 50 cm cable,
     – 1 PCB-PCB, 2 PCB-cable connectors



 Custom data link layer
     – Fixed size messages
     – 128 Byte payload + 4 Byte header + 4 Byte CRC
       → Minimal protocol overhead
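The link rate and the protocol overhead can be worked out from the numbers on this slide. The sketch below assumes a XAUI-style 10GbE physical layer with four lanes and 8b/10b line coding, which the slide does not state explicitly, and it ignores acknowledgement traffic. The function names are illustrative.

```python
# Torus link arithmetic; the lane count and line code are assumptions.
LANES = 4                  # assumed: 10GbE XAUI-style PHY with four lanes
LANE_RATE_GBIT = 2.5       # per-lane signalling rate from the slide
ENCODING = 8 / 10          # assumed 8b/10b line code

PAYLOAD_BYTES = 128
HEADER_BYTES = 4
CRC_BYTES = 4


def link_bandwidth_gbyte():
    """Raw link bandwidth in GByte/s after line-code overhead."""
    return LANES * LANE_RATE_GBIT * ENCODING / 8


def payload_efficiency():
    """Fraction of each packet that carries payload."""
    return PAYLOAD_BYTES / (PAYLOAD_BYTES + HEADER_BYTES + CRC_BYTES)


raw = link_bandwidth_gbyte()
eff = payload_efficiency()
print(f"raw link bandwidth: {raw:.2f} GByte/s")
print(f"payload efficiency: {eff:.1%}")
print(f"payload bandwidth:  {raw * eff:.2f} GByte/s")
```

With a 128-byte payload and 8 bytes of header and CRC, about 94% of each packet is payload, which is the sense in which the overhead is minimal.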




Torus Network Architecture



 2-sided communication
     – Node A initiates send, node B initiates receive
     – Send and receive commands have to match
     – Multiple use of same link by virtual channels


 Send / receive from / to local store or main memory
     – CPU → NWP
        • CPU moves data and control info to NWP
        • Back-pressure controlled

     – NWP → NWP
        • Independent of processor
        • Each datagram has to be acknowledged

     – NWP → CPU
        • CPU provides credits to NWP
        • NWP writes data into processor
        • Completion indicated by notification
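The receive side (NWP → CPU) can be illustrated with a toy model of credit-based delivery. This is a sketch under simplifying assumptions, not the actual network processor logic: the class `ReceivePort`, its methods, and the choice to queue datagrams inside the model while no credit is available are all hypothetical.

```python
# Toy model of credit-based receive; class and method names are hypothetical.
from collections import deque


class ReceivePort:
    """NWP-side receive logic for one virtual channel."""

    def __init__(self):
        self.credits = 0          # buffers the CPU has made available
        self.pending = deque()    # datagrams waiting for a credit
        self.delivered = []       # data written into processor memory
        self.notifications = 0    # completion notices seen by the CPU

    def provide_credits(self, count):
        """CPU side: announce `count` free receive buffers."""
        self.credits += count
        self._drain()

    def datagram_arrived(self, payload):
        """Link side: a datagram arrived from the neighbouring NWP."""
        self.pending.append(payload)
        self._drain()

    def _drain(self):
        # Deliver queued datagrams while credits remain.
        while self.credits > 0 and self.pending:
            self.delivered.append(self.pending.popleft())
            self.credits -= 1
            self.notifications += 1


port = ReceivePort()
port.datagram_arrived(b"a")      # no credit yet: held in the model's queue
port.provide_credits(2)          # CPU posts two buffers, "a" is delivered
port.datagram_arrived(b"b")      # uses the remaining credit
port.datagram_arrived(b"c")      # must wait for more credits
print(port.delivered, port.credits, len(port.pending))
```

The point of the credit scheme is that the NWP never writes into the processor unless the CPU has announced a buffer for it, so the receiver controls the pace.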





Torus Network Reconfiguration



 Torus network PHYs provide 2 interfaces
     – Used for network reconfiguration by selecting primary or secondary interface



 Example
     – 1x8 or 2x4 node-cards




 Partition sizes (1,2,2N) * (1,2,4,8,16) * (1,2,4,8)
     – N ... number of racks connected via cables
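The partition rule can be enumerated directly. The sketch below reads the formula as three independent choices, one per torus dimension, with N the number of cabled racks; that reading and the function name `partition_sizes` are assumptions made for illustration.

```python
# Enumerate partition shapes (1,2,2N) * (1,2,4,8,16) * (1,2,4,8).
from itertools import product


def partition_sizes(racks):
    """Return sorted (x, y, z) partition shapes for `racks` cabled racks."""
    x_options = sorted({1, 2, 2 * racks})
    y_options = (1, 2, 4, 8, 16)
    z_options = (1, 2, 4, 8)
    return sorted(product(x_options, y_options, z_options))


shapes = partition_sizes(racks=4)
largest = max(shapes, key=lambda s: s[0] * s[1] * s[2])
print(f"{len(shapes)} shapes, largest {largest} "
      f"= {largest[0] * largest[1] * largest[2]} node cards")
```

For a single rack the largest shape is 2 x 16 x 8 = 256, which matches the 256 node cards per rack given earlier.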



Cooling



 Concept
     – Node card mounted in housing = heat conductor
     – Housing connected to liquid cooled cold plate
     – Critical thermal interfaces
         • Processor – thermal box
         • Thermal box – cold plate
     – Dry connection between node card and cooling circuit



 Node card housing
     – Closed node card housing acts as heat conductor.
     – Heat conductor is linked with liquid-cooled “cold plate”
     – Cold Plate is placed between two rows of node cards.


 Simulation Results for one Cold Plate
     – Ambient          12°C
     – Water            10 L / min
     – Load             4224 Watt
                        2112 Watt / side
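The simulation parameters imply a coolant temperature rise that can be estimated from a steady-state energy balance. The sketch below assumes that the full 4224 W load ends up in the water and uses textbook values for the density and specific heat of water; neither assumption comes from the slide, and the function name is illustrative.

```python
# Steady-state energy balance; water properties are textbook approximations.
WATER_DENSITY_KG_PER_L = 1.0      # assumed
WATER_CP_J_PER_KG_K = 4186.0      # assumed specific heat capacity


def coolant_temperature_rise(load_watt, flow_l_per_min):
    """Temperature rise in kelvin of water absorbing `load_watt`."""
    mass_flow_kg_per_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    return load_watt / (mass_flow_kg_per_s * WATER_CP_J_PER_KG_K)


rise = coolant_temperature_rise(load_watt=4224.0, flow_l_per_min=10.0)
print(f"water heats up by about {rise:.1f} K across the cold plate")
```

This only bounds the water temperature; the temperature at the processor also depends on the thermal interfaces listed under the cooling concept.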



Power Efficiency




Chapter 4: Review and Summary


Project Review



 Hardware design
     – Almost all critical problems solved in time
     – Network Processor implementation still a challenge
     – No serious problems due to wrong design decisions



 Hardware status
     – Manufacturing quality good: Small bone pile, few defects during operation.



 Time schedule
     – Essentially stayed within planned schedule
     – Implementation of system / application software delayed






Summary



 QPACE is a new, scalable LQCD machine based on the PowerXCell 8i processor.


 Design highlights
      –   FPGA directly attached to processor
      –   LQCD optimized, low latency torus network
      –   Novel, cost-efficient liquid cooling system
      –   High packaging density
      –   Very power efficient architecture



 O(20-30%) sustained performance for key LQCD kernels is reached / feasible

     → O(10-16) TFLOPS / rack (SP)




Chapter 5: Unforgettable Impressions ;-)




Thank you very much for your attention.
Disclaimer



 IBM®, DB2®, MVS/ESA, AIX®, S/390®, AS/400®, OS/390®, OS/400®, iSeries, pSeries,
  xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere®, Netfinity®, Tivoli®, Informix
  and Informix® Dynamic Server™, IBM, BladeCenter and POWER and others are
  trademarks of the IBM Corporation in US and/or other countries.


 Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United
  States, other countries, or both and is used under license therefrom. Linux is a trademark of
  Linus Torvalds in the United States, other countries or both.


 Other company, product, or service names may be trademarks or service marks of others.
  The information and materials are provided on an "as is" basis and are subject to change.




Overview of HPC Interconnects
inside-BigData.com
 
Clustering by AKASHMSHAH
Clustering by AKASHMSHAHClustering by AKASHMSHAH
Clustering by AKASHMSHAH
Akash M Shah
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010TELECOM I+D
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPC
Slide_N
 
Automating the Configuration of the FlexRay Communication Cycle
Automating the Configuration of the FlexRay Communication CycleAutomating the Configuration of the FlexRay Communication Cycle
Automating the Configuration of the FlexRay Communication Cycle
Nicolas Navet
 

Similar to QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.) (20)

Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
Experiences in Application Specific Supercomputer Design - Reasons, Challenge...
 
Japan's post K Computer
Japan's post K ComputerJapan's post K Computer
Japan's post K Computer
 
Cell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupCell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology Group
 
Rama krishna ppts for blue gene/L
Rama krishna ppts for blue gene/LRama krishna ppts for blue gene/L
Rama krishna ppts for blue gene/L
 
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfParallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
 
Parallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.pptParallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.ppt
 
Top 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & AnalysisTop 10 Supercomputers With Descriptive Information & Analysis
Top 10 Supercomputers With Descriptive Information & Analysis
 
High Performance Computing - Challenges on the Road to Exascale Computing
High Performance Computing - Challenges on the Road to Exascale ComputingHigh Performance Computing - Challenges on the Road to Exascale Computing
High Performance Computing - Challenges on the Road to Exascale Computing
 
Industrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric computeIndustrial trends in heterogeneous and esoteric compute
Industrial trends in heterogeneous and esoteric compute
 
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
 
Exascale Capabl
Exascale CapablExascale Capabl
Exascale Capabl
 
Hardware and Software Architectures for the CELL BROADBAND ENGINE processor
Hardware and Software Architectures for the CELL BROADBAND ENGINE processorHardware and Software Architectures for the CELL BROADBAND ENGINE processor
Hardware and Software Architectures for the CELL BROADBAND ENGINE processor
 
Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!Feeding the Multicore Beast:It’s All About the Data!
Feeding the Multicore Beast:It’s All About the Data!
 
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multip...
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
Overview of HPC Interconnects
Overview of HPC InterconnectsOverview of HPC Interconnects
Overview of HPC Interconnects
 
Clustering by AKASHMSHAH
Clustering by AKASHMSHAHClustering by AKASHMSHAH
Clustering by AKASHMSHAH
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPC
 
Automating the Configuration of the FlexRay Communication Cycle
Automating the Configuration of the FlexRay Communication CycleAutomating the Configuration of the FlexRay Communication Cycle
Automating the Configuration of the FlexRay Communication Cycle
 

More from Heiko Joerg Schick

Da Vinci - A scaleable architecture for neural network computing (updated v4)
Da Vinci - A scaleable architecture for neural network computing (updated v4)Da Vinci - A scaleable architecture for neural network computing (updated v4)
Da Vinci - A scaleable architecture for neural network computing (updated v4)
Heiko Joerg Schick
 
Huawei empowers healthcare industry with AI technology
Huawei empowers healthcare industry with AI technologyHuawei empowers healthcare industry with AI technology
Huawei empowers healthcare industry with AI technology
Heiko Joerg Schick
 
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
Heiko Joerg Schick
 
The Smarter Car for Autonomous Driving
 The Smarter Car for Autonomous Driving The Smarter Car for Autonomous Driving
The Smarter Car for Autonomous Driving
Heiko Joerg Schick
 
From edge computing to in-car computing
From edge computing to in-car computingFrom edge computing to in-car computing
From edge computing to in-car computing
Heiko Joerg Schick
 
Need and value for various levels of autonomous driving
Need and value for various levels of autonomous drivingNeed and value for various levels of autonomous driving
Need and value for various levels of autonomous driving
Heiko Joerg Schick
 
Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsHeiko Joerg Schick
 
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFSRun-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFSHeiko Joerg Schick
 
Real time Flood Simulation for Metro Manila and the Philippines
Real time Flood Simulation for Metro Manila and the PhilippinesReal time Flood Simulation for Metro Manila and the Philippines
Real time Flood Simulation for Metro Manila and the PhilippinesHeiko Joerg Schick
 
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)Heiko Joerg Schick
 

More from Heiko Joerg Schick (13)

Da Vinci - A scaleable architecture for neural network computing (updated v4)
Da Vinci - A scaleable architecture for neural network computing (updated v4)Da Vinci - A scaleable architecture for neural network computing (updated v4)
Da Vinci - A scaleable architecture for neural network computing (updated v4)
 
Huawei empowers healthcare industry with AI technology
Huawei empowers healthcare industry with AI technologyHuawei empowers healthcare industry with AI technology
Huawei empowers healthcare industry with AI technology
 
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
The 2025 Huawei trend forecast gives you the lowdown on data centre facilitie...
 
The Smarter Car for Autonomous Driving
 The Smarter Car for Autonomous Driving The Smarter Car for Autonomous Driving
The Smarter Car for Autonomous Driving
 
From edge computing to in-car computing
From edge computing to in-car computingFrom edge computing to in-car computing
From edge computing to in-car computing
 
Need and value for various levels of autonomous driving
Need and value for various levels of autonomous drivingNeed and value for various levels of autonomous driving
Need and value for various levels of autonomous driving
 
Petascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big AnalyticsPetascale Analytics - The World of Big Data Requires Big Analytics
Petascale Analytics - The World of Big Data Requires Big Analytics
 
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFSRun-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
Run-Time Reconfiguration for HyperTransport coupled FPGAs using ACCFS
 
Real time Flood Simulation for Metro Manila and the Philippines
Real time Flood Simulation for Metro Manila and the PhilippinesReal time Flood Simulation for Metro Manila and the Philippines
Real time Flood Simulation for Metro Manila and the Philippines
 
Slimline Open Firmware
Slimline Open FirmwareSlimline Open Firmware
Slimline Open Firmware
 
Agnostic Device Drivers
Agnostic Device DriversAgnostic Device Drivers
Agnostic Device Drivers
 
The Cell Processor
The Cell ProcessorThe Cell Processor
The Cell Processor
 
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
QPACE QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
 

Recently uploaded

From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 

Recently uploaded (20)

From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 

QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)

  • 1. QPACE - QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
       Heiko J Schick – IBM Deutschland R&D GmbH, November 2010
       © 2009 IBM Corporation
  • 2. Agenda
       Chapter 1: Overview
       Chapter 2: Application optimized supercomputers
       Chapter 3: QPACE
       Chapter 4: Review and Summary
       Chapter 5: Unforgettable Impressions ;-)
  • 3. Chapter 1: Overview - Building Blocks of Matter
       QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
       Quarks are the constituents of matter; they interact strongly by exchanging gluons.
       Particular phenomena
         – Confinement
         – Asymptotic freedom (Nobel Prize 2004)
       Theory of strong interactions = Quantum Chromodynamics (QCD)
  • 4. Chapter 1: Overview - Computing Resource Requests
       Lattice QCD community aims for O(1−3) PFlop/s sustained beyond 2010.
       Europe
         – “The computational requirements voiced by these European groups sum up to more than 1 sustained Petaflop/s by 2009.” [HPC in Europe Taskforce (HET), 2006]
       US (USQCD)
         – Hope for O(1) PFlop/s sustained in 2010-11. “A goal with very substantial scientific rewards.” [USQCD SciDAC-2 proposal, 2006]
       Similar requests from Japan.
  • 5. Chapter 2: Application optimized supercomputers - Performance Critical Kernels
       Overall performance of lattice QCD simulations dominated by a few kernels:
         – Linear algebra
             • Single processor operations
             • Typically memory bandwidth limited
         – Global reductions
             • Typically limited by network latency:
             • d-dimensional torus network:
         – Sparse matrix-vector multiplication
  • 6. Chapter 2: Application optimized supercomputers - Relevant Performance Signatures
       Arithmetic operations
         – Floating-point arithmetic with complex operands
         – Dominant operation a × b + c
       Memory operations
         – High data re-use
         – Access pattern:
             • Random, small blocks (optimize for cache)
             • 3 streams, large blocks (vector-like architectures)
       Flow control
         – Simple / predictable
  • 7. Chapter 2: Application optimized supercomputers - Parallelization
       Parallelization strategy
         – Spatial domain decomposition partitions the simulation domain into small 3D sub-domains; one sub-domain is assigned to each processor.
       Nearest neighbour communication
         – 3-4 dimensional torus
       Homogeneous communication patterns
       Large bandwidth
       Access pattern
         – Medium size messages = O(10) kBytes (large local problem size)
         – Small messages = O(0.1) kBytes (small local problem size)
  • 8. Chapter 2: Application optimized supercomputers - Performance Signature: caxpy
       Multiply a vector X by a scalar, add to a vector Y, and store in the vector Y.
       Task: $y_i \leftarrow a\,x_i + y_i$, where $a$ is a complex scalar and $x_i$, $y_i$ are complex 3x4 matrices
       Operations per i: 96 FLOPS
       Information transfer between storage and register file (front-end to processing device):
         – Load: 48 8-byte words
         – Store: 24 8-byte words
       Balance: 1.3 FLOPS / word
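The counts on the caxpy slide can be reproduced with a short Python sketch. It assumes the update has the form y_i = a*x_i + y_i on 3x4 complex matrices, that one complex multiply-add costs 8 real floating-point operations (6 for the multiply, 2 for the add), and that the scalar a stays in a register and is therefore not counted as a load. The function name `caxpy_signature` is illustrative and does not come from the slides.

```python
def caxpy_signature(rows=3, cols=4):
    """Return (flops, words_loaded, words_stored, flops_per_word) for one site i."""
    n_complex = rows * cols                 # 12 complex entries per 3x4 matrix
    # Complex multiply: 4 real multiplies + 2 real adds; complex add: 2 real adds.
    flops_per_entry = 6 + 2
    flops = n_complex * flops_per_entry

    words_per_matrix = 2 * n_complex        # real and imaginary part, one 8-byte word each
    words_loaded = 2 * words_per_matrix     # x_i and y_i are read
    words_stored = words_per_matrix         # y_i is written back
    balance = flops / (words_loaded + words_stored)
    return flops, words_loaded, words_stored, balance


if __name__ == "__main__":
    flops, load, store, balance = caxpy_signature()
    print(f"{flops} FLOPS, load {load} words, store {store} words, "
          f"balance {balance:.1f} FLOPS/word")
```

Under these assumptions the sketch yields 96 FLOPS, 48 words loaded, 24 words stored and a balance of about 1.3 FLOPS per word, matching the figures on the slide.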
  • 9. Chapter 2: Application optimized supercomputers - Sustained Performance
       Bandwidth/throughput of a device:
       Time needed to execute task i:
         where amount of processed data, latency
       Efficiency is
         – “Ideal” execution time
         – “Real” execution time
  • 10. Chapter 2: Application optimized supercomputers - Relevant Hardware Characteristics
       Floating point unit throughput:
         – Caveat: Processor instruction set matching
             • No support for complex arithmetic (e.g. Cell/B.E.)
             • Additional shuffle operations needed.
       Memory bandwidth:
         – Multi-level memory hierarchy
             • External memory
             • Cache
             • Register file
  • 11. Chapter 2: Application optimized supercomputers - Balanced Hardware
       Example caxpy:

       Processor         FPU throughput     Memory bandwidth    Ratio
                         [FLOPS / cycle]    [words / cycle]     [FLOPS / word]
       apeNEXT           8                  2                   4
       QCDOC (MM)        2                  0.63                3.2
       QCDOC (LS)        2                  2                   1
       Xeon              2                  0.29                7
       GPU               128 x 2            17.3 (*)            14.8
       Cell/B.E. (MM)    8 x 4              1                   32
       Cell/B.E. (LS)    8 x 4              8 x 4               2
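As a rough illustration of how this table can be read, the Python sketch below divides FPU throughput by memory bandwidth for each row and compares the result with the caxpy balance of 96 FLOPS per 72 words from the caxpy slide. The input values are copied from the table; the dictionary and function names are hypothetical. The interpretation used here is the usual one: when the hardware can execute more FLOPS per transferred word than the kernel performs per word, the kernel is expected to be limited by memory bandwidth. Note that for the Cell/B.E. (LS) row the plain quotient of the two listed values is 1, whereas the slide lists 2; the sketch does not attempt to explain that difference.

```python
# (FPU throughput [FLOPS/cycle], memory bandwidth [words/cycle]) as listed on the slide.
HARDWARE = {
    "apeNEXT": (8, 2),
    "QCDOC (MM)": (2, 0.63),
    "QCDOC (LS)": (2, 2),
    "Xeon": (2, 0.29),
    "GPU": (128 * 2, 17.3),
    "Cell/B.E. (MM)": (8 * 4, 1),
    "Cell/B.E. (LS)": (8 * 4, 8 * 4),
}

# Balance of the caxpy kernel: 96 FLOPS per 48 loaded + 24 stored words.
CAXPY_BALANCE = 96 / (48 + 24)


def hardware_ratio(flops_per_cycle, words_per_cycle):
    """FLOPS the hardware can execute per word it can transfer."""
    return flops_per_cycle / words_per_cycle


def is_memory_bound(name, kernel_balance=CAXPY_BALANCE):
    """True if the hardware ratio exceeds the kernel's FLOPS-per-word balance."""
    flops_per_cycle, words_per_cycle = HARDWARE[name]
    return hardware_ratio(flops_per_cycle, words_per_cycle) > kernel_balance


if __name__ == "__main__":
    for name, (fpu, mem) in HARDWARE.items():
        ratio = hardware_ratio(fpu, mem)
        verdict = "memory bound" if is_memory_bound(name) else "not memory bound"
        print(f"{name:15s} {ratio:5.1f} FLOPS/word -> caxpy is {verdict}")
```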
  • 12. Chapter 2: Application optimized supercomputers - Cell/B.E. Architecture
  • 13. Chapter 2: Application optimized supercomputers - Balanced Systems ?!?
  • 14. Chapter 2: Application optimized supercomputers - … but are they Reliable, Available and Serviceable ?!?
  • 15. Chapter 3: QPACE - Collaboration and Credits
       QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
       Academic Partners
         – University Regensburg: S. Heybrock, D. Hierl, T. Maurer, N. Meyer, A. Nobile, A. Schaefer, S. Solbrig, T. Streuer, T. Wettig
         – University Wuppertal: Z. Fodor, A. Frommer, M. Huesken
         – University Ferrara: M. Pivanti, F. Schifano, R. Tripiccione
         – University Milano: H. Simma
         – DESY Zeuthen: D. Pleiter, K.-H. Sulanke, F. Winter
         – Research Lab Juelich: M. Drochner, N. Eicker, T. Lippert
       Industrial Partner
         – IBM (DE, US, FR): H. Baier, H. Boettiger, A. Castellane, J.-F. Fauh, U. Fischer, G. Goldrian, C. Gomez, T. Huth, B. Krill, J. Lauritsen, J. McFadden, I. Ouda, M. Ries, H.J. Schick, J.-S. Vogt
       Main Funding
         – DFG (SFB TR55), IBM
       Support by Others
         – Eurotech (IT), Knuerr (DE), Xilinx (US)
  • 16. Project Timetable
       01/08     Official project start
       06/08     Node card bring-up
       10/08     Fully populated backplane
       01/09     Hardware integration tests
       02-03/09  Release to manufacturing
       05/09     Integration of 1st rack
       07/09     Deployment of 2 racks at JSC
       08/09     Deployment of 4 racks at JSC and 4 racks at University Wuppertal complete
  • 17. Production Chain
       Major steps
         – Pre-integration at University Regensburg
         – Integration at IBM / Boeblingen
         – Installation at FZ Juelich and University Wuppertal
  • 18. Chapter 3: QPACE - Concept
       System
         – Node card with IBM® PowerXCell™ 8i processor and network processor (NWP)
             • Important feature: fast double precision arithmetic
         – Commodity processor interconnected by a custom network
         – Custom system design
         – Liquid cooling system
       Rack parameters
         – 256 node cards
             • 26 TFLOPS peak (double precision)
             • 1 TB memory
         – O(35) kW power consumption
       Applications
         – Target sustained performance of 20-30%
         – Optimized for calculations in theoretical particle physics: simulation of Quantum Chromodynamics
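The rack parameters can be cross-checked with a little arithmetic. The Python sketch below is an estimate under stated assumptions, not a specification: it takes the 8 x 4 FLOPS per cycle quoted for Cell/B.E. in the Balanced Hardware table (counting only the eight SPEs), the 3.2 GHz clock and 4 GB of memory per node card from the Node Card slide, and treats the O(35) kW figure as exactly 35 kW. All constant and function names are illustrative.

```python
SPES_PER_CHIP = 8                 # synergistic processor elements per PowerXCell 8i
FLOPS_PER_SPE_PER_CYCLE = 4       # from the "8 x 4" entry in the Balanced Hardware table
CLOCK_HZ = 3.2e9                  # processor clock from the Node Card slide
NODE_CARDS_PER_RACK = 256
MEMORY_PER_NODE_GBYTE = 4         # DDR2 memory per node card
RACK_POWER_WATT = 35e3            # assumption: O(35) kW taken as exactly 35 kW


def rack_estimates():
    """Derive rack-level figures from the per-node numbers."""
    peak_node = SPES_PER_CHIP * FLOPS_PER_SPE_PER_CYCLE * CLOCK_HZ   # FLOPS per node
    peak_rack = peak_node * NODE_CARDS_PER_RACK                      # FLOPS per rack
    return {
        "peak_node_gflops": peak_node / 1e9,
        "peak_rack_tflops": peak_rack / 1e12,
        "memory_tbyte": NODE_CARDS_PER_RACK * MEMORY_PER_NODE_GBYTE / 1024,
        # Target sustained performance of 20-30 % of peak.
        "sustained_tflops": (0.20 * peak_rack / 1e12, 0.30 * peak_rack / 1e12),
        "peak_mflops_per_watt": peak_rack / RACK_POWER_WATT / 1e6,
    }


if __name__ == "__main__":
    for key, value in rack_estimates().items():
        print(key, value)
```

With these inputs the estimate gives about 102 GFLOPS per node, about 26 TFLOPS and 1 TB per rack, a sustained target of roughly 5 to 8 TFLOPS per rack, and on the order of 750 MFLOPS per watt at peak.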
  • 19. Chapter 3: QPACE - Networks
       Torus network
         – Nearest-neighbor communication, 3-dimensional torus topology
         – Aggregate bandwidth 6 GByte/s per node and direction
         – Remote DMA communication (local store to local store)
       Interrupt tree network
         – Evaluation of global conditions and synchronization
         – Global exceptions
         – 2 signals per direction
       Ethernet network
         – 1 Gigabit Ethernet link per node card to rack-level switches (switched network)
         – I/O to parallel file system (user input / output)
         – Linux network boot
         – Aim of O(10) GB/s bandwidth per rack
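Nearest-neighbor addressing on a 3-dimensional torus can be sketched in a few lines of Python. The example is illustrative only: the torus dimensions (4, 4, 4) are hypothetical and not taken from the slides, and the link names x+, x-, y+, y-, z+, z- simply follow the usual labelling of the six directions. Each node has one link per direction, which at 1 GByte/s per link adds up to the 6 GByte/s aggregate quoted above.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]

LINK_GBYTE_PER_S = 1  # per external link, as listed for the node card


def torus_neighbors(node: Coord, dims: Coord) -> Dict[str, Coord]:
    """Return the six nearest neighbors of `node` on a periodic 3-D grid."""
    neighbors = {}
    for axis, name in enumerate("xyz"):
        for step, sign in ((1, "+"), (-1, "-")):
            coord = list(node)
            # The modulo wraps around at the edges, which closes the grid into a torus.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors[name + sign] = tuple(coord)
    return neighbors


if __name__ == "__main__":
    dims = (4, 4, 4)  # hypothetical torus size
    links = torus_neighbors((0, 0, 0), dims)
    for link, coord in links.items():
        print(link, coord)
    print("aggregate bandwidth:", len(links) * LINK_GBYTE_PER_S, "GByte/s")
```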
  • 20. Chapter 3: QPACE
       [Figure: rack and its components]
         – Rack
         – Root Card (16 per rack)
         – Backplane (8 per rack)
         – Node Card (256 per rack)
         – Power Supply and Power Adapter Card (24 per rack)
  • 21. Chapter 3: QPACE - Node Card
       Components
         – IBM PowerXCell 8i processor, 3.2 GHz
         – 4 Gigabyte DDR2 memory, 800 MHz, with ECC
         – Network processor (NWP): Xilinx Virtex-5 LX110T FPGA
         – Ethernet PHY
         – 6 x 1 GB/s external links using PCI Express physical layer
         – Service Processor (SP): Freescale MCF52211
         – FLASH (firmware and FPGA configuration)
         – Power subsystem
         – Clocking
       Network Processor
         – FLEXIO interface to PowerXCell 8i processor, 2 bytes wide with 3 GHz bit rate
         – Gigabit Ethernet
         – UART FW Linux console
         – UART SP communication
         – SPI Master (boot flash)
         – SPI Slave for training and configuration
         – GPIO
  • 22. Chapter 3: QPACE Node Card [Node card picture, labels:] Network Processor (FPGA) – Network PHYs – PowerXCell 8i Processor – Memory
  • 23. Chapter 3: QPACE Node Card [Block diagram, labels:] PowerXCell 8i with 4 x DDR2 (800 MHz) – FlexIO, 2 x 6 GB/s, between PowerXCell 8i and FPGA – FPGA Virtex-5 (LX110T) – SP Freescale MCF52211 (RS232, SPI, I2C, UART) – Power Subsystem (I2C) – Clocking – SPI RW (Debug) – SPI Flash – GigE PHY – 6 x 1 GB/s PHY (Compute Network) – 384 IO @ 250 MHz: 4*8*2*6 = 384 IO, 680 available (LX110T)
  • 24. Chapter 3: QPACE Network Processor [Block diagram, labels:] Link PHY interfaces (x+, x-, ..., z-) – Network Logic – Routing, Arbitration, FIFOs – FlexIO Interface – Ethernet PHY Interface – Configuration – Global Signals – Serial Interfaces – SPI Flash  FPGA utilization – Slices 92 % – Pins 86 % – LUT-FF pairs 73 % – Flip-Flops 55 % – LUTs 53 % – BRAM / FIFOs 35 %  Share by function (Flip-Flops / LUTs) – Processor Interface 53 % / 46 % – Torus 36 % / 39 % – Ethernet 4 % / 2 %
  • 25. Chapter 3: QPACE Network Processor [Block diagram, labels:] FlexIO – RocketIO – IOC (IOIF) – GBIF Slave (Receive Requests) / GBIF Master (Send Requests) – Switch / Address Decoder / FIFOs / Bus Controller – 6 x 1 GB/s  IBM: • RocketIO Logic • IOC Logic • GBIF Logic  Academic Partners: • Network Processor Logic
  • 26. Chapter 3: QPACE Processor Bus Interface  FlexIO Interface – High bandwidth interface between IBM PowerXCell 8i processor and Xilinx Virtex-5 FPGA – Implementation from Rambus Inc – Optimized for intra-board environments – Uses RocketIO GTP transceiver features – Requires link training after power-on • Phase calibration (aligns the data for optimal sampling point) • Parallel calibration (synchronizes the receive deserializer with the transmit serializer) • Levelization calibration (aligns all data lanes)  Challenges – Speed, latency, bandwidth and timing (clock) – 3 GByte/s communication channel – 2-byte-wide link
  • 27. Chapter 3: QPACE Torus Network Physical Layer  Physical layer – 10GbE @ 2.5 GHz → 1 GByte/s  Eye diagram for bad case link – 3.125 GHz – 40 cm PCB, 50 cm cable – 1 PCB-PCB, 2 PCB-cable connectors  Custom data link layer – Fixed size messages – 128 Byte payload + 4 Byte header + 4 Byte CRC → Minimal protocol overhead
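The two arrows on this slide can be made concrete with a short calculation. The sketch below assumes that the 10GbE PHY is operated as 4 serial lanes with 8b/10b line coding (the usual XAUI arrangement, not stated on the slide); the frame sizes are the ones given above.

```python
# Illustrative arithmetic for the torus physical and data link layers.
LANES = 4                 # assumption: XAUI-style 4-lane PHY
ENCODING = 8 / 10         # assumption: 8b/10b line coding
PAYLOAD_BYTES = 128       # from the slide
HEADER_BYTES = 4          # from the slide
CRC_BYTES = 4             # from the slide


def raw_bandwidth_gbyte_s(lane_rate_gbit_s):
    """Usable link bandwidth in GByte/s after line coding."""
    return LANES * lane_rate_gbit_s * ENCODING / 8


def frame_bytes():
    """Total size of one fixed-size torus message on the wire."""
    return PAYLOAD_BYTES + HEADER_BYTES + CRC_BYTES


def payload_efficiency():
    """Fraction of transmitted bytes that carry payload."""
    return PAYLOAD_BYTES / frame_bytes()


if __name__ == "__main__":
    bw = raw_bandwidth_gbyte_s(2.5)
    print(f"link bandwidth: {bw:.2f} GByte/s")
    print(f"frame size: {frame_bytes()} bytes")
    print(f"payload efficiency: {payload_efficiency():.1%}")
    print(f"payload bandwidth: {bw * payload_efficiency():.3f} GByte/s")
```

With 8 bytes of framing per 128 bytes of payload, about 94% of the link bandwidth carries payload. Acknowledgements and other link-level traffic are ignored here.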
  • 28. Torus Network Architecture  2-sided communication – Node A initiates send, node B initiates receive – Send and receive commands have to match – Multiple use of same link by virtual channels  Send / receive from / to local store or main memory – CPU → NWP • CPU moves data and control info to NWP • Back-pressure controlled – NWP → NWP • Independent of processor • Each datagram has to be acknowledged – NWP → CPU • CPU provides credits to NWP • NWP writes data into processor • Completion indicated by notification
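The credit mechanism on the receive side can be illustrated with a toy model. This is a sketch of the general idea only, not the QPACE hardware protocol: the class `ToyLink` and its method names are hypothetical, and virtual channels, acknowledgements and back-pressure on the sending side are left out.

```python
from collections import deque


class ToyLink:
    """Toy model of credit-based delivery from the receiving NWP to the CPU."""

    def __init__(self):
        self.pending = deque()    # datagrams held by the receiving NWP
        self.credits = 0          # receive credits provided by the CPU
        self.delivered = []       # data written into the processor
        self.notifications = 0    # completion notifications raised

    def send(self, datagram):
        """Node A initiates a send; the datagram arrives at node B's NWP."""
        self.pending.append(datagram)
        self._drain()

    def post_receive(self, count=1):
        """Node B initiates a receive by giving credits to its NWP."""
        self.credits += count
        self._drain()

    def _drain(self):
        # The NWP writes data into the processor only while credits remain.
        while self.credits > 0 and self.pending:
            self.delivered.append(self.pending.popleft())
            self.credits -= 1
            self.notifications += 1


if __name__ == "__main__":
    link = ToyLink()
    link.send(b"first")
    print(link.delivered)      # nothing delivered yet: no receive posted
    link.post_receive()
    print(link.delivered)      # the matching receive releases the datagram
```

The model shows why send and receive commands have to match: a datagram without a matching credit stays in the network processor until the CPU posts a receive.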
  • 29. Chapter 3: QPACE Torus Network Reconfiguration  Torus network PHYs provide 2 interfaces – Used for network reconfiguration by selecting primary or secondary interface  Example – 1x8 or 2x4 node cards  Partition sizes (1,2,2N) * (1,2,4,8,16) * (1,2,4,8) – N ... number of racks connected via cables
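The partition size formula can be enumerated directly. The sketch below reads the three parenthesized groups as the allowed extents in the three torus dimensions, which is an interpretation of the slide rather than something it states; the function names are hypothetical.

```python
from itertools import product


def partition_shapes(n_racks):
    """List the partition shapes (x, y, z) allowed for `n_racks` racks.

    Allowed extents, read from the slide:
    x in (1, 2, 2N), y in (1, 2, 4, 8, 16), z in (1, 2, 4, 8).
    """
    x_sizes = sorted({1, 2, 2 * n_racks})   # the set drops the duplicate for N = 1
    y_sizes = (1, 2, 4, 8, 16)
    z_sizes = (1, 2, 4, 8)
    return list(product(x_sizes, y_sizes, z_sizes))


def largest_partition_nodes(n_racks):
    """Number of node cards in the largest partition."""
    return max(x * y * z for x, y, z in partition_shapes(n_racks))


if __name__ == "__main__":
    for n in (1, 2, 4):
        shapes = partition_shapes(n)
        print(n, "rack(s):", len(shapes), "shapes, largest has",
              largest_partition_nodes(n), "node cards")
```

Under this reading the largest partition has 2N * 16 * 8 = 256 N node cards, which matches the 256 node cards per rack.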
  • 30. Chapter 3: QPACE Cooling  Concept – Node card mounted in housing = heat conductor – Housing connected to liquid cooled cold plate – Critical thermal interfaces • Processor – thermal box • Thermal box – cold plate – Dry connection between node card and cooling circuit  Node card housing – Closed node card housing acts as heat conductor – Heat conductor is linked with liquid-cooled “cold plate” – Cold plate is placed between two rows of node cards  Simulation results for one cold plate – Ambient 12°C – Water 10 L / min – Load 4224 W (2112 W per side)
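As a rough plausibility check of the simulation parameters, the temperature rise of the water across one cold plate follows from the heat balance P = m_dot * c_p * dT. The sketch assumes that all 4224 W end up in the water and uses textbook values for the density and specific heat of water; neither assumption comes from the slide.

```python
WATER_DENSITY_KG_PER_L = 1.0        # assumption: textbook value
WATER_CP_J_PER_KG_K = 4186.0        # assumption: textbook value


def water_temperature_rise(load_watt, flow_l_per_min):
    """Temperature rise in kelvin of the cooling water across one cold plate."""
    mass_flow_kg_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    return load_watt / (mass_flow_kg_s * WATER_CP_J_PER_KG_K)


if __name__ == "__main__":
    dt = water_temperature_rise(load_watt=4224, flow_l_per_min=10)
    print(f"water temperature rise: {dt:.1f} K")
```

With the slide's values this gives a rise of about 6 K between inlet and outlet.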
  • 31. Chapter 3: QPACE Power Efficiency
  • 32. Chapter 4: Review and Summary Project Review  Hardware design – Almost all critical problems solved in time – Network Processor implementation still a challenge – No serious problems due to wrong design decisions  Hardware status – Manufacturing quality good: Small bone pile, few defects during operation.  Time schedule – Essentially stayed within planned schedule – Implementation of system / application software delayed
  • 33. Chapter 4: Review and Summary Summary  QPACE is a new, scalable LQCD machine based on the PowerXCell 8i processor.  Design highlights – FPGA directly attached to processor – LQCD optimized, low latency torus network – Novel, cost-efficient liquid cooling system – High packaging density – Very power efficient architecture  O(20-30%) sustained performance for key LQCD kernels is reached / feasible → O(10-16) TFLOPS / rack (SP)
  • 34. Chapter 5: Unforgettable Impressions ;-)
  • 46. Thank you very much for your attention.
  • 47. Disclaimer  IBM®, DB2®, MVS/ESA, AIX®, S/390®, AS/400®, OS/390®, OS/400®, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere®, Netfinity®, Tivoli®, Informix and Informix® Dynamic Server™, BladeCenter, POWER and others are trademarks of the IBM Corporation in the US and/or other countries.  Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linux is a trademark of Linus Torvalds in the United States, other countries, or both.  Other company, product, or service names may be trademarks or service marks of others. The information and materials are provided on an "as is" basis and are subject to change.