The Cell Processor

Notes
  • VMX AltiVec SIMD instructions on IBM PowerPC processors; less speculative logic
  • The switch does not exist yet.
  • Dr. V. S. Pande, Distributed Computing Project, Stanford University (permission given for showing the video as well). Folding@Home on the PS3: the Cure@PS3 project.
    INTRODUCTION: Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation. By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine. FAH has targeted the study of protein folding and protein folding disease, and numerous scientific advances have come from the project. Now in 2006, we are looking forward to another major advance in capabilities. This advance utilizes the new Cell processor in Sony’s PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers. With this new technology (as well as new advances with GPUs), we will likely be able to attain performance on the 100 gigaflop scale per computer. With about 10,000 such machines, we would be able to achieve performance on the petaflop scale. With software from Sony, the PlayStation 3 will now be able to contribute to the Folding@Home project, pushing Folding@Home a major step forward. Our goal is to apply this new technology to push Folding@Home into a new level of capabilities, applying our simulations to further study of protein folding and related diseases, including Alzheimer’s Disease, Huntington's Disease, and certain forms of cancer. With these computational advances, coupled with new simulation methodologies to harness the new techniques, we will be able to address questions previously considered impossible to tackle computationally, and make even greater impacts on our knowledge of folding and folding related diseases.
    ADVANCED FEATURES FOR THE PS3: The PS3 client will also support some advanced visualization features. While the Cell microprocessor does most of the calculation processing of the simulation, the graphics chip of the PLAYSTATION 3 system (the RSX) displays the actual folding process in real-time using new technologies such as HDR and ISO surface rendering. It is possible to navigate the 3D space of the molecule using the interactive controller of the PS3, allowing us to look at the protein from different angles in real-time. For a preview of a prototype of the GUI for the PS3 client, check out a screenshot or one of these videos (355K avi, 866K avi, 6MB avi, 6MB avi -- more videos and formats to come). There is also a "bootleg" video of Sony's presentation on FAH that is now on YouTube (although the audio and video quality is pretty bad). http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats
  • Cell Blade systems compute and compress images. These images are then delivered via the network to clients for decompression and display. The GPStream framework can be used to deliver the images to mobile clients via wireless. This is really an example of situational awareness. In this specific case, the Predator Unmanned Aerial Vehicle has a small camera mounted in the nose (the blue circle would be live video), and the surroundings would be rendered for the remote pilot to help them avoid turning into a mountain or a no-fly zone. We think this is also valid for commercial aircraft at night, in poor weather, etc.
  • An experienced physician can read a great deal from cross-sectional images. But three-dimensional images that are dynamic, i.e. that include the factor of time, open up entirely new diagnostic possibilities.
    Medical imaging is another area that is progressing rapidly and creating a new, more demanding workload. Today an average exam generates 1 GByte of data; moving to time-dependent analysis in the future is not possible without an application-optimized system. An average exam generates 1 GByte of data (for one digital X-ray or simple CT scan - much more for complicated CT or MRI studies). We estimate that 10^2-10^4 floating point operations are used to capture, process and analyze a byte of medical data, so a typical exam requires 10^11-10^13 operations. Assume an exam must be completed in “real time” (5 minutes?) to be of diagnostic use; this requires 0.3-33 GF/s of compute power, delivered today by single-processor Intel workstations. Scanner technology will rapidly evolve to generate 10-20x the amount of data in the same scan time. Sixteen-slice CT scanner: 600-2000 slices per exam → 300 MB - 1 GB per exam.
    CT scan workflow (typical helical scan, multi-slice acquisition). Stage 1: Interpolate data to generate equivalent “step-and-shoot” slices. Stage 2: Filtered Back-Projection to generate the 2D slice view (Fourier filter + numerical integration). Stage 3: Volume rendering (optional; many radiologists prefer to look at slices, but with increasing resolution/slice count it may become mandatory). Note (1): Stage 2 should be trivially parallelizable (scale out). Note (2): An increase in the number of slices acquired simultaneously → increased computational cost for “cone-effect” corrections. Note (3): There are claims that improved algorithms can reduce the computational burden enormously (UIUC Technology Licensing Office). Example: 313MB of raw scan data → 5 x 1MB images (cross-sections?). Each image takes 19 seconds to process on a 3GHz Wintel box. A high-resolution 3000-slice run (from machines like the new Siemens Somatom 64) might take ~16 hours to process on such a commodity system. Note that the 3GB of 2D image data can be accommodated within main memory.
    PV-4D (www.pv-4d.com), showcase at Supercomputing 2005 / CeBIT 2006: about 4 times faster than Opteron with the same algorithm; if fully optimized, projected about > 6 times faster than Opteron. Last-minute prototype running on four Cell blades. Stereo display using shutter glasses, 8-10 frames per second - achieving this frame rate using two blades at a time - four blades required for the data set size. Data sets about 1.6GB in size - beating heart (400x400x400 voxels, 6 samples) - CFD simulation (~600x200x100 voxels, 40 samples).
  • Handling large data; handling large code; SIMD aspect?
  • Q: What are the parameters to spe_create_thread…
  • Handling large data; handling large code; SIMD aspect?
  • Handling large data; handling large code; SIMD aspect?
  • Middleware / libraries likely to be optimized: media (e.g., mplayer), encryption (e.g., OpenSSH). PPE = Power Processor Element
  • Transcript

    • 1. The Cell Processor Computing of tomorrow or yesterday? Open Systems Design and Development 2007-04-12 | Heiko J Schick <schickhj@de.ibm.com> © 2007 IBM Corporation
    • 2. Agenda
      • Introduction
      • Limiters to Processor Performance
      • Cell Architecture
      • Cell Platform
      • Cell Applications
      • Cell Programming
      • Appendix
    • 3. 1 Introduction
    • 4. Cell History
      • IBM, SCEI / Sony and Toshiba Alliance formed in 2000
      • Design Center opened in March 2001 (Based in Austin, Texas)
      • Single Cell BE operational Spring 2004
      • 2-way SMP operational Summer 2004
      • February 7, 2005: First technical disclosures
      • November 9, 2005: Open Source SDK Published
    • 5. The problem is… … the view from the computer room!
    • 6. Outlook Source: Kurzweil “Computer performance has been increasing exponentially for 100 years!!!”
    • 7. But what could you do if all objects were intelligent… … and connected?
    • 8. What could you do with unlimited computing power… for pennies? Could you predict the path of a storm down to the square kilometer? Could you identify another 20% of proven oil reserves without drilling one hole?
    • 9. 2 Limiters to Processor Performance
    • 10. Power Wall / Voltage Wall
      • Power components:
        • Active Power
        • Passive Power
          • Gate leakage
          • Sub-threshold leakage (source-drain leakage)
      Source: Tom’s Hardware Guide 1
    • 11. Memory Wall
      • Main memory now nearly 1000 cycles from the processor
        • Situation worse with (on-chip) SMP
      • Memory latency penalties drive inefficiency in the design
        • Expensive and sophisticated hardware to try and deal with it
        • Programmers that try to gain control of cache content are hindered by the hardware mechanisms
      • Latency induced bandwidth limitations
        • Much of the bandwidth to memory in systems can only be used speculatively
      2
    • 12. Frequency Wall
      • Increasing frequencies and deeper pipelines have reached diminishing returns on performance
      • Returns negative if power is taken into account
      • Results of studies depend on issue width of processor
        • The wider the processor the slower it wants to be
        • Simultaneous Multithreading helps to use issue slots efficiently
      • Results depend on number of architected registers and workload
        • More registers tolerate deeper pipeline
        • Fewer random branches in application tolerates deeper pipelines
      3
    • 13. Microprocessor Efficiency
      • Gelsinger’s law
        • 1.4x more performance for 2x more transistors
      • Hofstee’s corollary
        • 1/1.4x efficiency loss in every generation
        • Examples: Cache size, Out-of-Order, Super-scalar, etc.
      Source: Tom’s Hardware Guide Increasing performance requires increasing efficiency !!!
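      To see what the corollary implies (my own worked example, not from the slide): if each generation delivers 1.4x the performance from 2x the transistors, performance per transistor shrinks by a factor of 1.4/2 = 0.7 per generation, so after five generations only

      $$ \left(\tfrac{1.4}{2}\right)^{5} = 0.7^{5} \approx 0.17 $$

      of the original per-transistor efficiency remains, roughly 17%.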
    • 14. Attacking the Performance Walls
      • Multi-Core Non-Homogeneous Architecture
        • Control Plane vs. Data Plane processors
        • Attacks Power Wall
      • 3-level Model of Memory
        • Main Memory, Local Store, Registers
        • Attacks Memory Wall
      • Large Shared Register File & SW Controlled Branching
        • Allows deeper pipelines (11FO4 helps power)
        • Attacks Frequency Wall
    • 15. 3 Cell Architecture
    • 16. Cell BE Processor
      • ~250M transistors
      • ~235 mm²
      • Top frequency > 3 GHz
      • 9 cores, 10 threads
      • > 200 GFLOPS (SP) @ 3.2 GHz
      • > 20 GFLOPS (DP) @ 3.2 GHz
      • Up to 25.6 GB/s memory B/W
      • Up to 76.8 GB/s I/O B/W
      • ~$400M (US) design investment
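      As a quick sanity check on the single-precision number (my own estimate, not from the slide): each SPE can retire one 4-wide single-precision fused multiply-add per cycle, i.e. 8 flops per cycle, so

      $$ 8\ \text{SPEs} \times 8\ \tfrac{\text{flops}}{\text{cycle}} \times 3.2\ \text{GHz} \approx 204.8\ \text{GFLOPS (SP)} $$

      which is where the "> 200 GFLOPS" figure comes from; the SPE double-precision pipeline is far slower, hence the much lower DP number.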
    • 17. Key Attributes of Cell
      • Cell is Multi-Core
        • Contains 64-bit Power Architecture™
        • Contains 8 Synergistic Processor Elements (SPE)
      • Cell is a Flexible Architecture
        • Multi-OS support (including Linux) with Virtualization technology
        • Path for OS, legacy apps, and software development
      • Cell is a Broadband Architecture
        • SPE is RISC architecture with SIMD organization and Local Store
        • 128+ concurrent transactions to memory per processor
      • Cell is a Real-Time Architecture
        • Resource allocation (for Bandwidth Measurement)
        • Locking Caches (via Replacement Management Tables)
      • Cell is a Security Enabled Architecture
        • SPE dynamically reconfigurable as secure processors
    • 18.
    • 19. Power Processor Element (PPE)
      • 64-bit Power Architecture™ with VMX
      • In-order, 2-way hardware Multi-threading
      • Coherent Load/Store with 32KB I & D L1 and 512KB L2
      • Controls the SPEs
    • 20. Synergistic Processor Elements (SPEs)
      • SPE provides computational performance
        • Dual issue, up to 16-way 128-bit SIMD
        • Dedicated resources: 128 x 128-bit register file, 256KB Local Store
        • Each can be dynamically configured to protect resources
        • Dedicated DMA engine: Up to 16 outstanding requests
        • Memory flow controller for DMA
        • 25 GB/s DMA data transfer
        • “I/O Channels” for IPC
      • Separate Cores
      • Simple Implementation (e.g. no branch prediction)
      • No Caches
      • No protected instructions
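      To make the 16-way/128-bit SIMD organization concrete, here is a minimal SPU-side sketch (my illustration, not from the slides) using the SPU C language extensions; spu_madd performs an element-wise fused multiply-add on four packed single-precision floats held in one 128-bit register:

      #include <spu_intrinsics.h>

      /* Illustration only: a single SPU instruction operates on a whole
       * 128-bit register, here four packed floats (a*x + y element-wise). */
      vector float saxpy4(vector float a, vector float x, vector float y)
      {
          return spu_madd(a, x, y);
      }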
    • 21. SPE Block Diagram [Diagram: Permute, Load-Store, Floating-Point, Fixed-Point, Branch, and Channel units fed by a register file with result forwarding and staging; 256KB single-port SRAM Local Store with 128B read and 128B write ports; DMA unit; instruction issue unit / instruction line buffer; internal paths of 8, 16, 64, and 128 Bytes/cycle to the on-chip coherent bus]
    • 22. Element Interconnect Bus
      • Four 16 byte data rings, supporting multiple transfers
      • 96B/cycle peak bandwidth
      • Over 100 outstanding requests
      • 300+ GByte/sec @ 3.2 GHz
      Element Interconnect Bus (EIB)
    • 23.
      • Four 16B data rings connecting 12 bus elements
        • Two clockwise / Two counter-clockwise
      • Physically overlaps all processor elements
      • Central arbiter supports up to three concurrent transfers per data ring
        • Two stage, dual round robin arbiter
      • Each element port simultaneously supports 16B in and 16B out data path
        • Ring topology is transparent to element data interface
      Element Interconnect Bus (EIB) [Diagram: four 16B data rings and a central data arbiter connecting SPE0–SPE7, MIC, PPE, BIF/IOIF0, and IOIF1, each element port with a 16B-in and 16B-out data path]
    • 24. Example of eight concurrent transactions [Diagram: the central data arbiter controls ramp controllers 0–11 to route eight simultaneous transfers among PPE, SPE0–SPE7, MIC, BIF/IOIF0, and IOIF1 across rings 0–3]
    • 25. I/O and Memory Interfaces
      • I/O provides wide bandwidth
        • Dual XDR™ controller (25.6GB/s @ 3.2Gbps)
        • Two configurable interfaces (76.8GB/s @ 6.4Gbps)
          • Configurable number of Bytes
          • Coherent or I/O Protection
        • Allows for multiple system configurations
    • 26.
    • 27. 4 Cell Platform
    • 28.
      • Game console systems
      • Blades
      • HDTV
      • Home media servers
      • Supercomputers
      • ...... ?
      The Cell processor can support many systems [Diagram: single-Cell, dual-Cell (coupled via BIF), and larger configurations connected through a switch (SW), each Cell BE processor with two XDR™ memory channels and IOIF interfaces]
    • 29.
      • Chassis
        • Standard IBM BladeCenter with:
          • 7 Blades (for 2 slots each) with full performance
          • 2 switches (1Gb Ethernet) with 4 external ports each
        • Updated Management Module Firmware.
        • External Infiniband Switches with optional FC ports.
      • Blade (400 GFLOPs)
        • Game Processor and Support Logic:
          • Dual Processor Configuration
          • Single SMP OS image
          • 1GB XDRAM
          • Optional PCI Express-attached standard graphics adapter
        • BladeCenter Interface ( Based on IBM JS20):
          • New Blade Power System and Sense Logic Control
          • Firmware to connect processor & support logic to H8 service processor
          • Signal Level Converters for processor & support logic
          • 2 Infiniband (IB) Host Adapters with 2x IB 4x each
          • Physical link drivers (GbE Phy etc)
      QS20 Blade Hardware Description [Diagram: two Cell BE processors, each with 1/2 GB Rambus XDR DRAM and a South Bridge; two InfiniBand 4X links; BladeCenter interface with H8 service processor, blade input power and sense logic, level converters, and GbE Phy; chassis provides 2x (+12V, RS-485, USB, GbEn)]
    • 30. QS20 Blade (w/o heatsinks)
    • 31. QS20 Blade Assembly
      • ATA Disk
      • Service Proc.
      • South Bridges
      • InfiniBand Cards
      • Blade Bezel
    • 32.
      • Up to 2 InfiniBand Cards can be attached.
      • Standard PC InfiniBand Card with special bezel
        • MHEA28-1TCSB Dual-Port HCA
        • PCI Express x8 interface
        • Dual 10 Gb/s InfiniBand 4X Ports
        • 128 MB Local Memory
        • IBTA v1.1 Compatible Design
      Options - InfiniBand
    • 33. Cell Software Stack [Diagram: hardware at the bottom; firmware layer with low-level FW, SLOF, RTAS, and a secondary boot loader; the Linux kernel with common code, powerpc architecture dependent code, powerpc- and cell-specific code (pSeries, PMac, cell platforms), memory management, scheduler, and device drivers; user space on top with gcc (ppc64 and spu backends), glibc, and applications]
    • 34. Cell BE Development Platform [Diagram: on top of Cell BE hardware and firmware sit the Cell Linux kernel, the basic Cell runtime (lib_spe, spe libc, …) and toolchain (gcc, binutils, gdb, oprofile, …) as the lower-level programming interface; Cell optimized libraries and specialized compilers as the higher-level interface; Cell-aware tooling and segment-specific application frameworks as the application-level interface, all within a standard Linux/ppc64 development environment on a developer workstation]
      • Cell is an exotic platform and hard to program
      • Challenging to exploit SPEs: limited local memory (256 KB) means data and code fragments must be DMAed back and forth, and multi-level parallelism spans 8 SPEs with 128-bit wide SIMD units in each SPE. If done right, the result is impressive performance…
      • Make Cell easier to program
      • Hide complexity in critical libraries
      • Compiler support for standard tasks, e.g., overlays, global data access, SW-managed cache, auto vectorization, auto parallelization, …
      • Smart tooling
      • Make Cell a standard platform
      • Middleware and frameworks provide architecture-specific components and hide Cell-specifics from the application developer
    • 35.
      • Alpha Quality
        • SDK hosted on FC4 / X86
      • OS: Initial Linux Cell 2.6.14 patches
      • SPE Threads runtime
      • XLC Cell C Compiler
      • SPE gdb debugger
      • Cell Coding Sample Source
      • Documentation
        • Installation Scripts
        • Cell Hardware Specs
        • Programming Docs
      SDK1.0
      • GCC Tools from SCEA
        • gcc 3.0 for Cell
        • Binutils for Cell
      • Alpha Quality
        • SDK hosted on FC5 / X86
      • Critical Linux Cell Performance Enhancements
        • Cell Enhanced Functions
      • Critical Cell RAS Functions
        • Machine Check, System Error
      • Performance Analysis Tools
        • Oprofile – PPU Cycle only profiling (No SPU)
      • GNU Toolchain updates
      • Mambo Updates
      • Julia Set Sample
      SDK1.1 – Execution platform: Cell Simulator; Hosting platform: Linux/x86 (FC4). [Timeline: SDK 1.0 11/2005, SDK 1.1 7/2006, SDK 2.0 12/2006]
      • XL C/C++
        • Linux/x86, LoP
        • Overlay prototype
        • Auto-SIMD enhancements
      • Linux Kernel updates
        • Performance Enhancements
        • RAS/ Debug support
        • SPE runtime extensions
        • Interrupt controller enhancements
      • GNU Toolchain updates
        • FSF integration
        • GDB multi-thread support
        • Newlib library optimization
        • Prog model support for overlay
      • Programming Model Preview
        • Overlay support
        • Accelerated Libraries Framework
      • Library enhancements
        • Vector Math Library – Phase 1
        • MASS Library for PPU, MASSV Library for PPU/SPU
      • IDE
        • Tool integration
        • Remote tool support
      • Performance Analysis
        • Visualization tools
        • Bandwidth, Latency, Lock analyzers
        • Performance debug tools
        • Oprofile – SDK 1.1 plus PPU event based profiling
      • Mambo
        • Performance model correlation
        • Visualization
      SDK1.0.1 (2/2006 refresh) – Execution platform: Cell Simulator, Cell Blade 1 rev 2; Hosting platform: Linux/x86 (FC4), Linux/Cell (FC4)*, Linux/Power (FC4)*. SDK1.1.1 (9/2006 refresh) – Execution platform: Cell Simulator, Cell Blade 1 rev 3; Hosting platform: Linux/x86 (FC5), Linux/Cell (FC5)*, Linux/Power (FC5)*. SDK 2.0 – Execution platform: Cell Simulator, Cell Blade 1 rev 3; Hosting platform: Linux/x86 (FC5), Linux/Cell (FC5)*, Linux/Power (FC5)*.
      • Documentation
      • Mambo updates for CB1 and 64-bit hosting
      • ISO image update
      * Subset of tools
    • 36. Cell library content (source) ~ 156k loc
      • Standard SPE C library subset
        • optimized SPE C functions including stdlib, C library, math, etc.
      • Audio resample - resampling audio signals
      • FFT - 1D and 2D fft functions
      • gmath - mathematical functions optimized for the gaming environment
      • image - convolution functions
      • intrinsics - generic intrinsic conversion functions
      • large-matrix - functions performing large matrix operations
      • matrix - basic matrix operations
      • mpm - multi-precision math functions
      • noise - noise generation functions
      • oscillator - basic sound generation functions
      • sim - providing I/O channels to simulated environments
      • surface - a set of bezier curve and surface functions
      • sync - synchronization library
      • vector - vector operation functions
      http://www.alphaworks.ibm.com/tech/cellsw
    • 37. 5 Cell Applications
    • 38. Peak GFLOPS [Chart comparing peak GFLOPS of a Freescale dual-core 1.5 GHz, PPC 970 2.2 GHz, AMD dual-core 2.2 GHz, Intel single-core 3.6 GHz, and Cell 3.0 GHz]
    • 39. Cell Processor Example Application Areas
      • Cell is a processor that excels at processing of rich media content in the context of broad connectivity
        • Digital content creation (games and movies)
        • Game playing and game serving
        • Distribution of (dynamic, media rich) content
        • Imaging and image processing
        • Image analysis (e.g. video surveillance)
        • Next-generation physics-based visualization
        • Video conferencing (3D?)
        • Streaming applications (codecs etc.)
        • Physical simulation & science
    • 40. Opportunities for Cell BE Blade
      • Aerospace & Defense
        • Signal & Image Processing
        • Security, Surveillance
        • Simulation & Training, …
      • Petroleum Industry
        • Seismic computing
        • Reservoir Modeling, …
      • Communications Equipment
        • LAN/MAN Routers
        • Access
        • Converged Networks
        • Security, …
      • Medical Imaging
        • CT Scan
        • Ultrasound, …
      • Consumer / Digital Media
        • Digital Content Creation
        • Media Platform
        • Video Surveillance, …
      • Public Sector / Gov’t & Higher Educ.
        • Signal & Image Processing
        • Computational Chemistry, …
      • Finance
        • Trade modeling
      • Industrial
        • Semiconductor / LCD
        • Video Conference
      [Diagram: Cell assets mapped across the Petroleum, A&D, Communications, Industrial, Consumer, Public Sector, and Finance segments]
    • 41.
      • Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation of:
        • Protein folding and related diseases, including Alzheimer’s Disease, Huntington's Disease, and certain forms of cancer.
        • By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine.
      • Folding@Home utilizes the new Cell processor in Sony’s PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers.
        • 14,000 PlayStation 3s are outperforming 159,000 Windows computers by more than a factor of two!
        • In fact, they outperform all the other clients combined.
      http://folding.stanford.edu/FAQ-PS3.html Dr. V. S. Pande, folding@home, Distributed Computing Project, Stanford University
    • 42. Multigrid Finite Element Solver on Cell using the free SDK - ported by www.digitalmedics.de and ls7-www.cs.uni-dortmund.de. 235,584 tetrahedra, 48,000 nodes, 28 iterations in the NKMG solver in 3.8 seconds. Sustained performance for large objects: 52 GFLOP/s.
    • 43. Computational Fluid Dynamics Solver on Cell using the free SDK - ported by www.digitalmedics.de and ls7-www.cs.uni-dortmund.de. Sustained performance for large objects: not yet benchmarked (3/2007).
    • 44. Computational Fluid Dynamics Solver on Cell - a Lattice-Boltzmann solver developed by Fraunhofer ITWM (http://www.itwm.fraunhofer.de/)
    • 45. Terrain Rendering Engine (TRE) and IBM Blades [Diagram: a commodity Cell BE blade (QS20 in a BladeCenter-1 chassis) combines aircraft and field data, adds live video, aerial information, and combat situational awareness, and renders the result for a next-generation GCS]
    • 46. Example: Medical Computed Tomography (CT) Scans [Diagram: slice counts growing from 2, 4, 8, 16, 32, and 64 slices in current CT products to 128 and 256 slices in future CT products, enabling imaging of the whole heart in one rotation and 4D CT that includes time]
    • 47. “Image Registration” Using Cell [Diagram: registration process with a fixed image and a moving image; the moving image is aligned to the fixed image as the registration proceeds]
    • 48. 6 Cell Programming
    • 49. Small single-SPE models – a sample
      • /* spe_foo.c:
         *  A C program to be compiled into an executable called “spe_foo”
         */
        int main( int speid, addr64 argp, addr64 envp )
        {
            char i;

            /* do something intelligent here */
            i = func_foo ( argp );

            /* when the syscall is supported */
            printf ("Hello world! my result is %d\n", i);

            return i;
        }
    • 50.
      • extern spe_program_handle_t spe_foo;   /* the spe image handle from CESOF */

        int main()
        {
            int rc, status;
            speid_t spe_id;

            /* load & start the spe_foo program on an allocated spe */
            spe_id = spe_create_thread (0, &spe_foo, 0, NULL, -1, 0);

            /* wait for spe prog. to complete and return final status */
            rc = spe_wait (spe_id, &status, 0);

            return status;
        }
      Small single-SPE models – PPE controlling program
    • 51. Using SPEs
      • (1) Simple Function Offload
        • Remote Procedure Call Style
        • SPE working set fits in Local Store
        • PPE initiates DMA data/code transfers
        • Could be easily supported by a programming env, e.g.,
          • RPC Style IDL Compiler
          • Compiler Directives (pragmas)
          • Libraries
          • Or even automatic scheduling of code/data to SPEs
      • (2) Typical (Complex) Function Offload
        • SPE working set larger than Local Store
        • PPE initially loads SPE LS with small startup code
        • SPE initiates DMAs (code/data staging)
          • → Stream data through code
          • → Stream code through data
        • Latency hiding required in most cases
        • Requires “high locality of reference” characteristics
        • Can be extended to a “services offload model”
      [Diagrams: (1) Simple offload: the PPE puts text, static data, and parameters into the SPE Local Store via the MFC, the SPE executes, and the SPE puts results back; (2) Complex offload: the PPE puts only initial text, static data, and parameters, and the SPE independently stages text and intermediate data transfers from system memory while executing]
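      A minimal sketch of the RPC-style offload in (1), assuming the PPE passes the effective address of a 128-byte parameter block via argp (the structure and the doubling "function" are hypothetical illustrations, not SDK code): the SPE pulls the parameters into its local store, computes, and pushes the results back.

      #include <spu_mfcio.h>

      /* Hypothetical parameter block laid out by the PPE in system memory;
       * size and alignment chosen to satisfy MFC DMA constraints. */
      typedef struct {
          float in[16];
          float out[16];
      } offload_params_t __attribute__((aligned(128)));

      static offload_params_t params;

      int main(unsigned long long speid, unsigned long long argp,
               unsigned long long envp)
      {
          unsigned int tag = 1, i;

          /* DMA the parameter block from system memory into the local store. */
          mfc_get(&params, argp, sizeof(params), tag, 0, 0);
          mfc_write_tag_mask(1 << tag);
          mfc_read_tag_status_all();               /* wait for the transfer    */

          for (i = 0; i < 16; i++)                 /* the offloaded "function" */
              params.out[i] = params.in[i] * 2.0f;

          /* DMA the results back and wait before exiting. */
          mfc_put(&params, argp, sizeof(params), tag, 0, 0);
          mfc_read_tag_status_all();
          return 0;
      }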
    • 52. Using SPEs
      • (3) Pipelining for complex functions
        • Functions split up in processing stages
        • Direct LS to LS communication possible
          • Including LS to LS DMA
          • Avoid PPE / System Memory bottlenecks
      • (4) Parallel stages for very compute-intense functions
        • PPE partitions and distributes work to multiple SPEs
      [Diagrams: a multi-stage pipeline of SPEs passing data from local store to local store, and parallel stages in which the PPE distributes work from system memory to multiple SPEs]
    • 53. Large single-SPE programming models
      • Data or code working set cannot fit completely into a local store
      • The PPE controlling process, kernel, and libspe runtime set up the system memory mapping as SPE’s secondary memory store
      • The SPE program accesses the secondary memory store via its software-controlled SPE DMA engine - Memory Flow Controller (MFC)
      [Diagram: the PPE controller maps system memory for SPE DMA transactions; the SPE program moves data between its Local Store and system memory via DMA]
    • 54. Large single-SPE programming models – I/O data
      • System memory for large size input / output data
        • e.g. Streaming model
      [Diagram: large arrays int g_ip[512*1024] and int g_op[512*1024] in system memory are streamed via DMA through small local-store buffers int ip[32] and int op[32], on which the SPE program computes op = func(ip); a sketch of this pattern follows]
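      A sketch of that streaming pattern (buffer names and sizes follow the diagram; the address-passing convention via the argp/envp slots is my assumption, not SDK sample code): the SPE walks the large arrays in 32-int chunks, DMAing each chunk in, applying func, and DMAing the result out.

      #include <spu_mfcio.h>

      #define CHUNK 32                             /* ints per DMA (128 bytes) */
      #define TOTAL (512*1024)                     /* elements in g_ip / g_op  */

      static int ip[CHUNK] __attribute__((aligned(128)));
      static int op[CHUNK] __attribute__((aligned(128)));

      int main(unsigned long long speid, unsigned long long ea_ip,
               unsigned long long ea_op)
      {
          unsigned int tag = 0, i, n;

          mfc_write_tag_mask(1 << tag);
          for (n = 0; n < TOTAL; n += CHUNK) {
              mfc_get(ip, ea_ip + n * sizeof(int), sizeof(ip), tag, 0, 0);
              mfc_read_tag_status_all();           /* wait for the input chunk  */

              for (i = 0; i < CHUNK; i++)          /* op = func(ip)             */
                  op[i] = ip[i] + 1;

              mfc_put(op, ea_op + n * sizeof(int), sizeof(op), tag, 0, 0);
              mfc_read_tag_status_all();           /* wait for the output chunk */
          }
          return 0;
      }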
    • 55. Large single-SPE programming models
      • System memory as secondary memory store
        • Manual management of data buffers
        • Automatic software-managed data cache
          • Software cache framework libraries
          • Compiler runtime support
      [Diagram: global objects in system memory are mirrored as software-cache entries in the local store and accessed by the SPE program]
    • 56. Large single-SPE programming models
      • System memory as secondary memory store
        • Manual loading of plug-in into code buffer
          • Plug-in framework libraries
        • Automatic software-managed code overlay
          • Compiler generated overlaying code
      [Diagram: SPE plug-ins a–f reside in system memory; selected plug-ins (e.g. a, b, e) are loaded into a code buffer in the local store]
    • 57. Large single-SPE prog. models – Job Queue
      • Code and data packaged together as inputs to an SPE kernel program
      • A multi-tasking model – more discussion later
      [Diagram: a job queue in system memory holds code/data packages n, n+1, n+2, …; the SPE kernel DMAs code n and data n into the local store for execution]
    • 58. Large single-SPE programming models - DMA
      • DMA latency handling is critical to overall performance for SPE programs moving large data or code
      • Data pre-fetching is a key technique to hide DMA latency
        • e.g. double-buffering
      [Diagram: double-buffering timeline; while the SPE executes Func(input n) from buffer 1, the DMA engine fetches input n+1 into buffer 2 and writes output n−1 back, so transfers overlap computation]
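      A sketch of that double-buffered loop using two DMA tag groups (same assumptions as the streaming sketch above; an illustration, not SDK sample code): while the chunk in one buffer is being processed, the next input is already in flight into the other buffer and the previous output is draining.

      #include <spu_mfcio.h>

      #define CHUNK 32
      #define TOTAL (512*1024)

      static int ibuf[2][CHUNK] __attribute__((aligned(128)));
      static int obuf[2][CHUNK] __attribute__((aligned(128)));

      int main(unsigned long long speid, unsigned long long ea_ip,
               unsigned long long ea_op)
      {
          unsigned int b = 0, i, n;

          /* Prime the pipeline: start fetching the first chunk on tag 0. */
          mfc_get(ibuf[0], ea_ip, sizeof(ibuf[0]), 0, 0, 0);

          for (n = 0; n < TOTAL; n += CHUNK, b ^= 1) {
              /* Kick off the next input transfer on the other buffer/tag. */
              if (n + CHUNK < TOTAL)
                  mfc_get(ibuf[b ^ 1], ea_ip + (n + CHUNK) * sizeof(int),
                          sizeof(ibuf[0]), b ^ 1, 0, 0);

              /* Wait only for this buffer's tag: its input chunk and the
               * output written from it two iterations ago. */
              mfc_write_tag_mask(1 << b);
              mfc_read_tag_status_all();

              for (i = 0; i < CHUNK; i++)          /* compute on current chunk */
                  obuf[b][i] = ibuf[b][i] + 1;

              /* Write the result back on the same tag; completion is checked
               * the next time this buffer comes around. */
              mfc_put(obuf[b], ea_op + n * sizeof(int), sizeof(obuf[0]), b, 0, 0);
          }

          /* Drain any outstanding transfers before exiting. */
          mfc_write_tag_mask((1 << 0) | (1 << 1));
          mfc_read_tag_status_all();
          return 0;
      }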
    • 59. Large single-SPE programming models - CESOF
      • Cell Embedded SPE Object Format (CESOF) and the PPE/SPE toolchains support the resolution of SPE references to global system memory objects in the effective-address space.
      [Diagram: CESOF EAR symbol resolution; an _EAR_g_foo structure links char local_foo[512] in the local store space to char g_foo[512] in the effective address space, with DMA transactions moving the data]
    • 60. Parallel programming models – Job Queue
      • Large set of jobs fed through a group of SPE programs
      • Streaming is a special case of job queue with regular and sequential data
      • Each SPE program locks on the shared job queue to obtain next job
      • For uneven jobs, workloads are self-balanced among available SPEs
      [Diagram: the PPE feeds inputs I0…In from system memory to SPE0–SPE7, each running Kernel(), which write outputs O0…On back to system memory]
    • 61. Parallel programming models – Pipeline / Streaming
      • Use LS to LS DMA bandwidth, not system memory bandwidth
      • Flexibility in connecting pipeline functions
      • Larger collective code size per pipeline
      • Load-balance is harder
      [Diagram: pipeline of SPE0–SPE7 running Kernel0()…Kernel7(), passing data from local store to local store via DMA, with inputs and outputs in system memory]
    • 62. Multi-tasking SPEs – LS resident multi-tasking
      • Simplest multi-tasking programming model
      • No memory protection among tasks
      • Co-operative, Non-preemptive, event-driven scheduling
      [Diagram: tasks a–d and x resident in the local store of SPE n; an event dispatcher pulls events from an event queue and invokes the corresponding task]
    • 63. Multi-tasking SPEs – Self-managed multi-tasking
      • Non-LS resident
      • Blocked job context is swapped out of LS and scheduled back later to the job queue once unblocked
      [Diagram: a job/task queue in system memory; the SPE kernel loads task n (code and data) into the local store, and a blocked task n’ is swapped back out to the queue]
    • 64. libspe sample code
      • #include <libspe.h>

        int main(int argc, char *argv[], char *envp[])
        {
            spe_program_handle_t *binary;
            speid_t spe_thread;
            int status;

            binary = spe_open_image(argv[1]);
            if (!binary)
                return 1;

            spe_thread = spe_create_thread(0, binary, argv+1, envp, -1, 0);
            if (!spe_thread)
                return 2;

            spe_wait(spe_thread, &status, 0);
            spe_close_image(binary);
            return status;
        }
    • 65. libspe sample code (same listing as slide 64)
    • 66. libspe sample code (same listing as slide 64)
    • 67. Linux on Cell/B.E. kernel components
      • Platform abstraction arch/powerpc/platforms/{cell,ps3,beat}
      • Integrated Interrupt Handling
      • I/O Memory Management Unit
      • Power Management
      • Hypervisor abstractions
      • South Bridge drivers
      • SPU file system
    • 68. SPU file system
      • Virtual File System
      • /spu holds SPU contexts as directories
      • Files are primary user interfaces
      • New system calls: spu_create and spu_run
      • SPU contexts abstracted from real SPU
      • Preemptive context switching (W.I.P)
    • 69. PPE on Cell is a 100% compliant ppc64!
      • A solid base…
        • Everything in a distribution, all middleware runs out of the box
        • All tools available
        • BUT: not optimized to exploit Cell
      • Toolchain needs to cover Cell aspects
      • Optimized, critical “middleware” for Cell needed
        • Depending on workload requirements
    • 70. Using SPEs: Task Based Abstraction → APIs provided by user space libraries
      • SPE programs controlled via PPE-originated thread function calls
        • spe_create_thread(), ...
      • Calls on PPE and SPE
        • Mailboxes
        • DMA
        • Events
      • Simple runtime support (local store heap management, etc.)
      • Lots of library extensions
        • Encryption, signal processing, math operations
    • 71. spu_create
      • int spu_create(const char *pathname, int flags, mode_t mode);
        • creates a new context in pathname
        • returns an open file descriptor
        • context gets destroyed when fd is closed
    • 72. spu_run
      • uint32_t spu_run(int fd, uint32_t *npc, uint32_t *status);
        • transfers flow of control to SPU context fd
        • returns when the context has stopped for some reason, e.g.
          • exit or forceful abort
          • callback from SPU to PPU
          • can be interrupted by signals
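      Neither system call has a glibc wrapper, so user code (normally libspe itself) reaches them through syscall(). The sketch below is my illustration of the flow; it assumes the __NR_spu_create / __NR_spu_run numbers from the Cell-enabled kernel headers and omits loading the SPE program into the context:

      #include <sys/syscall.h>
      #include <unistd.h>
      #include <stdint.h>

      int main(void)
      {
          uint32_t npc = 0;        /* next program counter (SPE entry point)  */
          uint32_t status = 0;

          /* Create a new SPU context, visible as a directory under /spu. */
          int ctx = syscall(__NR_spu_create, "/spu/myctx", 0, 0755);
          if (ctx < 0)
              return 1;

          /* ... here the SPE program image would be copied into the context
           *     (libspe does this through the context's files) ...          */

          /* Transfer control to the SPU; returns when the context stops. */
          syscall(__NR_spu_run, ctx, &npc, &status);

          close(ctx);              /* closing the fd destroys the context     */
          return 0;
      }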
    • 73. PPE programming interfaces
      • Asynchronous SPE thread API (“libspe 1.x”)
      • spe_create_thread
      • spe_wait
      • spe_kill
      • . . .
    • 74. spe_create_thread implementation
      • Allocate virtual SPE (spu_create)
      • Load SPE application code into context
      • Start PPE thread using pthread_create
      • New thread calls spu_run
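      Roughly sketched (a simplified illustration of what libspe does internally, not its actual source; ELF loading, argument passing, and error handling are omitted), the four steps map onto code like this:

      #include <pthread.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      #include <stdint.h>

      struct my_spe_ctx {
          int       fd;            /* spufs context file descriptor           */
          pthread_t thread;        /* PPE thread driving the context          */
      };

      static void *spe_runner(void *arg)
      {
          struct my_spe_ctx *ctx = arg;
          uint32_t npc = 0, status = 0;

          /* Step 4: the new PPE thread calls spu_run on the context. */
          syscall(__NR_spu_run, ctx->fd, &npc, &status);
          return (void *)(uintptr_t)status;
      }

      int my_spe_create_thread(struct my_spe_ctx *ctx, const char *path)
      {
          /* Step 1: allocate a virtual SPE. */
          ctx->fd = syscall(__NR_spu_create, path, 0, 0755);
          if (ctx->fd < 0)
              return -1;

          /* Step 2: load the SPE application code into the context
           *         (ELF parsing and copying omitted here). */

          /* Step 3: start a PPE thread that will run the context. */
          return pthread_create(&ctx->thread, NULL, spe_runner, ctx);
      }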
    • 75. More libspe interfaces
      • Event notification
        • int spe_get_event(struct spe_event *, int nevents, int timeout);
      • Message passing
        • spe_read_out_mbox(speid_t speid);
        • spe_write_in_mbox(speid_t speid);
        • spe_write_signal(speid_t speid, unsigned reg, unsigned data);
      • Local store access
        • void *spe_get_ls(speid_t speid);
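      A small usage sketch of the message-passing calls (my illustration, assuming an SPE thread already created with spe_create_thread as in the earlier samples, and the spe_stat_out_mbox polling helper from libspe 1.x): the PPE writes a word into the SPE's inbound mailbox and waits for the reply in the outbound mailbox, while the SPE side uses the matching channel intrinsics from spu_mfcio.h.

      /* PPE side (links against libspe 1.x) */
      #include <libspe.h>

      unsigned int ping(speid_t spe)
      {
          spe_write_in_mbox(spe, 42);              /* send a word to the SPE   */
          while (spe_stat_out_mbox(spe) == 0)      /* poll until a reply waits */
              ;
          return spe_read_out_mbox(spe);
      }

      /* SPE side */
      #include <spu_mfcio.h>

      int main(unsigned long long speid, unsigned long long argp,
               unsigned long long envp)
      {
          unsigned int v = spu_read_in_mbox();     /* blocks until PPE writes  */
          spu_write_out_mbox(v + 1);               /* reply via outbound mbox  */
          return 0;
      }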
    • 76. GNU tool chain
      • PPE support
        • Just another PowerPC variant. . .
      • SPE support
        • Just another embedded processor. . .
      • Cell/B.E. support
        • More than just PPE + SPE!
    • 77. Object file format
      • PPE: regular ppc/ppc64 ELF binaries
      • SPE: new ELF flavour EM_SPU
        • 32-bit big-endian
        • No shared libraries
        • Manipulated via cross-binutils
        • New: Code overlay support
      • Cell/B.E.: combined object files
        • embedspu: link into one binary
        • .rodata.spuelf section in PPE object
        • CESOF: SPE → PPE symbol references
    • 78. gcc on the PPE
      • handled by “rs6000” back end
      • Processor-specific tuning
      • pipeline description
    • 79. gcc on the SPE
      • Merged Jan 3rd
      • Built as cross-compiler
      • Handles vector data types, intrinsics
      • Middle-end support: branch hints, aggressive if-conversion
      • GCC 4.1 port exploiting auto-vectorization
      • No Java
    • 80. Existing proprietary applications
      • Games
      • Volume rendering
      • Real-time Raytracing
      • Digital Video
      • Monte Carlo simulation
    • 81. Obviously missing
      • ffmpeg, mplayer, VLC
      • VDR, mythTV
      • Xorg acceleration
      • OpenSSL
      • Your project here !!!
    • 82. Questions! Thank you very much for your attention.
    • 83. 7 Appendix
    • 84. Documentation (new or recently updated)
      • Cell Broadband Engine
        • Cell Broadband Engine Architecture V1.0
        • Cell Broadband Engine Programming Handbook V1.0
        • Cell Broadband Engine Registers V1.3
        • SPU C/C++ Language Extensions V2.1
        • Synergistic Processor Unit (SPU) Instruction Set Architecture V1.1
        • SPU Application Binary Interface Specification V1.4
        • SPU Assembly Language Specification V1.3
      • Cell Broadband Engine Programming using the SDK
        • Cell Broadband Engine SDK Installation and User's Guide V1.1
        • Cell Broadband Engine Programming Tutorial V1.1
        • Cell Broadband Engine Linux Reference Implementation ABI V1.0
        • SPE Runtime Management library documentation V1.1
        • SDK Sample Library documentation V1.1
        • IDL compiler documentation V1.1
      • New developerWorks Articles
        • Maximizing the power of the Cell Broadband Engine processor
        • Debugging Cell Broadband Engine systems
    • 85. Documentation (new or recently updated)
      • IBM Cell Broadband Engine Full-System Simulator
        • IBM Full-System Simulator Users Guide
        • IBM Full-System Simulator Command Reference
        • Performance Analysis with the IBM Full-System Simulator
        • IBM Full-System Simulator BogusNet HowTo
      • PowerPC Architecture Book
        • Book I: PowerPC User Instruction Set Architecture Version 2.02
        • Book II: PowerPC Virtual Environment Architecture Version 2.02
        • Book III: PowerPC Operating Environment Architecture Version 2.02
        • Vector/SIMD Multimedia Extension Technology Programming Environments Manual Version 2.06c
    • 86. Links
      • Cell Broadband Engine http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
      • IBM BladeCenter QS20 http://www-03.ibm.com/technology/splash/qs20/
      • Cell Broadband Engine resource center http://www-128.ibm.com/developerworks/power/cell/
      • Cell Broadband Engine resource center - Documentation archive http://www-128.ibm.com/developerworks/power/cell/docs_documentation.html
      • Cell Broadband Engine technology http://www.alphaworks.ibm.com/topics/cell
      • Power.org's Cell Developers Corner http://www.power.org/resources/devcorner/cellcorner
      • Barcelona Supercomputer Center - Linux on Cell http://www.bsc.es/projects/deepcomputing/linuxoncell/
      • Barcelona Supercomputer Center - Documentation http://www.bsc.es/plantillaH.php?cat_id=262
      • Heiko J Schick's Cell Bookmarks http://del.icio.us/schihei/Cell
