The Cell Processor: Computing of tomorrow or yesterday? Open Systems Design and Development 2007-04-12 | Heiko J Schick <schickhj@de.ibm.com> © 2007 IBM Corporation
Agenda Introduction Limiters to Processor Performance Cell Architecture Cell Platform Cell Applications Cell Programming Appendix
1 Introduction
Cell History IBM, SCEI / Sony and Toshiba Alliance formed in 2000  Design Center opened in March 2001  (Based in Austin, Texas) Single Cell BE operational Spring 2004  2-way SMP operational Summer 2004  February 7, 2005: First technical disclosures  November 9, 2005: Open Source SDK Published
The problem is… … the view from the computer room!
Outlook Source: Kurzweil. “Computer performance has been increasing exponentially for 100 years!”
But what could you do if all  objects  were   intelligent… … and connected?
What could you do with  unlimited computing power…   for pennies? Could you predict the path of a storm  down to the square kilometer? Could you identify another 20% of proven oil reserves without drilling one hole?
2 Limiters to Processor  Performance
Power Wall / Voltage Wall Power components: Active Power Passive Power Gate leakage Sub-threshold leakage (source-drain leakage) Source: Tom’s Hardware Guide 1
Memory Wall Main memory now nearly 1000 cycles from the processor  Situation worse with (on-chip) SMP  Memory latency penalties drive inefficiency in the design  Expensive and sophisticated hardware to try and deal with it  Programmers that try to gain control of cache content are hindered by the hardware mechanisms  Latency induced bandwidth limitations  Much of the bandwidth to memory in systems can only be used speculatively 2
Frequency Wall Increasing frequencies and deeper pipelines have reached diminishing returns on performance  Returns are negative if power is taken into account  Results of studies depend on the issue width of the processor  The wider the processor, the slower it wants to be  Simultaneous Multithreading helps to use issue slots efficiently  Results depend on the number of architected registers and the workload  More registers tolerate a deeper pipeline  Fewer random branches in the application tolerate deeper pipelines 3
Microprocessor Efficiency Gelsinger’s law  1.4x more performance for 2x more   Hofstee’s corollary   1/1.4x efficiency loss in every generation  Examples: Cache size, Out-of-Order, Super-scalar, etc. Source: Tom’s Hardware Guide Increasing performance requires increasing efficiency !!!
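Taken at face value (the slide leaves the "2x more" resource unnamed; transistors or power are the usual readings), this means per-resource efficiency shrinks by a factor of 1.4 / 2 = 0.7 each generation, so after five such generations only about 0.7^5 ≈ 17% of the original efficiency remains. That is the arithmetic behind the slide's conclusion: further performance gains must come from better efficiency, not from spending ever more transistors and power on caches, out-of-order execution and wider superscalar issue.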
Attacking the Performance Walls Multi-Core Non-Homogeneous Architecture  Control Plane vs. Data Plane processors  Attacks Power Wall  3-level Model of Memory  Main Memory, Local Store, Registers  Attacks Memory Wall  Large Shared Register File & SW Controlled Branching  Allows deeper pipelines (11FO4 helps power)  Attacks Frequency Wall
3 Cell Architecture
Cell BE Processor ~250M transistors ~235mm² Top frequency >3GHz 9 cores, 10 threads 200+ GFlops (SP) @3.2 GHz 20+ GFlops (DP) @3.2 GHz Up to 25.6GB/s memory B/W Up to 76.8GB/s I/O B/W ~$400M (US) design investment
Key Attributes of Cell Cell is Multi-Core  Contains 64-bit Power Architecture TM  Contains 8 Synergistic Processor Elements (SPE)  Cell is a Flexible Architecture  Multi-OS support (including Linux) with Virtualization technology  Path for OS, legacy apps, and software development  Cell is a Broadband Architecture  SPE is RISC architecture with SIMD organization and Local Store  128+ concurrent transactions to memory per processor  Cell is a Real-Time Architecture  Resource allocation (for Bandwidth Measurement)  Locking Caches (via Replacement Management Tables)  Cell is a Security Enabled Architecture  SPE dynamically reconfigurable as secure processors
Power Processor Element (PPE) 64-bit Power Architecture™ with VMX In-order, 2-way hardware Multi-threading Coherent Load/Store with 32KB I & D L1 and 512KB L2 Controls the SPEs
Synergistic Processor Elements (SPEs) SPE provides computational performance Dual issue, up to 16-way 128-bit SIMD Dedicated resources: 128-entry 128-bit register file, 256KB Local Store Each can be dynamically configured to protect resources Dedicated DMA engine: Up to 16 outstanding requests Memory Flow Controller for DMA 25 GB/s DMA data transfer “I/O Channels” for IPC Separate cores Simple implementation (e.g. no branch prediction) No caches No protected instructions
SPE Block Diagram: Permute Unit, Load-Store Unit, Floating-Point Unit, Fixed-Point Unit, Branch Unit and Channel Unit with result forwarding and staging; Register File; Local Store (256kB single-port SRAM, 128B read / 128B write); DMA Unit; Instruction Issue Unit / Instruction Line Buffer. Internal data paths of 8, 16, 64 and 128 Bytes/cycle; connected to the on-chip coherent bus.
Element Interconnect Bus Four 16 byte data rings, supporting multiple transfers 96B/cycle peak bandwidth Over 100 outstanding requests 300+ GByte/sec @ 3.2 GHz Element Interconnect Bus (EIB)
Element Interconnect Bus (EIB) Four 16B data rings connecting 12 bus elements (SPE0–SPE7, PPE, MIC, BIF/IOIF0, IOIF1)  Two clockwise / two counter-clockwise  Physically overlaps all processor elements  Central arbiter supports up to three concurrent transfers per data ring  Two-stage, dual round-robin arbiter  Each element port simultaneously supports a 16B-in and 16B-out data path  Ring topology is transparent to the element data interface
Example of eight concurrent transactions on the EIB: the central data arbiter controls four rings (Ring0–Ring3); ramp controllers (Ramp 0–Ramp 11) attach MIC, PPE, SPE0–SPE7 and the BIF/IOIF0 and IOIF1 interfaces to the rings.
I/O and Memory Interfaces provide wide bandwidth: Dual XDR™ controller (25.6GB/s @ 3.2Gbps)  Two configurable interfaces (76.8GB/s @ 6.4Gbps)  Configurable number of bytes  Coherent or I/O protection  Allows for multiple system configurations
4 Cell Platform
Game console systems, blades, HDTV, home media servers, supercomputers… ? The Cell processor can support many systems. Configuration diagram: a single Cell BE processor with dual XDR™ memory and IOIF0/IOIF1; two Cell BE processors coupled over the BIF, each with dual XDR™ memory and an IOIF; and four Cell BE processors, each with dual XDR™ memory and an IOIF, connected over the BIF through a switch (SW).
QS20 Hardware Description
Chassis: standard IBM BladeCenter with 7 blades (2 slots each) at full performance, 2 switches (1Gb Ethernet) with 4 external ports each, updated Management Module firmware, and external InfiniBand switches with optional FC ports.
Blade (400 GFLOPs): game processor and support logic in a dual-processor configuration, single SMP OS image, 1GB XDRAM, optionally a PCI Express attached standard graphics adapter.
BladeCenter interface (based on IBM JS20): new blade power system and sense logic control, firmware to connect processor & support logic to the H8 service processor, signal level converters for processor & support logic, 2 InfiniBand (IB) host adapters with 2x IB 4X each, physical link drivers (GbE Phy etc.), chassis connection 2x (+12V, RS-485, USB, GbEn).
Block diagram: two Cell BE processors, each with Rambus XDR DRAM (1/2GB) and a South Bridge, IB 4X links, GbE Phy, H8 service processor, blade input power & sense, and level converters on the BladeCenter interface.
QS20 Blade (w/o heatsinks)
QS20 Blade Assembly ATA Disk Service Proc.  South Bridges InfiniBand Cards Blade Bezel
Design Options - InfiniBand: Up to 2 InfiniBand cards can be attached. Standard PC InfiniBand card with special bezel (MHEA28-1TCSB dual-port HCA): PCI Express x8 interface, dual 10 Gb/s InfiniBand 4X ports, 128 MB local memory, IBTA v1.1 compatible.
Cell Software Stack (bottom to top): Cell Broadband Engine hardware; firmware (SLOF, low-level FW, RTAS, secondary boot loader); Linux kernel with powerpc architecture-dependent code (pSeries, PMac and cell platforms), powerpc- and cell-specific code, and common code (memory management, scheduler, device drivers); user space with glibc and gcc (ppc64 and spu backends); applications on top.
Cell BE Development Platform
Developer workstation stack: Cell BE hardware with graphics and standard devices; Cell BE firmware; Cell Linux kernel; lower-level programming interface; basic Cell runtime (lib_spe, spelibc, …) and basic Cell toolchain (gcc, binutils, gdb, oprofile, …) for Cell enablement; Cell-optimized libraries, Cell-specialized compilers, higher-level programming interface and Cell-aware tooling for Cell exploitation; application-level programming interface and (segment-specific) application frameworks, all on a standard ppc64 Linux development environment.
Cell is an exotic platform and hard to program. Exploiting the SPEs is challenging: limited local memory (256 KB) means data and code fragments must be DMAed back and forth; multi-level parallelism (8 SPEs, 128-bit wide SIMD units in each SPE). If done right, the result is impressive performance…
Make Cell easier to program: hide complexity in critical libraries; compiler support for standard tasks, e.g. overlays, global data access, SW-managed cache, auto-vectorization, auto-parallelization, …; smart tooling.
Make Cell a standard platform: middleware and frameworks provide architecture-specific components and hide Cell specifics from the application developer.
SDK 1.0 (11/2005): alpha-quality SDK hosted on FC4 / x86: initial Linux Cell 2.6.14 patches, SPE Threads runtime, XLC Cell C compiler, SPE gdb debugger, Cell coding sample source, documentation, installation scripts, Cell hardware specs, programming docs; GCC tools from SCEA (gcc 3.0 for Cell, binutils for Cell). Execution platform: Cell Simulator. Hosting platform: Linux/x86 (FC4).
SDK 1.0.1 (2/2006, refresh): execution platform: Cell Simulator, Cell Blade 1 rev 2. Hosting platform: Linux/x86 (FC4), Linux/Cell (FC4)*, Linux/Power (FC4)*.
SDK 1.1 (7/2006): alpha-quality SDK hosted on FC5 / x86: critical Linux Cell performance enhancements, Cell enhanced functions, critical Cell RAS functions (machine check, system error), performance analysis tools, Oprofile (PPU cycle-only profiling, no SPU), GNU toolchain updates, Mambo updates, Julia set sample. Execution platform: Cell Simulator, Cell Blade 1 rev 3. Hosting platform: Linux/x86 (FC5), Linux/Cell (FC5)*, Linux/Power (FC5)*.
SDK 1.1.1 (9/2006, refresh): documentation, Mambo updates for CB1 and 64-bit hosting, ISO image update. Same execution and hosting platforms as SDK 1.1.
SDK 2.0 (12/2006): XL C/C++ (Linux/x86, LoP) with overlay prototype and auto-SIMD enhancements; Linux kernel updates (performance enhancements, RAS/debug support, SPE runtime extensions, interrupt controller enhancements); GNU toolchain updates (FSF integration, GDB multi-thread support, newlib library optimization, programming model support for overlay); programming model preview (overlay support, Accelerated Libraries Framework); library enhancements (Vector Math Library phase 1, MASS library for PPU, MASSV library for PPU/SPU); IDE (tool integration, remote tool support); performance analysis (visualization tools; bandwidth, latency and lock analyzers; performance debug tools; Oprofile as in SDK 1.1 plus PPU event-based profiling; Mambo performance model correlation and visualization).
* Subset of tools
Cell library content (source), ~156k LOC: Standard SPE C library subset - optimized SPE C functions including stdlib, the C library, math, etc.; audio resample - resampling audio signals; FFT - 1D and 2D FFT functions; gmath - math functions optimized for gaming environments; image - convolution functions; intrinsics - generic intrinsic conversion functions; large-matrix - functions performing large matrix operations; matrix - basic matrix operations; mpm - multi-precision math functions; noise - noise generation functions; oscillator - basic sound generation functions; sim - I/O channels to simulated environments; surface - a set of Bézier curve and surface functions; sync - synchronization library; vector - vector operation functions. http://www.alphaworks.ibm.com/tech/cellsw
5 Cell Applications
Peak GFLOPs comparison (chart): Freescale DC 1.5 GHz, PPC 970 2.2 GHz, AMD DC 2.2 GHz, Intel SC 3.6 GHz, Cell 3.0 GHz
Cell Processor Example Application Areas Cell is a processor that excels at processing of rich media content in the context of broad connectivity Digital content creation (games and movies)  Game playing and game serving  Distribution of (dynamic, media rich) content  Imaging and image processing  Image analysis (e.g. video surveillance)  Next-generation physics-based visualization  Video conferencing (3D?)  Streaming applications (codecs etc.)  Physical simulation & science
Opportunities for the Cell BE Blade  Aerospace & Defense: signal & image processing, security, surveillance, simulation & training, …  Petroleum industry: seismic computing, reservoir modeling, …  Communications equipment: LAN/MAN routers, access, converged networks, security, …  Medical imaging: CT scan, ultrasound, …  Consumer / digital media: digital content creation, media platform, video surveillance, …  Public sector / government & higher education: signal & image processing, computational chemistry, …  Finance: trade modeling  Industrial: semiconductor / LCD, video conferencing
Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation of protein folding and related diseases, including Alzheimer's Disease, Huntington's Disease, and certain forms of cancer. By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine. Folding@Home utilizes the new Cell processor in Sony's PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers. 14,000 PlayStation 3s are literally outperforming 159,000 Windows computers by more than double! In fact, they outperform all the other clients combined. http://folding.stanford.edu/FAQ-PS3.html Dr. V. S. Pande, folding@home, Distributed Computing Project, Stanford University
Multigrid Finite Element Solver on Cell, ported by www.digitalmedics.de and ls7-www.cs.uni-dortmund.de using the free SDK. 235,584 tetrahedra, 48,000 nodes, 28 iterations in the NKMG solver in 3.8 seconds. Sustained performance for large objects: 52 GFLOP/s.
Computational Fluid Dynamics Solver on Cell, ported by www.digitalmedics.de and ls7-www.cs.uni-dortmund.de using the free SDK. Sustained performance for large objects: not yet benchmarked (3/2007).
Computational Fluid Dynamics Solver on Cell: a Lattice-Boltzmann solver developed by Fraunhofer ITWM http://www.itwm.fraunhofer.de/
Terrain Rendering Engine (TRE) and IBM Blades, Systems and Technology Group: a commodity Cell BE blade (QS20 in a BladeCenter-1 chassis) combines aircraft and field data and renders it, adding live video, aerial information and combat situational awareness for the next-generation GCS.
Example: Medical Computer Tomography (CT) scans. Image the whole heart in 1 rotation; 4D CT includes time. Current CT products: 2, 4, 8, 16, 32, 64 slices. Future CT products: 128, 256 slices.
“Image Registration” using Cell: the moving image is aligned to the fixed image as the registration process proceeds (fixed image + moving image -> registration process).
6 Cell Programming
Small single-SPE models - a sample

#include <stdio.h>    /* for printf */

/* spe_foo.c:
 * A C program to be compiled into an executable called "spe_foo"
 */
int main(int speid, addr64 argp, addr64 envp)
{
    char i;

    /* do something intelligent here */
    i = func_foo(argp);

    /* when the syscall is supported */
    printf("Hello world! my result is %d \n", i);

    return i;
}
Small single-SPE models - PPE controlling program

extern spe_program_handle_t spe_foo;    /* the SPE image handle from CESOF */

int main()
{
    int rc, status;
    speid_t spe_id;

    /* load & start the spe_foo program on an allocated SPE */
    spe_id = spe_create_thread(0, &spe_foo, 0, NULL, -1, 0);

    /* wait for the SPE program to complete and return the final status */
    rc = spe_wait(spe_id, &status, 0);

    return status;
}
Using SPEs
(1) Simple Function Offload: Remote Procedure Call style; the SPE working set fits in the Local Store; the PPE initiates DMA data/code transfers. Could easily be supported by a programming environment, e.g. an RPC-style IDL compiler, compiler directives (pragmas), libraries, or even automatic scheduling of code/data to SPEs. (Diagram: the PPE puts text, static data and parameters into the SPE Local Store; the SPE executes and puts results back.)
(2) Typical (Complex) Function Offload: the SPE working set is larger than the Local Store; the PPE initially loads the SPE LS with small startup code; the SPE initiates DMAs (code/data staging) to stream data through code or stream code through data. Latency hiding is required in most cases and "high locality of reference" characteristics are needed. Can be extended to a “services offload model”. (Diagram: the PPE puts initial text, static data and parameters; the SPE independently stages text & intermediate data transfers from system memory while executing and puts results back.)
Using SPEs
(3) Pipelining for complex functions: functions are split up into processing stages; direct LS-to-LS communication is possible, including LS-to-LS DMA; avoids PPE / system memory bottlenecks. (Diagram: a multi-stage pipeline of SPUs, each with Local Store and MFC, fed by the PPE from system memory.)
(4) Parallel stages for very compute-intense functions: the PPE partitions and distributes work to multiple SPEs. (Diagram: parallel stages of SPUs, each with Local Store and MFC.)
Large single-SPE programming models: the data or code working set cannot fit completely into a local store. The PPE controlling process, kernel and libspe runtime set up the system memory mapping as the SPE's secondary memory store. The SPE program accesses the secondary memory store via its software-controlled SPE DMA engine, the Memory Flow Controller (MFC). (Diagram: the PPE controller maps system memory for SPE DMA transactions between system memory and the Local Store.)
Large single-SPE programming models - I/O data: system memory holds the large input/output data, e.g. in a streaming model. (Diagram: system memory holds int g_ip[512*1024] and int g_op[512*1024]; the local store holds int ip[32] and int op[32]; the SPE program computes op = func(ip), DMAing each input chunk in and each output chunk out.)
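To make the streaming model concrete, here is a minimal SPE-side sketch using the MFC intrinsics from spu_mfcio.h. The chunk size, the func() body, and passing the g_ip/g_op effective addresses in via argp and envp are illustrative assumptions, not the slide's actual code.

#include <spu_mfcio.h>

#define CHUNK 32                      /* ints per DMA chunk, matching the slide's ip[32]/op[32] */

volatile int ip[CHUNK] __attribute__((aligned(128)));    /* local-store input buffer  */
volatile int op[CHUNK] __attribute__((aligned(128)));    /* local-store output buffer */

/* stand-in for the real per-chunk computation */
static void func(volatile int *out, volatile int *in, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = 2 * in[i];
}

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    unsigned long long g_ip = argp;   /* EA of g_ip[] in system memory (assumed passed via argp) */
    unsigned long long g_op = envp;   /* EA of g_op[] in system memory (assumed passed via envp) */
    unsigned int i, tag = 1;
    unsigned int chunks = (512 * 1024) / CHUNK;

    mfc_write_tag_mask(1 << tag);
    for (i = 0; i < chunks; i++) {
        /* DMA one input chunk from system memory into the local store and wait for it */
        mfc_get(ip, g_ip + (unsigned long long)i * sizeof(ip), sizeof(ip), tag, 0, 0);
        mfc_read_tag_status_all();

        func(op, ip, CHUNK);          /* op = func(ip) */

        /* DMA the result chunk back out to system memory and wait for completion */
        mfc_put(op, g_op + (unsigned long long)i * sizeof(op), sizeof(op), tag, 0, 0);
        mfc_read_tag_status_all();
    }
    return 0;
}

This synchronous version stalls on every transfer; the double-buffering sketch further below overlaps the DMAs with computation.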
Large single-SPE programming models: system memory as secondary memory store. Either manual management of data buffers, or an automatic software-managed data cache (software cache framework libraries, compiler runtime support). (Diagram: global objects in system memory are mirrored as SW cache entries in the local store next to the SPE program.)
Large single-SPE programming models: system memory as secondary memory store. Either manual loading of plug-ins into a code buffer (plug-in framework libraries), or automatic software-managed code overlay (compiler-generated overlay code). (Diagram: SPE plug-ins a–f reside in system memory; a subset, e.g. plug-ins a, b and e, is loaded into the local store at a time.)
Large single-SPE programming models - Job Queue: code and data are packaged together as inputs to an SPE kernel program; a multi-tasking model (more discussion later). (Diagram: a job queue in system memory holds code/data n, n+1, n+2, …; the SPE kernel DMAs code n and data n into the local store.)
Large single-SPE programming models - DMA: DMA latency handling is critical to overall performance for SPE programs moving large data or code. Data pre-fetching is a key technique to hide DMA latency, e.g. double buffering. (Timeline: while the SPE executes Func(input n) out of input buffer 1, the DMA engine fetches input n+1 into input buffer 2 and writes output n-1 from output buffer 2; the buffers then swap for the next iteration.)
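A hedged sketch of that double-buffering pattern, again with the spu_mfcio.h intrinsics: two buffers and two DMA tag groups let the fetch of chunk n+1 and the write-back of chunk n-1 overlap the computation on chunk n. The buffer size, the compute() body, and the effective-address layout are assumptions for illustration.

#include <spu_mfcio.h>

#define N 4096                                         /* bytes per chunk (illustrative) */

volatile char in_buf[2][N]  __attribute__((aligned(128)));
volatile char out_buf[2][N] __attribute__((aligned(128)));

/* stand-in for the real per-chunk work */
static void compute(volatile char *dst, volatile char *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[i] ^ 0xff;
}

/* wait until all DMA commands issued with this tag have completed */
static void wait_tag(unsigned int tag)
{
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}

void process(unsigned long long ea_in, unsigned long long ea_out, int nchunks)
{
    int i, cur = 0, nxt = 1;

    /* prime the pipeline: start fetching chunk 0 */
    mfc_get(in_buf[cur], ea_in, N, cur, 0, 0);

    for (i = 0; i < nchunks; i++) {
        /* start fetching the next chunk before blocking on the current one */
        if (i + 1 < nchunks)
            mfc_get(in_buf[nxt], ea_in + (unsigned long long)(i + 1) * N, N, nxt, 0, 0);

        wait_tag(cur);                                 /* current input (and the put issued two
                                                          iterations ago on this tag) are done */
        compute(out_buf[cur], in_buf[cur], N);

        /* write the result back; completion is checked the next time this tag is waited on */
        mfc_put(out_buf[cur], ea_out + (unsigned long long)i * N, N, cur, 0, 0);

        cur ^= 1;                                      /* swap buffers and tags */
        nxt ^= 1;
    }
    wait_tag(0);                                       /* drain any outstanding transfers */
    wait_tag(1);
}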
Large single-SPE programming models - CESOF: the Cell Embedded SPE Object Format (CESOF) and the PPE/SPE toolchains support the resolution of SPE references to global system memory objects in the effective-address space. (Diagram: an _EAR_g_foo structure in the local store is resolved by CESOF EAR symbol resolution to char g_foo[512] in the effective address space; DMA transactions copy it to char local_foo[512] in the local store.)
Parallel programming models - Job Queue: a large set of jobs is fed through a group of SPE programs. Streaming is a special case of the job queue with regular and sequential data. Each SPE program locks the shared job queue to obtain the next job. For uneven jobs, workloads are self-balanced among the available SPEs. (Diagram: the PPE and the SPE0–SPE7 kernels pull inputs I0…In from and write outputs O0…On to queues in system memory.)
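A sketch of the SPE-side worker loop for this job-queue model. The atomic "take the next job" step is reduced to a hypothetical helper, atomic_take_job(), standing in for whatever primitive the application actually uses (e.g. the SDK sync library or MFC lock-line commands); the job descriptor layout and the kernel() body are likewise illustrative.

#include <spu_mfcio.h>

struct job {                          /* illustrative job descriptor, padded to 32 bytes for DMA */
    unsigned long long ea_in;         /* EA of this job's input data                             */
    unsigned long long ea_out;        /* EA of this job's output buffer                          */
    unsigned int       size;          /* bytes to process (assumed DMA-legal: multiple of 16, <= 16KB) */
    unsigned int       pad[3];
};

/* Hypothetical helper: atomically pops the next job index from the shared queue in
 * system memory and returns it, or -1 when the queue is empty.                      */
extern int atomic_take_job(unsigned long long ea_queue);

/* stand-in for the real per-job work, done in place in the local-store buffer */
static void kernel(volatile char *buf, unsigned int size)
{
    unsigned int i;
    for (i = 0; i < size; i++)
        buf[i] += 1;
}

volatile struct job j __attribute__((aligned(16)));          /* job array in system memory
                                                                assumed 16-byte aligned      */
volatile char buf[16 * 1024] __attribute__((aligned(128)));

void worker(unsigned long long ea_queue, unsigned long long ea_jobs)
{
    int idx;
    unsigned int tag = 0;

    mfc_write_tag_mask(1 << tag);
    while ((idx = atomic_take_job(ea_queue)) >= 0) {
        /* fetch the job descriptor, then its input data */
        mfc_get(&j, ea_jobs + (unsigned long long)idx * sizeof(struct job),
                sizeof(struct job), tag, 0, 0);
        mfc_read_tag_status_all();

        mfc_get(buf, j.ea_in, j.size, tag, 0, 0);
        mfc_read_tag_status_all();

        kernel(buf, j.size);

        mfc_put(buf, j.ea_out, j.size, tag, 0, 0);
        mfc_read_tag_status_all();
    }
}

Because every SPE runs the same loop and only takes work when it is free, long and short jobs balance themselves across the available SPEs, as the slide notes.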
Parallel programming models - Pipeline / Streaming: uses LS-to-LS DMA bandwidth rather than system memory bandwidth; flexibility in connecting pipeline functions; larger collective code size per pipeline; load balancing is harder. (Diagram: SPE0 Kernel0() through SPE7 Kernel7() are chained by LS-to-LS DMA; only the first and last stages touch the input/output queues in system memory.)
Multi-tasking SPEs - LS-resident multi-tasking: the simplest multi-tasking programming model; no memory protection among tasks; co-operative, non-preemptive, event-driven scheduling. (Diagram: tasks a–d and x reside in the local store of SPE n together with an event dispatcher that pulls task ids from an event queue.)
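A minimal sketch of such a co-operative, event-driven dispatcher in plain C. The task bodies, the event-queue layout, and how the queue gets filled (PPE mailbox, DMA'd descriptors, ...) are all assumptions for illustration; the essential properties are that everything lives in the local store and that each task runs to completion before the next event is dispatched.

#define N_TASKS   5
#define QUEUE_LEN 16

typedef void (*task_fn)(void);

/* stand-ins for the LS-resident task bodies */
static void task_a(void) { /* ... */ }
static void task_b(void) { /* ... */ }
static void task_c(void) { /* ... */ }
static void task_d(void) { /* ... */ }
static void task_x(void) { /* ... */ }

static task_fn tasks[N_TASKS] = { task_a, task_b, task_c, task_d, task_x };

/* event queue: each entry names the task to run next; filled asynchronously,
 * e.g. by a mailbox handler or by descriptors DMAed in from the PPE          */
static volatile unsigned char event_queue[QUEUE_LEN];
static volatile int head, tail;

void dispatcher(void)
{
    unsigned char id;

    for (;;) {
        while (head == tail)
            ;                                  /* idle until an event arrives */
        id = event_queue[head];
        head = (head + 1) % QUEUE_LEN;
        tasks[id]();                           /* run the task to completion: no preemption,
                                                  no memory protection between tasks        */
    }
}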
Multi-tasking SPEs - Self-managed multi-tasking: non-LS-resident; a blocked job context is swapped out of the LS and scheduled back to the job queue later, once unblocked. (Diagram: tasks n, n+1, n+2, … wait in a job/task queue in system memory; the SPE kernel holds code n and data n in the local store and swaps blocked task n' back out.)
libspe sample code

#include <libspe.h>

int main(int argc, char *argv[], char *envp[])
{
    spe_program_handle_t *binary;
    speid_t spe_thread;
    int status;

    binary = spe_open_image(argv[1]);
    if (!binary)
        return 1;

    spe_thread = spe_create_thread(0, binary, argv + 1, envp, -1, 0);
    if (!spe_thread)
        return 2;

    spe_wait(spe_thread, &status, 0);
    spe_close_image(binary);

    return status;
}
Linux on Cell/B.E. kernel components Platform abstraction arch/powerpc/platforms/{cell,ps3,beat} Integrated Interrupt Handling I/O Memory Management Unit Power Management Hypervisor abstractions South Bridge drivers SPU file system
SPU file system: a virtual file system mounted at /spu holds SPU contexts as directories; files are the primary user interfaces. New system calls: spu_create and spu_run. SPU contexts are abstracted from real SPUs. Preemptive context switching (work in progress).
PPE on Cell is a 100% compliant ppc64! A solid base… Everything in a distribution, all middleware runs out of the box All tools available BUT: not optimized to exploit Cell Toolchain needs to cover Cell aspects Optimized, critical “middleware” for Cell needed Depending on workload requirements
Using SPEs: Task Based Abstraction    APIs provided by user space libraries SPE programs controlled via PPE-originated thread function calls spe_create_thread(), ... Calls on PPE and SPE Mailboxes DMA Events Simple runtime support (local store heap management, etc.) Lots of library extensions Encryption, signal processing, math operations
spu_create int spu_create(const char *pathname, int flags, mode_t mode); creates a new context at pathname returns an open file descriptor the context gets destroyed when the fd is closed
spu_run uint32_t spu_run(int fd, uint32_t *npc, uint32_t *status); transfers flow of control to the SPU context fd returns when the context has stopped for some reason, e.g. exit, forceful abort, or a callback from the SPU to the PPU can be interrupted by signals
PPE programming interfaces Asynchronous SPE thread API (“libspe 1.x”) spe_create_thread spe_wait spe_kill . . .
spe_create_thread implementation Allocate a virtual SPE (spu_create) Load the SPE application code into the context Start a PPE thread using pthread_create The new thread calls spu_run
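A conceptual sketch of those four steps. It is not libspe's actual implementation: the spufs syscalls have no glibc wrappers, so they are invoked through syscall() (assuming a powerpc kernel that defines __NR_spu_create and __NR_spu_run), and loading the SPE ELF image into the context is reduced to a hypothetical load_spe_image() helper.

#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <asm/unistd.h>        /* __NR_spu_create / __NR_spu_run, powerpc only (assumption) */

struct spe_ctx {
    int fd;                    /* open fd of the spufs context directory         */
    unsigned int npc;          /* next program counter: entry point of the image */
};

/* Hypothetical helper: writes the SPE ELF image into the context's files
 * (local store, registers, ...) and returns the entry point.              */
extern unsigned int load_spe_image(int ctx_fd, const void *spe_elf_image);

static void *spe_runner(void *arg)
{
    struct spe_ctx *ctx = arg;
    unsigned int status = 0;

    /* spu_run blocks this PPE thread until the SPU context stops */
    syscall(__NR_spu_run, ctx->fd, &ctx->npc, &status);
    return (void *)(long)status;
}

int my_spe_create_thread(const char *ctx_path, const void *image,
                         pthread_t *thread, struct spe_ctx *ctx)
{
    /* 1. allocate a virtual SPE by creating a spufs context */
    ctx->fd = syscall(__NR_spu_create, ctx_path, 0, 0755);
    if (ctx->fd < 0)
        return -1;

    /* 2. load the SPE application code into the context */
    ctx->npc = load_spe_image(ctx->fd, image);

    /* 3. + 4. start a PPE thread whose body calls spu_run */
    return pthread_create(thread, NULL, spe_runner, ctx);
}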
More libspe interfaces Event notification: int spe_get_event(struct spe_event *, int nevents, int timeout); Message passing: spe_read_out_mbox(speid_t speid); spe_write_in_mbox(speid_t speid); spe_write_signal(speid_t speid, unsigned reg, unsigned data); Local store access: void *spe_get_ls(speid_t speid);
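A small hedged example of the mailbox calls on the PPE side of the libspe 1.x API. It assumes a CESOF-embedded SPE program (handle hello_spu) that reads one value from its inbound mailbox, processes it, and writes a reply to its outbound mailbox; spe_stat_out_mbox() is used here to poll for that reply.

#include <libspe.h>
#include <stdio.h>

extern spe_program_handle_t hello_spu;     /* assumed CESOF-embedded SPE image */

int main(void)
{
    speid_t spe;
    int status;

    spe = spe_create_thread(0, &hello_spu, NULL, NULL, -1, 0);
    if (!spe)
        return 1;

    /* pass a parameter to the SPE through its inbound mailbox */
    spe_write_in_mbox(spe, 42);

    /* poll until the SPE has written its reply to the outbound mailbox */
    while (spe_stat_out_mbox(spe) <= 0)
        ;
    printf("SPE answered: %u\n", spe_read_out_mbox(spe));

    spe_wait(spe, &status, 0);
    return status;
}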
GNU tool chain PPE support Just another PowerPC variant. . . SPE support Just another embedded processor. . . Cell/B.E. support More than just PPE + SPE!
Object file format PPE: regular ppc/ppc64 ELF binaries SPE: new ELF flavour EM_SPU, 32-bit big-endian, no shared libraries, manipulated via cross-binutils; new: code overlay support Cell/B.E.: combined object files embedspu: link into one binary, .rodata.spuelf section in the PPE object CESOF: SPE->PPE symbol references
gcc on the PPE handled by “rs6000” back end Processor-specific tuning pipeline description
gcc on the SPE Merged Jan 3rd Built as cross-compiler Handles vector data types, intrinsics Middle-end support: branch hints, aggressive if-conversion GCC 4.1 port exploiting auto-vectorization No Java
Existing proprietary applications Games Volume rendering Real-time Raytracing Digital Video Monte Carlo simulation
Obviously missing ffmpeg, mplayer, VLC VDR, mythTV Xorg acceleration OpenSSL Your project here !!!
Questions! Thank you very much for your attention.
7 Appendix
Documentation  (new or recently updated) Cell Broadband Engine  Cell Broadband Engine Architecture V1.0  Cell Broadband Engine Programming Handbook V1.0  Cell Broadband Engine Registers V1.3 SPU C/C++ Language Extensions V2.1  Synergistic Processor Unit (SPU) Instruction Set Architecture V1.1  SPU Application Binary Interface Specification V1.4  SPU Assembly Language Specification V1.3  Cell Broadband Engine Programming using the SDK  Cell Broadband Engine SDK Installation and User's Guide V1.1   Cell Broadband Engine Programming Tutorial V1.1  Cell Broadband Engine Linux Reference Implementation ABI V1.0  SPE Runtime Management library documentation V1.1  SDK Sample Library documentation V1.1  IDL compiler documentation V1.1 New developerWorks Articles Maximizing the power of the Cell Broadband Engine processor Debugging Cell Broadband Engine systems
Documentation  (new or recently updated) IBM Cell Broadband Engine Full-System Simulator IBM Full-System Simulator Users Guide IBM Full-System Simulator Command Reference Performance Analysis with the IBM Full-System Simulator IBM Full-System Simulator BogusNet HowTo PowerPC Architecture Book Book I: PowerPC User Instruction Set Architecture Version 2.02 Book II: PowerPC Virtual Environment Architecture Version 2.02 Book III: PowerPC Operating Environment Architecture Version 2.02 Vector/SIMD Multimedia Extension Technology Programming Environments Manual Version 2.06c
Links Cell Broadband Engine http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine IBM BladeCenter QS20 http://www-03.ibm.com/technology/splash/qs20/ Cell Broadband Engine resource center http://www-128.ibm.com/developerworks/power/cell/ Cell Broadband Engine resource center - Documentation archive http://www-128.ibm.com/developerworks/power/cell/docs_documentation.html Cell Broadband Engine technology http://www.alphaworks.ibm.com/topics/cell Power.org's Cell Developers Corner http://www.power.org/resources/devcorner/cellcorner Barcelona Supercomputer Center - Linux on Cell http://www.bsc.es/projects/deepcomputing/linuxoncell/ Barcelona Supercomputer Center - Documentation http://www.bsc.es/plantillaH.php?cat_id=262 Heiko J Schick's Cell Bookmarks http://del.icio.us/schihei/Cell

Editor's Notes

  • #20 VMX AltiVec SIMD instructions on IBM PowerPC processors Less speculative logic
  • #23 VMX AltiVec SIMD instructions on IBM PowerPC processors Less speculative logic
  • #29 The switch is not yet available
  • #42 Dr. V. S. Pande, Distributed Computing Project, Stanford University (permission given for showing the video as well). Folding@Home on the PS3: the Cure@PS3 project. INTRODUCTION: Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation. By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine. FAH has targeted the study of protein folding and protein folding disease, and numerous scientific advances have come from the project. Now in 2006, we are looking forward to another major advance in capabilities. This advance utilizes the new Cell processor in Sony's PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers. With this new technology (as well as new advances with GPUs), we will likely be able to attain performance on the 100 gigaflop scale per computer. With about 10,000 such machines, we would be able to achieve performance on the petaflop scale. With software from Sony, the PlayStation 3 will now be able to contribute to the Folding@Home project, pushing Folding@Home a major step forward. Our goal is to apply this new technology to push Folding@Home into a new level of capabilities, applying our simulations to further study of protein folding and related diseases, including Alzheimer's Disease, Huntington's Disease, and certain forms of cancer. With these computational advances, coupled with new simulation methodologies to harness the new techniques, we will be able to address questions previously considered impossible to tackle computationally, and make even greater impacts on our knowledge of folding and folding-related diseases. ADVANCED FEATURES FOR THE PS3: The PS3 client will also support some advanced visualization features. While the Cell microprocessor does most of the calculation processing of the simulation, the graphics chip of the PLAYSTATION 3 system (the RSX) displays the actual folding process in real time using new technologies such as HDR and ISO-surface rendering. It is possible to navigate the 3D space of the molecule using the interactive controller of the PS3, allowing us to look at the protein from different angles in real time. For a preview of a prototype of the GUI for the PS3 client, check out a screenshot or one of these videos (355K avi, 866K avi, 6MB avi, 6MB avi -- more videos and formats to come). There is also a "bootleg" video of Sony's presentation on FAH that is now on YouTube (although the audio and video quality is pretty bad). http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats
  • #46 Cell Blade systems compute and compress images. These images are then delivered via the network to clients for decompression and display. The GPStream framework can be used to deliver the images to mobile clients via wireless. This is really an example of situational awareness. In this specific case, the Predator Unmanned Aerial Vehicle has a small camera mounted in the nose (the blue circle would be live video), and the surroundings would be rendered for the remote pilot to help them avoid turning into a mountain or a no-fly zone. We think this is also valid for commercial aircraft at night, in poor weather, etc.
  • #47 An experienced physician can read a great deal from cross-sectional images, but three-dimensional, dynamic images, i.e. including the factor time, open up entirely new diagnostic possibilities. Medical imaging is another area that is progressing rapidly and creating a new, more demanding workload. Today an average exam generates 1 GByte of data; you can't go to the future, adding time-dependent analysis, without an application-optimized system. An average exam generates 1 GByte of data (for one digital X-ray or simple CT scan - much more for complicated CT or MRI studies). We estimate that 10^2-10^4 floating point operations are used to capture, process and analyze a byte of medical data, so a typical exam requires 10^11-10^13 operations. Assume an exam must be completed in "real time" (5 minutes?) to be of diagnostic use; this requires 0.3-33 GF/s of compute power - delivered today by single-processor Intel workstations. Scanner technology will rapidly evolve to generate 10-20x the amount of data in the same scan time. Sixteen-slice CT scanner: 600-2000 slices per exam -> 300 MB - 1 GB per exam. CT scan workflow - typical helical-scan multi-slice acquisition: Stage 1: interpolate data to generate equivalent "step-and-shoot" slices. Stage 2: filtered back-projection to generate the 2D slice view (Fourier filter + numerical integration). Stage 3: volume rendering (optional - many radiologists prefer to look at slices, but with increasing resolution/slice count it may become mandatory). Note (1): Stage 2 should be trivially parallelizable (scale out). Note (2): an increase in the number of slices acquired simultaneously means increased computational cost for "cone-effect" corrections. Note (3): there are claims that improved algorithms can reduce the computational burden enormously (UIUC Technology Licensing Office). Example: 313 MB of raw scan data -> 5 x 1 MB images (cross-sections?). Each image takes 19 seconds to process on a 3 GHz Wintel box. A high-resolution 3000-slice run (from machines like the new Siemens Somatom 64) might take ~16 hours to process on such a commodity system. Note that the 3 GB of 2D image data can be accommodated within main memory. PV-4D (www.pv-4d.com), showcased at Supercomputing 2005 / CeBIT 2006: about 4 times faster than Opteron with the same algorithm; if fully optimized, projected to be > 6 times faster than Opteron. Last-minute prototype running on four Cell blades; stereo display using shutter glasses at 8-10 frames per second - achieving this frame rate using two blades at a time, four blades required for the data set size. Data sets about 1.6 GB in size: beating heart (400x400x400 voxels, 6 samples), CFD simulation (~600x200x100 voxels, 40 samples).
  • #50 Handling large data Handling large code SIMD aspect?
  • #51 Q: What are the parameters to spe_create_thread…
  • #54 Handling large data Handling large code SIMD aspect?
  • #60 Handling large data Handling large code SIMD aspect?
  • #70 Middleware / libraries likely to be optimized: media, e.g. mplayer; encryption, e.g. OpenSSH. PPE = Power Processor Element