Introduction to the Cell Processor
Why Cell Processor
  • Performance improvement with increase in frequency
    • Possible due to increase in transistor density
    • Clock frequency is the timing reference for a processor
  • Power density
    • Leakage currents increase as transistors shrink and transistor density rises
    • Leakage increases the idle power consumption
History of Cell Processor
  • A powerful processor for the successor of the PS2
  • Powerful multimedia and broadband network interface
  • IBM's contribution in shaping the concept of the Cell processor
  • Collaboration with Toshiba: the STI (Sony, Toshiba, IBM) alliance
History of Cell Processor
  • Development of Cell
    • 1999: Sony proposed a partnership with IBM for the successor of the PS2
    • 2001: STI alliance initiated the development of Cell
    • 2004: first prototype of Cell
    • 2005: Sony unveiled the PS3 at E3
    • 2006: official release of the PS3; Cell SDK released by IBM
    • 2008: IBM Roadrunner became the fastest supercomputer in the world (1.026 PFLOPS)
Overview of Cell
  (Figure; source: Matthew Scarpino, Design and Animation / Game Programming / Graphics Programming)
Overview of Cell
  (Figure; source: 6.189 IAP 2007, MIT)
Cell components
  • Memory Interface Controller (MIC)
  • Bus Interface Controller (BIC)
  • PowerPC Processor Element/Unit (PPE/PPU)
  • Synergistic Processing Element/Unit (SPE/SPU)
  • Element Interconnect Bus (EIB)
  • Input/Output Interface (IOIF)
Cell components: MIC
  • Connects the processor with system memory
  • Two channels to system memory
  • Extreme Data Rate Dynamic Random Access Memory (XDR DRAM)
    • Can support 8 data transfers per clock cycle
    • Provides high data throughput at a low clock frequency
  • PS3 contains 256 MB of XDR DRAM
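The "8 transfers per clock" property can be turned into a peak-bandwidth estimate. The sketch below is only arithmetic; the figures used in the usage note (64 data pins across the two channels and a 400 MHz interface clock) are assumptions that are not stated on the slide, and `peak_bandwidth_bytes` is a hypothetical helper name.

```c
#include <stdint.h>

/* Peak bandwidth in bytes per second for a memory interface.
 * data_pins:           number of data wires across all channels
 * clock_hz:            interface clock frequency in Hz
 * transfers_per_clock: transfers per pin per clock cycle (8 for XDR) */
uint64_t peak_bandwidth_bytes(uint64_t data_pins, uint64_t clock_hz,
                              uint64_t transfers_per_clock)
{
    uint64_t bits_per_second = data_pins * clock_hz * transfers_per_clock;
    return bits_per_second / 8;   /* 8 bits per byte */
}
```

With the assumed values, `peak_bandwidth_bytes(64, 400000000, 8)` gives 25,600,000,000 bytes per second, which shows how a modest clock frequency can still yield a high data rate.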
Cell components: PPU
  • Based on the IBM PowerPC architecture
  • RISC architecture
  • Cell control center
    • Runs the operating system
    • Manages interrupts
    • Manages the shared L2 cache
    • Issues work to the SPUs
Cell components: PPU
  (Figure; source: Matthew Scarpino)
Cell components: PPU
  • 64-bit architecture
  • Supports SIMD
  • Supports Cell-related functions
  • Dual-threaded processor
  • Computation power is reduced
    • The PPU is not the main computational element in Cell
    • Reduces power consumption
Cell components: Functional units of PPU
  (Figure; source: Matthew Scarpino)
Cell components: Functional units of PPU
  • Instruction Unit (IU)
    • Fetches, decodes, and issues instructions
  • Load and Store Unit (LSU)
    • Receives memory access requests
  • Vector/Scalar Unit (VSU)
    • Contains the Floating Point Unit
    • Performs FP operations on individual (scalar) or multiple (vector) operands
Cell components: Functional units of PPU
  • Fixed Point Unit (FXU)
    • Performs fixed-point operations
    • Arithmetic and logical operations
  • Memory Management Unit (MMU)
    • Performs virtual memory management
  • PPU registers
    • Provide quick access to operands
    • Some functional units can access only processor registers
Cell components: PPU registers
  • 32 general purpose registers
  • 32 floating point registers
  • Link register
    • Holds the address of an upcoming branch target
  • Count register
    • Holds the address of an upcoming branch target, or
    • Holds a loop counter
  • Fixed point exception register
    • Holds carry and overflow bits for fixed-point operations
Cell components: PPU registers
  • Condition register
    • Holds the status of an arithmetic, logical, or comparison operation
  • Floating point status and control register
    • Status of scalar FP operations
  • Vector registers
    • Contain data for vector operations
  • Vector status and control register
    • Holds the saturation bit for vector operations
  • Vector register save and restore register
    • Marks which vector registers must be saved on a context switch
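The saturation bit is easiest to see on a single lane. The following is a minimal scalar sketch of a saturating signed 16-bit add, assuming the flag behaves as a sticky bit that is set when a result is clamped; `add_sat_s16` is a hypothetical helper for illustration and is not the vector instruction itself.

```c
#include <stdint.h>
#include <stdbool.h>

/* Adds two signed 16-bit values, clamping the result to the int16_t range.
 * *sat is set to true when clamping happens and is never cleared here,
 * which mimics a sticky saturation flag. */
int16_t add_sat_s16(int16_t a, int16_t b, bool *sat)
{
    int32_t wide = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (wide > INT16_MAX) { *sat = true; return INT16_MAX; }
    if (wide < INT16_MIN) { *sat = true; return INT16_MIN; }
    return (int16_t)wide;
}
```

A vector unit would apply the same rule to every lane of a 128-bit register at once and record in one status bit whether any lane saturated.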
Cell components: SPU
  • Basic workhorse of Cell
  • Designed to execute SIMD instructions
  • Separate instruction set
  • Takes work from the PPU
  • Does not have any cache
  • No virtual memory
  • Each SPU has only 256 KB of local memory
Cell components: SPU
  • An SPU can directly access only its own 256 KB memory
  • Direct Memory Access (DMA) is required to transfer data to the SPU
  • Memory alignment is required to pass data to the SPU
  • Several methods to communicate with the PPU and main memory
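The alignment requirement usually comes down to two small checks: is an address on the required boundary, and how much padding does a buffer need. The helpers below are a sketch; the 16-byte and 128-byte constants are assumptions (quadword alignment as the minimum for larger transfers, 128 bytes as a commonly recommended boundary), and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define QUADWORD_BYTES   16u    /* assumed minimum alignment for larger DMA */
#define CACHE_LINE_BYTES 128u   /* assumed preferred alignment */

/* True when addr is a multiple of alignment (alignment must be a power of 2). */
bool is_aligned(uint64_t addr, uint64_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}

/* Rounds n up to the next multiple of alignment (a power of 2). */
uint64_t align_up(uint64_t n, uint64_t alignment)
{
    return (n + alignment - 1) & ~(alignment - 1);
}
```

For example, a 100-byte payload would be padded to `align_up(100, QUADWORD_BYTES)`, that is 112 bytes, before being handed to a transfer. With GCC, buffers are often declared with `__attribute__((aligned(128)))` so that the address check passes by construction.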
Cell components: SPU
  (Figure; source: Matthew Scarpino)
Cell components: Purpose of SPU
  • Load 128-bit data into a local register
  • Apply an operation to it
  • Save the result to local memory
  • Two distinct pipelines
    • Even pipeline handles mathematical operations
    • Odd pipeline handles everything else
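The load, operate, store pattern on 128-bit data can be emulated in portable C by treating a register as four 32-bit float lanes. This is only an illustrative model: `vec4f` and `vec4f_add` are hypothetical names, and real SPU code would typically use the compiler's vector types and intrinsics (for example `spu_add`) instead of a loop.

```c
/* A 128-bit register modeled as four 32-bit float lanes. */
typedef struct {
    float lane[4];
} vec4f;

/* Lane-wise addition: one "instruction" produces four results. */
vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}
```

Adding `{1, 2, 3, 4}` and `{10, 20, 30, 40}` yields `{11, 22, 33, 44}`; on the SPU the four additions happen in a single operation rather than in a loop.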
Cell components: SPU functional units
  • SPU Control Unit (SCN)
    • Fetches and dispatches instructions
    • Performs branching and other control operations
  • SPU even fixed point unit
    • Handles logic/arithmetic operations
    • Performs comparisons and reciprocals for FP
  • SPU odd fixed point unit
    • Performs bit-level shifts, rotations, and shuffling
Cell components: SPU functional units
  • SPU floating point unit
    • Performs floating point operations
  • SPU load/store unit
    • Performs loads and stores
    • Manages branch targets and DMA to the Local Store
  • SPU channel and DMA unit
    • Communicates with the Memory Flow Controller
    • Controls DMA transfers
Cell components: SPU registers and Local Store
  • SPU registers
    • 128 general purpose registers
    • Floating point status and control registers
      • Contain status and results of floating point operations
  • SPU Local Store (LS)
    • Each SPU contains a very low latency 256 KB memory
    • It plays the role of a cache for the SPU, but it is managed by software
    • All data transfer is the responsibility of the programmer
Cell components: SPU Local Store (LS)
  • Not a cache, just an SRAM
  • Only one read/write operation per cycle
  • Operations accessing the LS
    • DMA
      • Transfers data from main memory to the LS
    • SPU load/store
      • Reads/writes 16 bytes at a time
    • Instruction fetch
      • Reads 128 bytes of the LS at once
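Because SPU loads and stores move 16 bytes at a time, an arbitrary byte address has to be mapped to the quadword that contains it. The sketch below assumes that addresses wrap at the 256 KB Local Store size and that the low four address bits are ignored by a quadword access; both are assumptions about the hardware behavior rather than statements from the slide, and `ls_quadword_address` is a hypothetical helper.

```c
#include <stdint.h>

#define LS_SIZE_BYTES 0x40000u   /* 256 KB Local Store */

/* Returns the start address of the 16-byte quadword that a load or store
 * at addr would touch, assuming wrap-around at the Local Store size. */
uint32_t ls_quadword_address(uint32_t addr)
{
    uint32_t wrapped = addr & (LS_SIZE_BYTES - 1u);   /* stay inside the LS */
    return wrapped & ~0xFu;                           /* drop the low 4 bits */
}
```

Under these assumptions, a scalar stored at `0x1234` lives in the quadword starting at `0x1230`, which is why accessing a single scalar costs a full 16-byte access plus extra work to extract the value.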
Cell components: SPU Local Store (LS)
  • Does not support virtual memory
  • Tradeoff between cache coherence and explicitly fetching data into the LS
    • The LS is a low latency memory
    • Other processors rely on cache coherence protocols
    • Data is transferred to the LS over the high throughput EIB via DMA instead of through cache coherence protocols
    • Keeps the hardware simple
Cell components: Communication between the SPU and the rest of the system
  • DMA
  • Mailboxes
  • Events and signals
Cell components: DMA
  • Transfers data to the LS
  • Asynchronous in nature
    • The SPU continues its operation while the DMA transfer is in progress
  • Transfers data in chunks of 1, 2, 4, 8, or a multiple of 16 bytes
  • Provides controls to manage and synchronize the data transfer
  • One DMA transfer can move at most 16 KB
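The 16 KB limit means a larger buffer has to be split into several requests. The sketch below validates a request size and computes the pieces of a larger transfer. It assumes the size rule documented for the Memory Flow Controller (1, 2, 4, 8, or a multiple of 16 bytes, up to 16 KB); the function names are hypothetical, and on real hardware each piece would be issued with a call such as `mfc_get` from `spu_mfcio.h`.

```c
#include <stddef.h>
#include <stdbool.h>

#define DMA_MAX_BYTES 16384u   /* 16 KB per DMA request */

/* True when n is an allowed size for a single DMA request (assumed rule). */
bool dma_size_ok(size_t n)
{
    if (n == 1 || n == 2 || n == 4 || n == 8) {
        return true;
    }
    return n != 0 && n % 16 == 0 && n <= DMA_MAX_BYTES;
}

/* Number of requests needed to move total bytes. */
size_t dma_chunk_count(size_t total)
{
    return (total + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}

/* Size of request number index (0-based); 0 when index is past the end. */
size_t dma_chunk_size(size_t total, size_t index)
{
    size_t offset = index * DMA_MAX_BYTES;
    if (offset >= total) {
        return 0;
    }
    size_t remaining = total - offset;
    return remaining < DMA_MAX_BYTES ? remaining : DMA_MAX_BYTES;
}
```

A 40,000-byte buffer, for instance, splits into two full 16 KB requests and a final request of 7,232 bytes, which is itself a multiple of 16 and therefore a valid size.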
Cell components: DMA
  (Figure; source: Matthew Scarpino)
Cell components: EIB
  • Connects all the system components
  • Consists of four data rings (two clockwise and two counter-clockwise)
  • One ring is for control signals
  • One bus cycle can transfer 16 bytes of data
  • Each ring can carry three DMA transfers simultaneously
  • Each DMA transfer takes at least 8 cycles to complete
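The cycle counts follow from the 16 bytes moved per bus cycle. The helper below is simple arithmetic; the 128-byte packet size used in the usage note is an assumption that would account for the 8-cycle minimum, and `eib_data_cycles` is a hypothetical name.

```c
#include <stdint.h>

#define EIB_BYTES_PER_CYCLE 16u

/* Bus cycles needed to move the given number of bytes over one ring,
 * counting only data cycles (no arbitration or setup overhead). */
uint32_t eib_data_cycles(uint32_t bytes)
{
    return (bytes + EIB_BYTES_PER_CYCLE - 1u) / EIB_BYTES_PER_CYCLE;
}
```

`eib_data_cycles(128)` is 8, and a full 16 KB DMA transfer needs 1,024 data cycles on a single ring.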
Cell components: MFC (Memory Flow Controller)
  • Coprocessor that handles communication between the SPU and the EIB
  • Processes data transfers without interrupting the SPU
    • The SPU requests the MFC to get the data
    • The MFC handles the rest of the data transfer
Cell components: Mailboxes
  • Simplest way to transfer data between the PPU and an SPU
  • Can transfer only 4 bytes of data at a time
  • Provide one-to-one communication
  • Mailbox channels
    • Outgoing mailbox
    • Outgoing interrupt mailbox
      • Holds data for the outside world and causes an interrupt if applicable
    • Incoming mailbox
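A mailbox can be pictured as a very small FIFO of 32-bit words. The model below is a software sketch only: the depth is a parameter (a depth of 1 for outgoing and 4 for incoming mailboxes is an assumption), the type and function names are hypothetical, and where this model returns `false` the real hardware would make the caller wait. SPU code would normally use calls such as `spu_write_out_mbox` and `spu_read_in_mbox` from `spu_mfcio.h`.

```c
#include <stdint.h>
#include <stdbool.h>

#define MBOX_MAX_DEPTH 4u

typedef struct {
    uint32_t slot[MBOX_MAX_DEPTH];
    unsigned depth;   /* capacity in entries, at most MBOX_MAX_DEPTH */
    unsigned head;    /* index of the oldest entry */
    unsigned count;   /* number of entries currently held */
} mailbox;

void mbox_init(mailbox *m, unsigned depth)
{
    m->depth = depth > MBOX_MAX_DEPTH ? MBOX_MAX_DEPTH : depth;
    m->head = 0;
    m->count = 0;
}

/* Queues one 4-byte message; returns false when the mailbox is full. */
bool mbox_write(mailbox *m, uint32_t value)
{
    if (m->count == m->depth) {
        return false;
    }
    m->slot[(m->head + m->count) % m->depth] = value;
    m->count++;
    return true;
}

/* Removes the oldest message; returns false when the mailbox is empty. */
bool mbox_read(mailbox *m, uint32_t *out)
{
    if (m->count == 0) {
        return false;
    }
    *out = m->slot[m->head];
    m->head = (m->head + 1) % m->depth;
    m->count--;
    return true;
}
```

With a depth of 1, a second write is refused until the reader has drained the first message, which is why mailboxes suit short status codes and handshakes rather than bulk data.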
Cell components: Events and signals
  • Commonly used for DMA notifications
  • Signals can be sent directly to the outside world
  • Signals can provide one-to-many style communication
Software development of Cell
  • Different instruction sets for the SPU and the PPU
  • Different compilers are required to compile the two kinds of code
  • The SPU code is embedded in the PPU executable
Software development of Cell
  • Tools to compile an application for Cell
    • PPU compiler: ppu-gcc
    • SPU compiler: spu-gcc
    • Embed SPU code into the PPU executable: ppu-embedspu
Software development of Cell
  • Cell simulator
    • Full System Simulator
    • Emulates all system components
    • Can provide cycle-accurate information
    • Provides a graphical interface to see and interact with system components
Software development of Cell
  (Figure; source: IBM Full System Simulator user guide)
Software development of Cell
  • Three modes
    • Fast mode
    • Simple mode
    • Cycle mode
  • Graphical visualization of the SPUs and the PPU
  • Provides debugging and profiling information
  • Provides system utilization information
Software development of Cell
Software development of Cell
  (Figure; source: Matthew Scarpino)
