Broadband Processor Architecture
© 2005 IBM Corporation
A novel SIMD architecture for the Cell
heterogeneous chip-multiprocessor
Michael Gschwind, Peter Hofstee,
Brian Flachs, Martin Hopkins,
Yukio Watanabe, Takeshi Yamazaki
Broadband Processor Architecture
© 2005 IBM Corporation
2
Acknowledgements
 Cell is the result of a partnership between
SCEI/Sony, Toshiba, and IBM
 Cell represents the work of more than 400 people
starting in 2000 and a design investment of about
$400M
Broadband Processor Architecture
© 2005 IBM Corporation
3
Cell Design Goals
 Provide the platform for the future of computing
– 10× performance of desktop systems shipping in 2005
– Ability to reach 1 TF with a 4-node Cell system
 Computing density as main challenge
– Dramatically increase performance per X
• X = Area, Power, Volume, Cost,…
 Single core designs offer diminishing returns on investment
– In power, area, design complexity and verification cost
 Exploit application parallelism to provide a quantum leap in
performance
Broadband Processor Architecture
© 2005 IBM Corporation
4
Shifting the Balance of Power with Cell
 Today’s architectures are built on a 40 year old data model
– Efficiency as defined in 1964
– Big overhead per data operation
– Data parallelism added as an after-thought
 Cell provides parallelism at all levels of system
abstraction
– Thread-level parallelism  multi-core design approach
– Instruction-level parallelism  statically scheduled & power
aware
– Data parallelism  data-paralleI instructions
 Data processor instead of control system
Broadband Processor Architecture
© 2005 IBM Corporation
5
Cell Features
 Heterogeneous multi-
core system architecture
– Power Processor Element
for control tasks
– Synergistic Processor
Elements for data-intensive
processing
 Synergistic Processor
Element (SPE) consists
of
– Synergistic Processor Unit
(SPU)
– Synergistic Memory Flow
Control (SMF)
• Data movement and
synchronization
• Interface to high-
performance Element
Interconnect Bus
16B/cycle (2x)
16B/cycle
BIC
FlexIOTM
MIC
Dual
XDRTM
16B/cycle
EIB (up to 96B/cycle)
16B/cycle
64-bit Power Architecture with VMX
PPE
SPE
LS
SXU
SPU
SMF
PXU
L1
PPU
16B/cycle
L2
32B/cycle
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
LS
SXU
SPU
SMF
Broadband Processor Architecture
© 2005 IBM Corporation
6
Powering Cell – the Synergistic Processor Unit
VRF
VRF
PERM
PERM LSU
LSU
V
V F
F P
P U
U V
V F
F X
X U
U
Local Store
Local Store
Single Port
Single Port
SRAM
SRAM
Issue /
Issue /
Branch
Branch
Fetch
ILB
16B x 2
2 instructions
16B x 3 x 2
64B
64B
16B
16B
128B
128B
Broadband Processor Architecture
© 2005 IBM Corporation
7
Density Computing in SPEs
 Today, execution units only fraction of core area and power
– Bigger fraction goes to other functions
• Address translation and privilege levels
• Instruction reordering
• Register renaming
• Cache hierarchy
 Cell changes this ratio to increase performance per area
and power
– Architectural focus on data processing
• Wide datapaths
• More and wide architectural registers
• Data privatization and single level processor-local store
• All code executes in a single (user) privilege level
• Static scheduling
Broadband Processor Architecture
© 2005 IBM Corporation
8
Streamlined Architecture
 Architectural focus on simplicity
– Aids achievable operating frequency
– Optimize circuits for common performance case
– Compiler aids in layering traditional hardware functions
– Leverage 20 years of architecture research
 Focus on statically scheduled data parallelism
– Focus on data parallel instructions
• No separate scalar execution units
• Scalar operations mapped onto data parallel dataflow
– Exploit wide data paths
• data processing
• instruction fetch
– Address impediments to static scheduling
• Large register set
• Reduce latencies by eliminating non-essential functionality
Broadband Processor Architecture
© 2005 IBM Corporation
9
data
parallel
select
static
prediction
simpler
µarch
sequential
fetch
local
store
high
frequency
shorter
pipeline
large
basic
blocks
determ.
latency
large
register
file
ILP
opt.
instruction
bundling
DLP
wide
data
paths
compute
density
Synergistic Processing
single
port
static
scheduling
scalar
layering
Broadband Processor Architecture
© 2005 IBM Corporation
10
Architected Registers have Lagged Behind Latency
0
5
10
15
20
25
30
35
40
45
50
1989 1993 1997 2001
_
_
(2FPU)
high-freq core
(1FPU)
extrapolated
core generation
register
pressure
(2
unique
operands
per
FMA)
FMA latency* FMAs L1 latency* FMAs L2 latency* FMAs mem latency* FMAs
32 registers
since RISC
introduced
Broadband Processor Architecture
© 2005 IBM Corporation
11
Restoring the Architectural Register Balance
 Unified 128 entry, 128b wide register file
– Used as 128x1, 64x2, 32x4, 16x8, 8x16, 1x128
– Emphasis on 32x4 integer and FP arithmetic
– Register file also stores addresses and condition values
 Unified register file gives programmers more flexibility in
resource allocation
– Stores all data types
• address, integer, SP/DP FP, Boolean
– Stores scalar and vector SIMD data
 Large register file facilitates static scheduling and loop
optimization for latency hiding
Broadband Processor Architecture
© 2005 IBM Corporation
12
Pervasive Data Parallel Computing (PDPC)
 Scalar processing supported on data-parallel
substrate
– All instructions are data parallel and operate on vectors
of elements
– Scalar operation defined by instruction use, not opcode
• Vector instruction form used to perform operation
 Preferred slot paradigm
– Scalar arguments to instructions found in “preferred slot”
– Computation can be performed in any slot
Broadband Processor Architecture
© 2005 IBM Corporation
13
Register Scalar Data Layout
 Preferred slot in bytes 0-3
– By convention for procedure interfaces
– Used by instructions expecting scalar data
• Addresses, branch conditions, generate controls for insert
Broadband Processor Architecture
© 2005 IBM Corporation
14
PDPC Memory Access Architecture
 Data memory interface optimized for quadword access
– Simplified alignment network
– Always access 128 bits of data
 Processing scalar data
– Extracted using rotate
– Stored with read-modify-write sequence
 Four address formats
– register + displacement, register-indexed
– PC-relative, absolute address
 Local Store Limit Register (LSLR)
– Address mask register to define maximum address range
– Addresses wrap at LSLR value
Broadband Processor Architecture
© 2005 IBM Corporation
15
PDPC Branch Architecture
 Register-indirect target address in preferred slot
of general purpose register
– Indirect branch (function return)
– Indirect prepare-to-branch
 Link address deposited in preferred slot of
specified general purpose register
– Register 0 ($ra) used as return register by software
convention
Broadband Processor Architecture
© 2005 IBM Corporation
16
A Statically Scheduled PDPC Architecture
 Bundle concept
– Sequential semantics
– Static scheduling key ingredient to great performance
– Strong performance bias to properly aligned instructions
• Align instructions to map on simplified issue routing logic
• Maximize fetch group utilization
 Branch and control architecture
– Static branch prediction with Prepare-to-branch instruction
– Simplify implementations, improve utility of sequential fetch
 Explicit inline instruction fetch
– To avoid fetch starvation during bursts of high-priority load/store traffic
– Compiler manages 256KB unified, single ported memory
Broadband Processor Architecture
© 2005 IBM Corporation
17
Data Selection Architecture
 Branch is scalar, select is data parallel
– Branch inherently inefficient
 No exception  safe to compute both paths through
conditional assignment
 Use data parallel compute flow
– Data parallel if-conversion
– Select instruction independently selects result for each slot
– Compare instructions generate type-specific mask fields
Broadband Processor Architecture
© 2005 IBM Corporation
18
Data Parallel Select Operation
for (i =0; i< VL; i++)
if (a[i]>b[i])
m[i] = a[i]*2;
else
m[i] = b[i]*3;
a[0]>b[0]
m[0]=a[0]*2; m[0]=b[0]*3;
a[1]>b[1]
m[1]=a[1]*2; m[1]=b[1]*3;
a[2]>b[2]
m[2]=a[2]*2; m[2]=b[2]*3;
a[3]>b[3]
m[3]=a[3]*2; m[3]=b[3]*3;
a’[0] =a[0]*2; b’[0]=b[0]*3;
s[0]=a[0]>b[0]
m[0]=s[0]?
a’[0]:b’[0]
a’[1]=a[1]*2; b’[1]=b[1]*3;
s[1]=a[1]>b[1]
m[1]=s[1]?
a’[1]:b’[1]
a’[2]=a[2]*2; b’[2]=b[2]*3;
s[2]=a[2]>b[2]
m[2]=s[2]?
a’[2]:b’[2]
a’[3]=a[3]*2; b’[3]=b[3]*3;
s[3]=a[3]>b[3]
m[3]=s[3]?
a’[3]:b’[3]
Exploit data parallelism
Long
latency
Broadband Processor Architecture
© 2005 IBM Corporation
19
Compare and Data Selection Architecture
 Only limited number of compares
– Compare for equality
• CEQ
– Compare for ordering
• CGT, CGTL
– Other conditions can be derived
• operand order, inverted condition
 Generates mask of data type width for use with select
– Data-parallel conditional operation
Broadband Processor Architecture
© 2005 IBM Corporation
20
Integer Data Processing for Advanced Media Computing
 Ensure data fidelity
– Avoid accumulated saturation error in content creation
– Emphasis on data quality first, quantity second
– Concentrate on 16b and 32b integer data
– Concentrate on modulo arithmetic
• Consistent with common C/C++/Java semantics
 Selected byte operations to maximize performance on key
operations
– Encryption/decryption performance
• Security in electronic transactions will be key to enable future
business models
– Motion estimation
• Video encoding
Broadband Processor Architecture
© 2005 IBM Corporation
21
Reconciling Data Fidelity and Compact Data Representation
 Use wide data types for intermediate results in computation
– Avoid saturation of intermediate results
– Prevent loss of dynamic range
 Pack-and-saturate operation to store data in dense memory
format
– Keep data footprint in memory compact
– Pack and saturate once for bulk data storage in memory
 Saturating computation only important for low-cost content
rendering devices
– No/less accumulation of error across rendering steps
– e.g., MPEG synchronize with I-frame
Broadband Processor Architecture
© 2005 IBM Corporation
22
Floating Point Processing in the SPE
 Support standard IEEE FP and graphics-optimized arithmetic
– Floating-point naturally saturating
– Always use IEEE data layout
 Graphics oriented SP-FP using IEEE compatible data layout
– Extended range / Simplified format
• Improved estimate instructions to reduce “banding”
– Media workloads impose realtime processing requirements
• No floating point exceptions
– Graphics oriented SP-FP mode added to Cell Power
Processor Element
 DP-FP follows IEEE conventions
Broadband Processor Architecture
© 2005 IBM Corporation
23
Cell: a Synergistic System Architecture
 Cell is not a collection of different processors, but a synergistic whole
– Operation paradigms, data formats and semantics consistent
– Share address translation and memory protection model
 SPE optimized for efficient data processing
– SPEs share Cell system functions provided by Power Architecture
– SMF implements interface to memory
• Copy in/copy out to local storage
 Power Architecture provides system functions
– Virtualization
– Address translation and protection
– External exception handling
 EIB (Element Interconnect Bus) integrates system as data transport hub
Broadband Processor Architecture
© 2005 IBM Corporation
24
Compiling for Cell
 The lesson of “RISC computing”
– Architecture provides fast, streamlined primitives to compiler
– Compiler uses primitives to implement higher-level idioms
– If the compiler can’t target it  do not include in architecture
 Compiler focus throughout project
– Prototype compiler soon after first proposal
– Cell compiler team has made significant advances in
• Automatic SIMD code generation
• Automatic parallelization
• Data privatization
Broadband Processor Architecture
© 2005 IBM Corporation
25
Permute Unit
Load-Store Unit
Floating-Point Unit
Fixed-Point Unit
Branch Unit
Channel Unit
Result Forwarding and Staging
Register File
Local Store
(256kB)
Single Port SRAM
128B Read 128B Write
DMA Unit
Instruction Issue Unit / Instruction Line Buffer
8 Byte/Cycle 16 Byte/Cycle 128 Byte/Cycle
64 Byte/Cycle
On-Chip Coherent Bus
SPE Block Diagram
Broadband Processor Architecture
© 2005 IBM Corporation
26
SPE PIPELINE FRONT END
SPE PIPELINE BACK END
Branch Instruction
Load/Store Instruction
IF Instruction Fetch
IB Instruction Buffer
ID Instruction Decode
IS Instruction Issue
RF Register File Access
EX Execution
WB Write Back
Floating Point Instruction
Permute Instruction
EX1 EX2 EX3 EX4
EX1 EX2 EX3 EX4 EX5 EX6
EX1 EX2 WB
WB
RF1 RF2
WB
Fixed Point Instruction
EX1 EX2 EX3 EX4 EX5 EX6 WB
IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2
SPE Pipeline
Broadband Processor Architecture
© 2005 IBM Corporation
27
Cell Broadband Engine
Broadband Processor Architecture
© 2005 IBM Corporation
28
Cell Implementation Characteristics
 241M transistors
 235mm2
 Design operates across wide
frequency range
– Optimize for power & yield
 > 200 GFlops (SP) @3.2GHz
 > 20 GFlops (DP) @3.2GHz
 Up to 25.6 GB/s memory bandwidth
 Up to 75 GB/s I/O bandwidth
 100+ simultaneous bus
transactions
– 16+8 entry DMA queue per SPE
Relative
Frequency Increase vs. Power Consumption
Voltage
Broadband Processor Architecture
© 2005 IBM Corporation
29
Conclusion
 Single chip heterogeneous multicore system takes chip
performance to a new level
– By exploiting thread level parallelism
– By exploiting instruction level parallelism
– By exploiting data level parallelism
 Novel pervasive vector-centric architecture
– Data parallelism at the center
– Redefines high-performance architecture
 SPE compute engine offers high compute density
– Power-efficient
– Area-efficient
 Cell delivers unprecedented supercomputer power for
consumer applications
Broadband Processor Architecture
© 2005 IBM Corporation
30
Additional Information
 Additional details on the SPE architecture and the
Cell implementation can be found at
– http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell
– http://www.research.ibm.com/cell
Broadband Processor Architecture
© 2005 IBM Corporation
31
© Copyright International Business Machines Corporation 2005.
All Rights Reserved. Printed in the United States August 2005.
The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
IBM IBM Logo Power Architecture
Other company, product and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document are
NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could resul
in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change
IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity
under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific
environments, and is presented as an illustration. The results obtained in other operating environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied
upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable
for damages arising directly or indirectly from any use of the information contained in this document.

M. Gschwind, A novel SIMD architecture for the Cell heterogeneous chip multiprocessor

  • 1.
    Broadband Processor Architecture ©2005 IBM Corporation A novel SIMD architecture for the Cell heterogeneous chip-multiprocessor Michael Gschwind, Peter Hofstee, Brian Flachs, Martin Hopkins, Yukio Watanabe, Takeshi Yamazaki
  • 2.
    Broadband Processor Architecture ©2005 IBM Corporation 2 Acknowledgements  Cell is the result of a partnership between SCEI/Sony, Toshiba, and IBM  Cell represents the work of more than 400 people starting in 2000 and a design investment of about $400M
  • 3.
    Broadband Processor Architecture ©2005 IBM Corporation 3 Cell Design Goals  Provide the platform for the future of computing – 10× performance of desktop systems shipping in 2005 – Ability to reach 1 TF with a 4-node Cell system  Computing density as main challenge – Dramatically increase performance per X • X = Area, Power, Volume, Cost,…  Single core designs offer diminishing returns on investment – In power, area, design complexity and verification cost  Exploit application parallelism to provide a quantum leap in performance
  • 4.
    Broadband Processor Architecture ©2005 IBM Corporation 4 Shifting the Balance of Power with Cell  Today’s architectures are built on a 40 year old data model – Efficiency as defined in 1964 – Big overhead per data operation – Data parallelism added as an after-thought  Cell provides parallelism at all levels of system abstraction – Thread-level parallelism  multi-core design approach – Instruction-level parallelism  statically scheduled & power aware – Data parallelism  data-paralleI instructions  Data processor instead of control system
  • 5.
    Broadband Processor Architecture ©2005 IBM Corporation 5 Cell Features  Heterogeneous multi- core system architecture – Power Processor Element for control tasks – Synergistic Processor Elements for data-intensive processing  Synergistic Processor Element (SPE) consists of – Synergistic Processor Unit (SPU) – Synergistic Memory Flow Control (SMF) • Data movement and synchronization • Interface to high- performance Element Interconnect Bus 16B/cycle (2x) 16B/cycle BIC FlexIOTM MIC Dual XDRTM 16B/cycle EIB (up to 96B/cycle) 16B/cycle 64-bit Power Architecture with VMX PPE SPE LS SXU SPU SMF PXU L1 PPU 16B/cycle L2 32B/cycle LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF LS SXU SPU SMF
  • 6.
    Broadband Processor Architecture ©2005 IBM Corporation 6 Powering Cell – the Synergistic Processor Unit VRF VRF PERM PERM LSU LSU V V F F P P U U V V F F X X U U Local Store Local Store Single Port Single Port SRAM SRAM Issue / Issue / Branch Branch Fetch ILB 16B x 2 2 instructions 16B x 3 x 2 64B 64B 16B 16B 128B 128B
  • 7.
    Broadband Processor Architecture ©2005 IBM Corporation 7 Density Computing in SPEs  Today, execution units only fraction of core area and power – Bigger fraction goes to other functions • Address translation and privilege levels • Instruction reordering • Register renaming • Cache hierarchy  Cell changes this ratio to increase performance per area and power – Architectural focus on data processing • Wide datapaths • More and wide architectural registers • Data privatization and single level processor-local store • All code executes in a single (user) privilege level • Static scheduling
  • 8.
    Broadband Processor Architecture ©2005 IBM Corporation 8 Streamlined Architecture  Architectural focus on simplicity – Aids achievable operating frequency – Optimize circuits for common performance case – Compiler aids in layering traditional hardware functions – Leverage 20 years of architecture research  Focus on statically scheduled data parallelism – Focus on data parallel instructions • No separate scalar execution units • Scalar operations mapped onto data parallel dataflow – Exploit wide data paths • data processing • instruction fetch – Address impediments to static scheduling • Large register set • Reduce latencies by eliminating non-essential functionality
  • 9.
    Broadband Processor Architecture ©2005 IBM Corporation 9 data parallel select static prediction simpler µarch sequential fetch local store high frequency shorter pipeline large basic blocks determ. latency large register file ILP opt. instruction bundling DLP wide data paths compute density Synergistic Processing single port static scheduling scalar layering
  • 10.
    Broadband Processor Architecture ©2005 IBM Corporation 10 Architected Registers have Lagged Behind Latency 0 5 10 15 20 25 30 35 40 45 50 1989 1993 1997 2001 _ _ (2FPU) high-freq core (1FPU) extrapolated core generation register pressure (2 unique operands per FMA) FMA latency* FMAs L1 latency* FMAs L2 latency* FMAs mem latency* FMAs 32 registers since RISC introduced
  • 11.
    Broadband Processor Architecture ©2005 IBM Corporation 11 Restoring the Architectural Register Balance  Unified 128 entry, 128b wide register file – Used as 128x1, 64x2, 32x4, 16x8, 8x16, 1x128 – Emphasis on 32x4 integer and FP arithmetic – Register file also stores addresses and condition values  Unified register file gives programmers more flexibility in resource allocation – Stores all data types • address, integer, SP/DP FP, Boolean – Stores scalar and vector SIMD data  Large register file facilitates static scheduling and loop optimization for latency hiding
  • 12.
    Broadband Processor Architecture ©2005 IBM Corporation 12 Pervasive Data Parallel Computing (PDPC)  Scalar processing supported on data-parallel substrate – All instructions are data parallel and operate on vectors of elements – Scalar operation defined by instruction use, not opcode • Vector instruction form used to perform operation  Preferred slot paradigm – Scalar arguments to instructions found in “preferred slot” – Computation can be performed in any slot
  • 13.
    Broadband Processor Architecture ©2005 IBM Corporation 13 Register Scalar Data Layout  Preferred slot in bytes 0-3 – By convention for procedure interfaces – Used by instructions expecting scalar data • Addresses, branch conditions, generate controls for insert
  • 14.
    Broadband Processor Architecture ©2005 IBM Corporation 14 PDPC Memory Access Architecture  Data memory interface optimized for quadword access – Simplified alignment network – Always access 128 bits of data  Processing scalar data – Extracted using rotate – Stored with read-modify-write sequence  Four address formats – register + displacement, register-indexed – PC-relative, absolute address  Local Store Limit Register (LSLR) – Address mask register to define maximum address range – Addresses wrap at LSLR value
  • 15.
    Broadband Processor Architecture ©2005 IBM Corporation 15 PDPC Branch Architecture  Register-indirect target address in preferred slot of general purpose register – Indirect branch (function return) – Indirect prepare-to-branch  Link address deposited in preferred slot of specified general purpose register – Register 0 ($ra) used as return register by software convention
  • 16.
    Broadband Processor Architecture ©2005 IBM Corporation 16 A Statically Scheduled PDPC Architecture  Bundle concept – Sequential semantics – Static scheduling key ingredient to great performance – Strong performance bias to properly aligned instructions • Align instructions to map on simplified issue routing logic • Maximize fetch group utilization  Branch and control architecture – Static branch prediction with Prepare-to-branch instruction – Simplify implementations, improve utility of sequential fetch  Explicit inline instruction fetch – To avoid fetch starvation during bursts of high-priority load/store traffic – Compiler manages 256KB unified, single ported memory
  • 17.
    Broadband Processor Architecture ©2005 IBM Corporation 17 Data Selection Architecture  Branch is scalar, select is data parallel – Branch inherently inefficient  No exception  safe to compute both paths through conditional assignment  Use data parallel compute flow – Data parallel if-conversion – Select instruction independently selects result for each slot – Compare instructions generate type-specific mask fields
  • 18.
    Broadband Processor Architecture ©2005 IBM Corporation 18 Data Parallel Select Operation for (i =0; i< VL; i++) if (a[i]>b[i]) m[i] = a[i]*2; else m[i] = b[i]*3; a[0]>b[0] m[0]=a[0]*2; m[0]=b[0]*3; a[1]>b[1] m[1]=a[1]*2; m[1]=b[1]*3; a[2]>b[2] m[2]=a[2]*2; m[2]=b[2]*3; a[3]>b[3] m[3]=a[3]*2; m[3]=b[3]*3; a’[0] =a[0]*2; b’[0]=b[0]*3; s[0]=a[0]>b[0] m[0]=s[0]? a’[0]:b’[0] a’[1]=a[1]*2; b’[1]=b[1]*3; s[1]=a[1]>b[1] m[1]=s[1]? a’[1]:b’[1] a’[2]=a[2]*2; b’[2]=b[2]*3; s[2]=a[2]>b[2] m[2]=s[2]? a’[2]:b’[2] a’[3]=a[3]*2; b’[3]=b[3]*3; s[3]=a[3]>b[3] m[3]=s[3]? a’[3]:b’[3] Exploit data parallelism Long latency
  • 19.
    Broadband Processor Architecture ©2005 IBM Corporation 19 Compare and Data Selection Architecture  Only limited number of compares – Compare for equality • CEQ – Compare for ordering • CGT, CGTL – Other conditions can be derived • operand order, inverted condition  Generates mask of data type width for use with select – Data-parallel conditional operation
  • 20.
    Broadband Processor Architecture ©2005 IBM Corporation 20 Integer Data Processing for Advanced Media Computing  Ensure data fidelity – Avoid accumulated saturation error in content creation – Emphasis on data quality first, quantity second – Concentrate on 16b and 32b integer data – Concentrate on modulo arithmetic • Consistent with common C/C++/Java semantics  Selected byte operations to maximize performance on key operations – Encryption/decryption performance • Security in electronic transactions will be key to enable future business models – Motion estimation • Video encoding
  • 21.
    Broadband Processor Architecture ©2005 IBM Corporation 21 Reconciling Data Fidelity and Compact Data Representation  Use wide data types for intermediate results in computation – Avoid saturation of intermediate results – Prevent loss of dynamic range  Pack-and-saturate operation to store data in dense memory format – Keep data footprint in memory compact – Pack and saturate once for bulk data storage in memory  Saturating computation only important for low-cost content rendering devices – No/less accumulation of error across rendering steps – e.g., MPEG synchronize with I-frame
  • 22.
    Broadband Processor Architecture ©2005 IBM Corporation 22 Floating Point Processing in the SPE  Support standard IEEE FP and graphics-optimized arithmetic – Floating-point naturally saturating – Always use IEEE data layout  Graphics oriented SP-FP using IEEE compatible data layout – Extended range / Simplified format • Improved estimate instructions to reduce “banding” – Media workloads impose realtime processing requirements • No floating point exceptions – Graphics oriented SP-FP mode added to Cell Power Processor Element  DP-FP follows IEEE conventions
  • 23.
    Broadband Processor Architecture ©2005 IBM Corporation 23 Cell: a Synergistic System Architecture  Cell is not a collection of different processors, but a synergistic whole – Operation paradigms, data formats and semantics consistent – Share address translation and memory protection model  SPE optimized for efficient data processing – SPEs share Cell system functions provided by Power Architecture – SMF implements interface to memory • Copy in/copy out to local storage  Power Architecture provides system functions – Virtualization – Address translation and protection – External exception handling  EIB (Element Interconnect Bus) integrates system as data transport hub
  • 24.
    Broadband Processor Architecture ©2005 IBM Corporation 24 Compiling for Cell  The lesson of “RISC computing” – Architecture provides fast, streamlined primitives to compiler – Compiler uses primitives to implement higher-level idioms – If the compiler can’t target it  do not include in architecture  Compiler focus throughout project – Prototype compiler soon after first proposal – Cell compiler team has made significant advances in • Automatic SIMD code generation • Automatic parallelization • Data privatization
  • 25.
    Broadband Processor Architecture ©2005 IBM Corporation 25 Permute Unit Load-Store Unit Floating-Point Unit Fixed-Point Unit Branch Unit Channel Unit Result Forwarding and Staging Register File Local Store (256kB) Single Port SRAM 128B Read 128B Write DMA Unit Instruction Issue Unit / Instruction Line Buffer 8 Byte/Cycle 16 Byte/Cycle 128 Byte/Cycle 64 Byte/Cycle On-Chip Coherent Bus SPE Block Diagram
  • 26.
    Broadband Processor Architecture ©2005 IBM Corporation 26 SPE PIPELINE FRONT END SPE PIPELINE BACK END Branch Instruction Load/Store Instruction IF Instruction Fetch IB Instruction Buffer ID Instruction Decode IS Instruction Issue RF Register File Access EX Execution WB Write Back Floating Point Instruction Permute Instruction EX1 EX2 EX3 EX4 EX1 EX2 EX3 EX4 EX5 EX6 EX1 EX2 WB WB RF1 RF2 WB Fixed Point Instruction EX1 EX2 EX3 EX4 EX5 EX6 WB IF1 IF2 IF3 IF4 IF5 IB1 IB2 ID1 ID2 ID3 IS1 IS2 SPE Pipeline
  • 27.
    Broadband Processor Architecture ©2005 IBM Corporation 27 Cell Broadband Engine
  • 28.
    Broadband Processor Architecture ©2005 IBM Corporation 28 Cell Implementation Characteristics  241M transistors  235mm2  Design operates across wide frequency range – Optimize for power & yield  > 200 GFlops (SP) @3.2GHz  > 20 GFlops (DP) @3.2GHz  Up to 25.6 GB/s memory bandwidth  Up to 75 GB/s I/O bandwidth  100+ simultaneous bus transactions – 16+8 entry DMA queue per SPE Relative Frequency Increase vs. Power Consumption Voltage
  • 29.
    Broadband Processor Architecture ©2005 IBM Corporation 29 Conclusion  Single chip heterogeneous multicore system takes chip performance to a new level – By exploiting thread level parallelism – By exploiting instruction level parallelism – By exploiting data level parallelism  Novel pervasive vector-centric architecture – Data parallelism at the center – Redefines high-performance architecture  SPE compute engine offers high compute density – Power-efficient – Area-efficient  Cell delivers unprecedented supercomputer power for consumer applications
  • 30.
    Broadband Processor Architecture ©2005 IBM Corporation 30 Additional Information  Additional details on the SPE architecture and the Cell implementation can be found at – http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell – http://www.research.ibm.com/cell
  • 31.
    Broadband Processor Architecture ©2005 IBM Corporation 31 © Copyright International Business Machines Corporation 2005. All Rights Reserved. Printed in the United States August 2005. The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both. IBM IBM Logo Power Architecture Other company, product and service names may be trademarks or service marks of others. All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could resul in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary. While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.