IBM Research
© 2008
Feeding the
Multicore Beast:
It’s All About the Data!
Michael Perrone
IBM Master Inventor
Mgr, Cell Solutions Dept.
mpp@us.ibm.com
Outline
▪ History: Data challenge
▪ Motivation for multicore
▪ Implications for programmers
▪ How Cell addresses these implications
▪ Examples
• 2D/3D FFT
– Medical Imaging, Petroleum, general HPC…
• Green’s Functions
– Seismic Imaging (Petroleum)
• String Matching
– Network Processing: DPI & Intrusion Detection
• Neural Networks
– Finance
Chapter 1:
The Beast is Hungry!
The Hungry Beast
[Figure: Data (“food”) reaches the Processor (“beast”) through the Data Pipe]
▪ Pipe too small = starved beast
▪ Pipe big enough = well-fed beast
▪ Pipe too big = wasted resources
▪ If flops grow faster than pipe capacity… the beast gets hungrier!
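The relationship between pipe size and a starved beast can be put into rough numbers. The sketch below is a simple bandwidth-bound model, not something taken from the slides: the function names, the 25 GB/s pipe and the 2 flops-per-byte kernel are all hypothetical values chosen only to show the trend when peak flops grow and the pipe does not.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Upper bound on sustained GFLOPS for a kernel.

    The processor cannot run faster than its peak, and it cannot consume
    data faster than the pipe delivers it.
    """
    fed_rate = bandwidth_gb_s * flops_per_byte  # GFLOPS the pipe can feed
    return min(peak_gflops, fed_rate)


def is_starved(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """True when the pipe, not the processor, limits performance."""
    return bandwidth_gb_s * flops_per_byte < peak_gflops


# Hypothetical numbers, chosen only to illustrate the trend.
pipe = 25.0        # GB/s, held fixed while peak flops grow
intensity = 2.0    # flops performed per byte moved
for peak in (25.0, 50.0, 100.0, 200.0):
    rate = attainable_gflops(peak, pipe, intensity)
    print(f"peak {peak:6.1f} GFLOPS -> attainable {rate:6.1f} "
          f"({100 * rate / peak:5.1f}% of peak)")
```

Under these assumed numbers the attainable rate stops growing once peak passes 50 GFLOPS, so every further increase in peak only lowers the fraction of peak achieved.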
Move the food closer
▪ Example: Intel Tulsa
– Xeon MP 7100 series
– 65nm, 349mm², 2 Cores
– 3.4 GHz @ 150W
– ~54.4 SP GFlops
– http://www.intel.com/products/processor/xeon/index.htm
▪ Large cache on chip
– ~50% of area
– Keeps data close for efficient access
▪ If the data is local, the beast is happy!
– True for many algorithms
What happens if the beast is still hungry?
[Figure labels: Data, Cache]
▪ If the data set doesn’t fit in cache
– Cache misses
– Memory latency exposed
– Performance degraded
▪ Several important application classes don’t fit
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger
[Figure labels: Data, Cache]
▪ Cache size steadily increasing
▪ Implications
– Chip real estate reserved for cache
– Less space on chip for computes
– More power required for fewer FLOPS
▪ But…
– Important application working sets are growing faster
– Multicore even more demanding on cache than uni-core
Chapter 2:
The Beast Has Babies
Power Density – The fundamental problem
[Figure: Power density (W/cm², log scale from 1 to 1000) by process generation, with axis ticks from 1.5 down to 0.07, for the i386, i486, Pentium®, Pentium Pro®, Pentium II® and Pentium III®; reference levels: Hot Plate, Nuclear Reactor]
Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32
What’s causing the problem?
Gate dielectric approaching a fundamental limit (a few atomic layers)
[Figure: Power density (W/cm², log scale from 0.001 to 1000) versus gate length (microns, from 1 down to 0.01), with the 65 nm point marked; inset label: Gate Stack, Tox=11A]
Power, signal jitter, etc...
Diminishing Returns on Frequency
[Figure: Clock speed (MHz, log scale from 10² to 10⁴) versus year, 1990 to 2010, with the frequency-driven design points marked]
In a power-constrained environment, chip clock speed yields diminishing
returns. The industry has moved to lower-frequency multicore architectures.
Power vs Performance Trade Offs
[Figure: Relative Power versus Relative Performance]
We need to adapt our algorithms to
get performance out of multicore
Implications of Multicore
▪ There are more mouths to feed
– Data movement will take center stage
▪ Complexity of cores will stop increasing
… and has started to decrease in some cases
▪ Complexity increases will center around communication
▪ Assumption
– Achieving a significant % of peak performance is important
Chapter 3:
The Proper Care and Feeding
of Hungry Beasts
Cell/B.E. Processor: 200GFLOPS (SP) @ ~70W
Feeding the Cell Processor
▪ 8 SPEs each with
– LS
– MFC
– SXU
▪ PPE
– OS functions
– Disk IO
– Network IO
[Figure: Cell/B.E. block diagram. The PPE (64-bit Power Architecture with VMX; labels PPU, PXU, L1, L2) and eight SPEs (each labeled SPU, SXU, LS, MFC) attach to the EIB (up to 96B/cycle), together with the MIC (Dual XDR™) and the BIC (FlexIO™). Link labels: 16B/cycle, (2x)16B/cycle, 32B/cycle]
Cell Approach: Feed the beast more efficiently
▪ Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
– Use SPE’s DMA engine to gather & scatter data between main memory and local store
– Enables detailed programmer control of data flow
• Get/Put data when & where you want it
• Hides latency: Simultaneous reads, writes & computes
– Avoids restrictive HW cache management
• Unlikely to determine optimal data flow
• Potentially very inefficient
– Allows more efficient use of the existing bandwidth
▪ BOTTOM LINE: It’s all about the data!
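To see why overlapping reads, writes and computes hides latency, a small timing model helps. The sketch below is not Cell SDK code; it is a hypothetical model in which gets, computes and puts each proceed in order and can overlap one another, and a get may only start once a buffer is free. The function names and the timings in the example are assumptions.

```python
def serial_time(n_blocks, t_get, t_compute, t_put):
    """Time when every block is fetched, computed and written back in turn."""
    return n_blocks * (t_get + t_compute + t_put)


def buffered_time(n_blocks, t_get, t_compute, t_put, n_buffers=2):
    """Finish time of the last put when n_buffers local buffers are cycled.

    Gets, computes and puts are each issued in block order and may overlap
    with one another. Block i reuses the buffer of block i - n_buffers, so
    its get cannot start before that block's put has finished.
    """
    get_done, comp_done, put_done = [], [], []
    for i in range(n_blocks):
        prev_get = get_done[i - 1] if i >= 1 else 0.0
        buffer_free = put_done[i - n_buffers] if i >= n_buffers else 0.0
        g = max(prev_get, buffer_free) + t_get
        prev_comp = comp_done[i - 1] if i >= 1 else 0.0
        c = max(g, prev_comp) + t_compute
        prev_put = put_done[i - 1] if i >= 1 else 0.0
        p = max(c, prev_put) + t_put
        get_done.append(g)
        comp_done.append(c)
        put_done.append(p)
    return put_done[-1] if put_done else 0.0


# Hypothetical timings in arbitrary units.
print(serial_time(4, 1, 3, 1))        # one buffer, nothing overlaps
print(buffered_time(4, 1, 3, 1))      # two buffers, transfers overlap compute
```

In this model, when compute time per block exceeds transfer time, the double-buffered total approaches the compute time alone plus one get and one put at the ends.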
Cell Comparison: ~4x the FLOPS @ ~½ the power
Both 65nm technology
(to scale)
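The "~4x" and "~½" figures can be checked against the numbers quoted on the earlier slides, assuming the comparison is between the Cell/B.E. (200 SP GFLOPS at about 70W) and the Intel Tulsa part (about 54.4 SP GFlops at 150W). The variable names are illustrative.

```python
# Figures as quoted on the earlier slides.
cell_gflops, cell_watts = 200.0, 70.0      # Cell/B.E. processor
tulsa_gflops, tulsa_watts = 54.4, 150.0    # Intel Tulsa, Xeon MP 7100 series

flops_ratio = cell_gflops / tulsa_gflops       # how many times the FLOPS
power_ratio = cell_watts / tulsa_watts         # fraction of the power
efficiency_ratio = flops_ratio / power_ratio   # ratio of GFLOPS per watt

print(f"FLOPS ratio:      {flops_ratio:.2f}x")
print(f"Power ratio:      {power_ratio:.2f}x")
print(f"GFLOPS/W ratio:   {efficiency_ratio:.2f}x")
```

With these inputs the FLOPS ratio comes out somewhat under 4 and the power ratio somewhat under one half, consistent with the rounded figures in the slide title.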
Memory Managing Processor vs. Traditional General Purpose Processor
[Figure labels: IBM, AMD, Intel, Cell BE]
Examples of Feeding Cell
▪ 2D and 3D FFTs
▪ Seismic Imaging
▪ String Matching
▪ Neural Networks (function approximation)
Feeding FFTs to Cell
[Figure labels: Input Image, Buffer, Tile, Transposed Tile, Transposed Buffer, Transposed Image]
▪ SIMDized data
▪ DMAs double buffered
▪ Pass 1: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
▪ Pass 2: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
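The two-pass structure works because a 2D transform separates into 1D transforms along rows, a transpose, 1D transforms along the new rows, and a second transpose. The sketch below checks that identity in plain Python. It uses a naive O(n²) DFT in place of an FFT, ignores buffering, tiling and SIMD, and all function names are illustrative rather than taken from the talk.

```python
import cmath


def dft(row):
    """Naive 1D discrete Fourier transform (stand-in for a real FFT)."""
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def transform_2d_by_passes(image):
    """Two identical passes: 1D transforms along rows, then transpose."""
    pass1 = transpose([dft(row) for row in image])   # rows now hold columns
    return transpose([dft(row) for row in pass1])    # back to original layout


def transform_2d_direct(image):
    """Direct 2D DFT from the definition, used only as a reference."""
    rows, cols = len(image), len(image[0])
    return [[sum(image[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)]
            for u in range(rows)]


image = [[float((3 * r + 5 * c) % 7) for c in range(4)] for r in range(4)]
by_passes = transform_2d_by_passes(image)
direct = transform_2d_direct(image)
max_error = max(abs(a - b) for ra, rb in zip(by_passes, direct) for a, b in zip(ra, rb))
print(f"largest difference between the two methods: {max_error:.2e}")
```

Because each pass reads and writes whole rows, every transfer is a contiguous block, which is what makes the per-buffer DMA Get and Put steps practical.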
3D FFTs
Long stride thrashes cache
Cell DMA allows prefetch
[Figure: 3D array with strides 1, N and N²; labels: Single Element, Data envelope]
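The cost of long strides can be illustrated by counting how many cache lines a single 1D transform touches along each axis of an N×N×N array. This sketch assumes the array is stored with x varying fastest, and the line size of 16 elements is a hypothetical value; the function names are illustrative.

```python
def linear_index(x, y, z, n):
    """Offset of element (x, y, z) in an n*n*n array with x varying fastest."""
    return x + n * y + n * n * z


def pencil_offsets(axis, n):
    """Offsets of the n elements along one axis, starting at the origin."""
    stride = {"x": 1, "y": n, "z": n * n}[axis]
    return [i * stride for i in range(n)]


def cache_lines_touched(offsets, elements_per_line):
    """Number of distinct cache lines that contain the given offsets."""
    return len({offset // elements_per_line for offset in offsets})


n = 64                   # array is n x n x n
elements_per_line = 16   # hypothetical cache line capacity, in elements
for axis in ("x", "y", "z"):
    lines = cache_lines_touched(pencil_offsets(axis, n), elements_per_line)
    print(f"axis {axis}: {n} elements spread over {lines} cache lines")
```

Along x every fetched line is fully used. Along y and z each element sits in its own line, so most of every line fetched is wasted unless the needed elements are gathered explicitly.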
Feeding Seismic Imaging to Cell
▪ New G at each (x,y)
▪ Radial symmetry of G reduces BW requirements
$$\sum_{i,j} D(x+i,\, y+j)\, G(x, y, i, j)$$
[Figure labels: Data, Green’s Function, (X,Y)]
Feeding Seismic Imaging to Cell
[Figure labels: Data; SPE 0 through SPE 7]
Feeding Seismic Imaging to Cell
▪ For each X
– Load next column of data
– Load next column of indices
– For each Y
• Load Green’s functions
• SIMDize Green’s functions
• Compute convolution at (X,Y)
– Cycle buffers
[Figure labels: Data buffer, Green’s Index buffer, (X,Y), H, 2R+1, R, 1, 2]
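The loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the talk's implementation: it takes the convolution at (X,Y) to be the sum over offsets (i, j) of D(x+i, y+j)·G(x, y, i, j), and it models radial symmetry by storing G once per rounded radius and looking it up through an index table. The names `radius_index_table`, `convolve` and `radial_g` are hypothetical, and buffering and SIMD are left out.

```python
import math


def radius_index_table(R):
    """Map each offset (i, j), |i| and |j| <= R, to a rounded-radius bin."""
    return {(i, j): round(math.hypot(i, j))
            for i in range(-R, R + 1) for j in range(-R, R + 1)}


def convolve(data, radial_g, R):
    """Apply a position-dependent, radially symmetric kernel to data[x][y].

    radial_g(x, y) returns one kernel value per radius bin, so only
    round(R * sqrt(2)) + 1 values are loaded per point instead of (2R+1)**2.
    Points closer than R to the border are left at zero.
    """
    width, height = len(data), len(data[0])
    index = radius_index_table(R)
    out = [[0.0] * height for _ in range(width)]
    for x in range(R, width - R):            # for each X
        for y in range(R, height - R):       # for each Y
            g = radial_g(x, y)               # new G at each (x, y)
            out[x][y] = sum(data[x + i][y + j] * g[index[(i, j)]]
                            for i in range(-R, R + 1)
                            for j in range(-R, R + 1))
    return out


# Tiny example: constant data and a kernel that does not vary with position.
data = [[1.0] * 4 for _ in range(4)]
result = convolve(data, lambda x, y: [1.0, 0.5], R=1)
print(result[1][1])
```

With R = 1 the full kernel has 9 entries but only 2 distinct radial values, which is the bandwidth saving the radial symmetry buys in this model.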
Feeding String Matching to Cell
▪ Find (lots of) substrings in (long) string
▪ Build graph of words & represent as DFA
▪ Problem: Graph doesn’t fit in LS
Sample Word List:
“the”
“that”
“math”
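One standard way to turn a word list into a single automaton is the Aho-Corasick construction, sketched below for the sample words. The slides do not say which construction was used, so treat this as an illustrative stand-in. It builds a trie with failure links rather than a fully expanded DFA table, and it does not address the local store size problem.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton (trie plus failure links)."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # longest proper suffix state for each state
    output = [[]]    # words recognized on entering each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    queue = deque(goto[0].values())   # depth-1 states fail back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (start position, word) for every match, in one pass over text."""
    goto, fail, output = automaton
    state, matches = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((pos - len(word) + 1, word))
    return matches


automaton = build_automaton(["the", "that", "math"])
print(find_all("the math that", automaton))
print(find_all("mathe", automaton))   # overlapping matches are found too
```

Each input character causes a lookup at a data-dependent state, so the memory access pattern is unpredictable. That makes latency the dominant cost once the graph lives in main memory rather than local store.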
Hiding Main Memory Latency
Software Multithreading
Feeding Neural Networks to Cell
▪ Neural net function F(X)
– RBF, MLP, KNN, etc.
▪ If too big for LS, BW Bound
[Figure: input X, output F; N basis functions (dot product + nonlinearity), D input dimensions, DxN matrix of parameters]
Convert BW Bound to Compute Bound
▪ Split function over multiple SPEs
▪ Avoids unnecessary memory traffic
▪ Reduce compute time per SPE
▪ Minimal merge overhead
[Figure label: Merge]
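The split-and-merge idea can be sketched as follows. This is a hypothetical model, not the talk's code: it assumes F(X) is the plain sum of N basis outputs, each a tanh of a dot product with one parameter column, and it stands in for SPEs with ordinary list slices. In the real setting each slice would stay resident in one SPE's local store, so parameters are not re-streamed from main memory for every input.

```python
import math


def basis_outputs_sum(x, columns):
    """Sum of tanh(dot(x, column)) over the given parameter columns."""
    return sum(math.tanh(sum(xd * pd for xd, pd in zip(x, col)))
               for col in columns)


def split_columns(columns, n_workers):
    """Divide the columns into n_workers contiguous, near-equal slices."""
    base, extra = divmod(len(columns), n_workers)
    slices, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        slices.append(columns[start:start + size])
        start += size
    return slices


def evaluate_split(x, columns, n_workers):
    """Each worker evaluates its own slice; the merge adds one number each."""
    partials = [basis_outputs_sum(x, part)
                for part in split_columns(columns, n_workers)]
    return sum(partials)


# Hypothetical parameters: N = 10 basis functions, D = 3 input dimensions.
columns = [[0.1 * (n + 1), -0.05 * n, 0.02 * (n - 3)] for n in range(10)]
x = [1.0, 2.0, -1.0]
whole = basis_outputs_sum(x, columns)
split = evaluate_split(x, columns, n_workers=8)
print(f"single worker: {whole:.6f}, eight workers: {split:.6f}")
```

The merge moves only one value per worker, which is why its overhead is small compared with streaming the full DxN parameter matrix.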
Moral of the Story:
It’s All About the Data!
▪ The data problem is growing: multicore
▪ Intelligent software prefetching
– Use DMA engines
– Don’t rely on HW prefetching
▪ Efficient data management
– Multibuffering: Hide the latency!
– BW utilization: Make every byte count!
– SIMDization: Make every vector count!
– Problem/data partitioning: Make every core work!
– Software multithreading: Keep every core busy!
Backup
Abstract
Technological obstacles have prevented the microprocessor
industry from achieving increased performance through increased
chip clock speeds. In reaction to these restrictions, the industry
has chosen the multicore processor path. Multicore processors
promise tremendous GFLOPS performance but raise the challenge
of how one programs them. In this talk, I will discuss the motivation
for multicore, the implications for programmers and how the
Cell/B.E. processor’s design addresses these challenges. As an
example, I will review one or two applications that highlight the
strengths of Cell.

The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 

Feeding the Multicore Beast: It’s All About the Data!

  • 1. IBM Research © 2008. Feeding the Multicore Beast: It’s All About the Data! Michael Perrone, IBM Master Inventor; Mgr, Cell Solutions Dept.
  • 2. Outline
      - History: Data challenge
      - Motivation for multicore
      - Implications for programmers
      - How Cell addresses these implications
      - Examples
          - 2D/3D FFT: Medical Imaging, Petroleum, general HPC…
          - Green’s Functions: Seismic Imaging (Petroleum)
          - String Matching: Network Processing (DPI & Intrusion Detection)
          - Neural Networks: Finance
  • 3. Chapter 1: The Beast is Hungry!
  • 4. The Hungry Beast
      - [Diagram: Processor (“beast”) fed by Data (“food”) through a Data Pipe]
      - Pipe too small = starved beast
      - Pipe big enough = well-fed beast
      - Pipe too big = wasted resources
  • 5. The Hungry Beast
      - [Diagram: Processor (“beast”) fed by Data (“food”) through a Data Pipe]
      - Pipe too small = starved beast
      - Pipe big enough = well-fed beast
      - Pipe too big = wasted resources
      - If flops grow faster than pipe capacity… the beast gets hungrier!
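The pipe metaphor can be put into numbers with a small balance calculation. The sketch below is not from the slides: it assumes the common simplification that the attainable rate is the smaller of peak compute and bandwidth times arithmetic intensity, and every figure in it (50 GFLOPS, 10 GB/s, the growth factors) is hypothetical.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Rate is capped by compute or by the data pipe, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def balance_point(peak_gflops, bandwidth_gb_s):
    """Flops per byte a kernel needs before the pipe stops being the limit."""
    return peak_gflops / bandwidth_gb_s


# Hypothetical chip generations: flops double each step, the pipe grows 1.3x.
peak, bandwidth = 50.0, 10.0
history = []
for generation in range(4):
    history.append((generation, peak, bandwidth, balance_point(peak, bandwidth)))
    peak *= 2.0
    bandwidth *= 1.3

for generation, p, b, balance in history:
    print(f"gen {generation}: peak={p:.0f} GFLOPS, pipe={b:.1f} GB/s, "
          f"needs {balance:.2f} flops/byte to stay fed")
```

Under these assumed growth rates the required flops per byte rises every generation, which is the "hungrier beast" in quantitative form.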
  • 6. Move the food closer
      - Example: Intel Tulsa
          - Xeon MP 7100 series
          - 65nm, 349mm², 2 Cores
          - 3.4 GHz @ 150W
          - ~54.4 SP GFlops
          - http://www.intel.com/products/processor/xeon/index.htm
      - Large cache on chip
          - ~50% of area
          - Keeps data close for efficient access
      - If the data is local, the beast is happy!
          - True for many algorithms
  • 7. What happens if the beast is still hungry?
      - [Diagram labels: Data, Cache]
      - If the data set doesn’t fit in cache
          - Cache misses
          - Memory latency exposed
          - Performance degraded
      - Several important application classes don’t fit
          - Graph searching algorithms
          - Network security
          - Natural language processing
          - Bioinformatics
          - Many HPC workloads
  • 8. Make the food bowl larger
      - [Diagram labels: Data, Cache]
      - Cache size steadily increasing
      - Implications
          - Chip real estate reserved for cache
          - Less space on chip for computes
          - More power required for fewer FLOPS
  • 9. Make the food bowl larger
      - [Diagram labels: Data, Cache]
      - Cache size steadily increasing
      - Implications
          - Chip real estate reserved for cache
          - Less space on chip for computes
          - More power required for fewer FLOPS
      - But…
          - Important application working sets are growing faster
          - Multicore even more demanding on cache than uni-core
  • 10. Chapter 2: The Beast Has Babies
  • 11. Power Density: The fundamental problem
      - [Figure: power density in W/cm² (log scale, 1 to 1000) against axis ticks running from 1.5 down to 0.07; data points for i386, i486, Pentium®, Pentium Pro®, Pentium II® and Pentium III®; reference levels labeled Hot Plate and Nuclear Reactor]
      - Source: Fred Pollack, Intel. New Microprocessor Challenges in the Coming Generations of CMOS Technologies, Micro32
  • 12. What’s causing the problem?
      - [Figure: gate stack image labeled “10S” and “Tox=11A”]
      - Gate dielectric approaching a fundamental limit (a few atomic layers)
      - [Figure: Power Density (W/cm²) versus Gate Length (microns), both on log scales; 65 nm marked]
      - Power, signal jitter, etc...
  • 13. Diminishing Returns on Frequency
      - [Figure: Clock Speed (MHz, log scale from 10² to 10⁴) versus year, 1990 to 2010; annotation: Frequency-Driven Design Points]
      - In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower frequency multicore architectures.
  • 14. Power vs Performance Trade Offs
      - [Figure: Relative Power versus Relative Performance; tick labels garbled in extraction]
      - We need to adapt our algorithms to get performance out of multicore
  • 15. Implications of Multicore
      - There are more mouths to feed
          - Data movement will take center stage
      - Complexity of cores will stop increasing… and has started to decrease in some cases
      - Complexity increases will center around communication
      - Assumption
          - Achieving a significant % of peak performance is important
  • 16. Chapter 3: The Proper Care and Feeding of Hungry Beasts
  • 17. Cell/B.E. Processor: 200GFLOPS (SP) @ ~70W
  • 18. Feeding the Cell Processor
      - 8 SPEs each with
          - LS
          - MFC
          - SXU
      - PPE
          - OS functions
          - Disk IO
          - Network IO
      - [Block diagram: PPE (64-bit Power Architecture with VMX; PPU, PXU, L1, L2) and eight SPEs (each with SPU, SXU, LS, MFC) attached to the EIB (up to 96B/cycle), together with the MIC (Dual XDR™) and the BIC (FlexIO™); links labeled 16B/cycle, (2x)16B/cycle and 32B/cycle]
  • 19. Cell Approach: Feed the beast more efficiently
      - Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
          - Use SPE’s DMA engine to gather & scatter data between main memory and local store
          - Enables detailed programmer control of data flow
              - Get/Put data when & where you want it
              - Hides latency: Simultaneous reads, writes & computes
          - Avoids restrictive HW cache management
              - Unlikely to determine optimal data flow
              - Potentially very inefficient
          - Allows more efficient use of the existing bandwidth
  • 20. Cell Approach: Feed the beast more efficiently
      - Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
          - Use SPE’s DMA engine to gather & scatter data between main memory and local store
          - Enables detailed programmer control of data flow
              - Get/Put data when & where you want it
              - Hides latency: Simultaneous reads, writes & computes
          - Avoids restrictive HW cache management
              - Unlikely to determine optimal data flow
              - Potentially very inefficient
          - Allows more efficient use of the existing bandwidth
      - BOTTOM LINE: It’s all about the data!
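The "simultaneous reads, writes & computes" point is usually realized with double buffering: the transfer of the next chunk is started before the current chunk is processed. The following is only a minimal sketch of that ordering in plain Python. It does not use any Cell SDK call; `dma_get` and `process_double_buffered` are hypothetical names, and a single worker thread stands in for the SPE's DMA engine.

```python
from concurrent.futures import ThreadPoolExecutor


def dma_get(memory, start, size):
    """Stand-in for a DMA 'get': copy one chunk of main memory to a local buffer."""
    return memory[start:start + size]


def process_double_buffered(memory, chunk_size, compute):
    """Start fetching chunk k+1 before computing on chunk k."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma_engine:
        pending = None  # transfer whose data has not been consumed yet
        for start in range(0, len(memory), chunk_size):
            # Kick off the next transfer first, so it can run during the compute.
            next_transfer = dma_engine.submit(dma_get, memory, start, chunk_size)
            if pending is not None:
                results.append(compute(pending.result()))
            pending = next_transfer
        if pending is not None:
            # Drain the last buffer once no further transfers remain.
            results.append(compute(pending.result()))
    return results


main_memory = list(range(10))
print(process_double_buffered(main_memory, 4, sum))
```

In CPython the copy and the compute will not truly overlap, so this shows the control flow rather than the speedup; on hardware with an independent DMA engine the same ordering is what hides the memory latency.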
  • 21. Cell Comparison: ~4x the FLOPS @ ~½ the power
      - Both 65nm technology (to scale)
  • 22. Memory Managing Processor vs. Traditional General Purpose Processor
      - [Figure labels: IBM, AMD, Intel, Cell BE]
  • 23. Examples of Feeding Cell
      - 2D and 3D FFTs
      - Seismic Imaging
      - String Matching
      - Neural Networks (function approximation)
  • 24. Feeding FFTs to Cell
      - [Diagram labels: Input Image, Buffer, Tile, Transposed Tile, Transposed Buffer, Transposed Image]
      - SIMDized data
      - DMAs double buffered
      - Pass 1: For each buffer
          - DMA Get buffer
          - Do four 1D FFTs in SIMD
          - Transpose tiles
          - DMA Put buffer
      - Pass 2: For each buffer
          - DMA Get buffer
          - Do four 1D FFTs in SIMD
          - Transpose tiles
          - DMA Put buffer
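The two-pass structure on this slide (row FFTs followed by a transpose, done twice) is the standard row-column decomposition of a 2D FFT. Here is a minimal sketch of that decomposition in pure Python. It leaves out the buffering, tiling, DMA and SIMD aspects entirely, and the function names are illustrative, not taken from the talk.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def fft_pass(image):
    """One pass: 1D FFT along every row, then transpose the result."""
    return transpose([fft1d(row) for row in image])


def fft2d(image):
    """Two identical passes give the full 2D transform in the original layout."""
    return fft_pass(fft_pass(image))


def dft2d_direct(image):
    """Slow reference from the 2D DFT definition, used only as a check."""
    rows, cols = len(image), len(image[0])
    return [[sum(image[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)]
            for u in range(rows)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
fast = fft2d(image)
slow = dft2d_direct(image)
max_error = max(abs(fast[u][v] - slow[u][v]) for u in range(4) for v in range(4))
print(max_error < 1e-9)
```

Because each pass ends with a transpose, the second pass again reads contiguous rows, which is what makes the scheme friendly to buffered, streamed data movement.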
  • 25. 3D FFTs
      - Long stride trashes cache
      - Cell DMA allows prefetch
      - [Diagram labels: Single Element, Data envelope, Stride 1, Stride N², N]
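To see why the long strides hurt, it helps to count how many cache lines one 1D FFT touches along each axis. The sketch below assumes a cubic N×N×N array stored with x varying fastest and a hypothetical cache line holding 32 elements; neither detail is stated on the slide.

```python
def linear_index(x, y, z, n):
    """Offset of element (x, y, z) in an n*n*n array stored with x fastest."""
    return x + n * y + n * n * z


def pencil_offsets(axis, n):
    """Offsets read by one 1D FFT along the given axis, starting at the origin."""
    stride = {"x": 1, "y": n, "z": n * n}[axis]
    return [i * stride for i in range(n)]


def lines_touched(offsets, elements_per_line):
    """Number of distinct cache lines covered by a set of element offsets."""
    return len({offset // elements_per_line for offset in offsets})


n = 256
for axis in ("x", "y", "z"):
    count = lines_touched(pencil_offsets(axis, n), 32)
    print(f"axis {axis}: {count} cache lines for {n} elements")
```

With these assumed sizes the stride-1 axis uses every element of each line it loads, while the other two axes load a whole line per element, which is the access pattern an explicit DMA gather can prefetch instead.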
  • 26. Feeding Seismic Imaging to Cell
      - [Diagram labels: Data, Green’s Function, (X,Y)]
      - New G at each (x,y)
      - Radial symmetry of G reduces BW requirements
      - $\sum_{i,j} D(x+i,\, y+j)\, G(x, y, i, j)$
  • 27. Feeding Seismic Imaging to Cell
      - [Diagram labels: Data; SPE 0, SPE 1, SPE 2, SPE 3, SPE 4, SPE 5, SPE 6, SPE 7]
  • 28. Feeding Seismic Imaging to Cell
      - [Diagram labels: Data; SPE 0, SPE 1, SPE 2, SPE 3, SPE 4, SPE 5, SPE 6, SPE 7]
  • 29. Feeding Seismic Imaging to Cell
      - For each X
          - Load next column of data
          - Load next column of indices
          - For each Y
              - Load Green’s functions
              - SIMDize Green’s functions
              - Compute convolution at (X,Y)
          - Cycle buffers
      - [Diagram labels: Data buffer, Green’s Index buffer, (X,Y), H, 2R+1, R, 1, 2]
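The loop nest above can be sketched as a spatially varying convolution in which each output point uses its own Green's function. The code below is an interpretation, not the talk's implementation: it assumes the sum runs over a (2R+1)×(2R+1) window with offsets added to (x, y), that the "indices" select which Green's function applies at each point, and that radial symmetry lets each function be stored as a 1D table indexed by rounded radius. All names are hypothetical.

```python
import math


def radial_green(table, i, j):
    """Look up G by radius only: a 1D table replaces a full 2D stencil."""
    return table[round(math.hypot(i, j))]


def image_point(data, x, y, table, R):
    """Sum data times G over the (2R+1) x (2R+1) window centred on (x, y)."""
    total = 0.0
    for i in range(-R, R + 1):
        for j in range(-R, R + 1):
            xi, yj = x + i, y + j
            if 0 <= xi < len(data) and 0 <= yj < len(data[0]):
                total += data[xi][yj] * radial_green(table, i, j)
    return total


def image(data, index, tables, R):
    """index[x][y] picks which Green's function table applies at (x, y)."""
    out = []
    for x in range(len(data)):              # for each X: next column of data and indices
        column = []
        for y in range(len(data[0])):       # for each Y: load G, convolve at (X, Y)
            column.append(image_point(data, x, y, tables[index[x][y]], R))
        out.append(column)
    return out


data = [[1.0] * 3 for _ in range(3)]
index = [[0] * 3 for _ in range(3)]
tables = [[1.0, 0.5]]                       # G at radius 0 and radius 1
result = image(data, index, tables, R=1)
print(result[1][1], result[0][0])
```

Storing one value per radius rather than one per (i, j) offset is where the bandwidth saving comes from: a table of about R·√2 entries is fetched instead of (2R+1)² entries for every output point.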
  • 30. Feeding String Matching to Cell
      - Find (lots of) substrings in (long) string
      - Build graph of words & represent as DFA
      - Problem: Graph doesn’t fit in LS
      - Sample Word List: “the”, “that”, “math”
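One well-known way to build such a word graph is the Aho-Corasick automaton, shown here for the sample word list. The slides do not name the construction they used, so treat this as a stand-in: it keeps failure links instead of expanding every transition into a full DFA table, and the function names are illustrative.

```python
from collections import deque


def build_automaton(words):
    """Aho-Corasick: trie edges, failure links, and matched words per state."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())         # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start position, word) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits


automaton = build_automaton(["the", "that", "math"])
print(find_all("mathe that", automaton))
```

Every input character causes a dependent lookup into the state table, so once the table outgrows the local store each step becomes a main-memory access, which is the problem the slide points out.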
  • 31. Feeding String Matching to Cell
  • 32. Hiding Main Memory Latency
  • 33. Software Multithreading
  • 34. Feeding Neural Networks to Cell
      - Neural net function F(X)
          - RBF, MLP, KNN, etc.
      - If too big for LS, BW Bound
      - [Diagram: input X with D input dimensions; D×N matrix of parameters; N basis functions (dot product + nonlinearity); output F]
  • 35. Convert BW Bound to Compute Bound
      - Split function over multiple SPEs
      - Avoids unnecessary memory traffic
      - Reduce compute time per SPE
      - Minimal merge overhead
      - [Diagram label: Merge]
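Splitting the function across SPEs amounts to giving each one a slice of the N basis functions, so that its share of the parameter matrix can stay resident, and then merging the partial results. The sketch below assumes, purely for illustration, that the output F is the unweighted sum of the basis outputs and that the nonlinearity is tanh; the slide specifies neither, and all names are hypothetical.

```python
import math


def basis_outputs(params, x):
    """Each basis function: dot product of one parameter column with x, then tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(column, x))) for column in params]


def net(params, x):
    """F(X) under the assumption that the output sums the basis functions."""
    return sum(basis_outputs(params, x))


def split(params, n_workers):
    """Contiguous, nearly equal slices of the N basis functions."""
    n = len(params)
    bounds = [n * k // n_workers for k in range(n_workers + 1)]
    return [params[bounds[k]:bounds[k + 1]] for k in range(n_workers)]


def net_split(params, x, n_workers):
    """Each worker evaluates only its slice; the merge is a single sum."""
    partials = [net(part, x) for part in split(params, n_workers)]
    return sum(partials)


# N = 5 basis functions, D = 3 input dimensions (arbitrary illustrative values).
params = [[0.1, 0.2, 0.3], [0.4, -0.5, 0.6], [-0.7, 0.8, 0.9],
          [1.0, -1.1, 1.2], [0.3, 0.3, -0.3]]
x = [1.0, 0.5, -0.25]
print(abs(net(params, x) - net_split(params, x, 2)) < 1e-12)
```

The merge moves one number per worker, while each worker reuses its slice of the parameters for every input, which is why the partitioned version shifts the bottleneck from bandwidth to compute.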
  • 36. Moral of the Story: It’s All About the Data!
      - The data problem is growing: multicore
      - Intelligent software prefetching
          - Use DMA engines
          - Don’t rely on HW prefetching
      - Efficient data management
          - Multibuffering: Hide the latency!
          - BW utilization: Make every byte count!
          - SIMDization: Make every vector count!
          - Problem/data partitioning: Make every core work!
          - Software multithreading: Keep every core busy!
  • 37. Backup
  • 38. Abstract
      Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore processor path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor’s design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.