M. Gschwind. Software and System Co-Optimization in the Era of Heterogeneous Computing
21st Asia and South Pacific Design Automation Conference (ASP-DAC 2016), Macao, January 2016
© 2015 International Business Machines
Software and System Co-Optimization in the Era of Heterogeneous Computing
Dr. Michael Gschwind
IBM TJ Watson Research Center
Yorktown Heights, NY
Recent Power History
                        POWER5      POWER6      POWER7          POWER7+         POWER8
Year                    2004        2007        2010            2012
Technology              130nm SOI   65nm SOI    45nm SOI eDRAM  32nm SOI eDRAM  22nm SOI eDRAM
Compute: Cores          2           2           8               8               12
Compute: Threads        SMT2        SMT2        SMT4            SMT4            SMT8
Caching: On-chip        1.9MB       8MB         2 + 32MB        2 + 80MB        6 + 96MB
Caching: Off-chip       36MB        32MB        None            None            128MB
Bandwidth: Sust. Mem.   15GB/s      30GB/s      100GB/s         100GB/s         230GB/s
Bandwidth: Peak I/O     6GB/s       20GB/s      40GB/s          40GB/s          64GB/s
POWER8 Chip Overview
▪ Up to 2.5x socket perf vs. P7+
▪ 649mm² die size, 4.2B transistors
▪ 12 high-performance cores
▪ Large Caches
– L2: 512KB private SRAM per core
– L3: 96MB shared eDRAM w/ 8MB “fast access” partition per core
– L4: Up to 128MB, located on memory buffer chip
▪ 4 High Speed I/O interfaces
– Memory, On-Node SMP, Off-Node SMP, PCIe Gen3
[Figure: POWER8 die floorplan: four L3 quadrants, each flanked by cores with private L2 caches; memory controllers (MC) for Mem0-3 and Mem4-7; on-node and off-node SMP links; PCI; accelerators (Acc); fabric and pervasive logic]
POWER8 Technology
▪ 22nm SOI
▪ 15 layer BEOL: 5-1x, 2-2x, 3-4x, 3-8x, 2-UTM
▪ 3-Vt thin-oxide logic transistors for power optimization
▪ Multiple thick-oxide transistors (for I/O and analog support)
▪ 3 app-optimized SRAM cells:
– 0.160µm² 6T perf-oriented
– 0.144µm² 6T perf-density balance for directories / L2
– 0.192µm² 8T multi-port
▪ Technology eDRAM cell: 0.026µm²
[Figure: BEOL metal stack cross-section with layers labelled 5-1x, 2-2x, 3-4x, 3-8x, UTM]
Large Block Structured Synthesis
▪ Enhanced process which included:
– Structured dataflow
– Congestion-aware stdcell placement
– Embedded “hard” IP (e.g. arrays, regfiles, complex custom cells)
▪ 30% fewer unique blocks vs. POWER7
▪ Improvements in block power and total design area
– 15% area reduction
[Figure: synthesized block layouts labelled IFU and VSU]
POWER8 Core: Backbone of big data computing system
▪ Enhanced Micro-Architecture
▪ Increased Execution Bandwidth
▪ SMT 8
▪ Transactional Memory
▪ Vector/Scalar Unit
▪ High-performance Integer & FP Vector Processor
▪ Increased Performance for Data Rich Applications
[Figure: POWER8 core floorplan with units labelled VSU, FXU, IFU, DFU, ISU, LSU and PC]
Combined I/O Bandwidth = 7.6Tb/s
[Figure: POWER8 processors connected by on-node SMP and node-to-node links, each attached to memory buffers over DMI and to PCI. Caption: Big Bandwidth for Big Data]
Putting it all together: the memory links, on- and off-node SMP links, and PCIe add up to 7.6Tb/s of chip I/O bandwidth.
Big Data in a Connected World
Tectonic Shifts in Nature of Workloads
▪ Graph Analytics
– Security, Fraud Detection
– Genome Analysis
– Social Network Analytics
– Knowledge Graphs
▪ Machine Learning
– Watson Health
– Watson Analytics
– Robotics
– Education
▪ Video, Speech Analytics
– Multimodal Analytics: object recognition, complex video analytics, correlation and stitching
[Figure labels: Ingest, Learn, Predict; Understanding the World; Automating the World]
General-Purpose CPU Design
▪ Many competing requirements
– Branchy control-flow dominated code
– Code with unpredictable data access patterns
– Operating system code
– Multiple separate applications
– Multiple virtual machines at a time
▪ Result in low efficiency for any one metric
– Flops / area
– Integer ops / area
– Predictions / area
– …
[Figure: POWER8 core floorplan (VSU, FXU, IFU, DFU, ISU, LSU, PC) annotated with out-of-order execution, register renaming, branch prediction & prefetch, and robust virtual memory support, alongside a block diagram with decode, I$, RF, int, D$ and SIMD]
Heterogeneous, Workload-optimized Acceleration
▪ On-chip integrated accelerators (SoC design)
– Compute accelerator (Cell BE)
– Compression (P7+)
– Encryption (P7+)
– Random number generation (P7+)
– …
▪ SoC design offers highest integration, but…
– Requires new chip design for accelerator
– Long time to market
– Requires very high volumes
[Figure: Cell BE and POWER7+ chips; accelerator block diagram labelled decode, local store, MMU, SIMD]
CAPI – Coherent Accelerator Processor Interface
▪ Open infrastructure for off-chip, memory-coherent accelerators
– Modular interface
– Third-party high value-add components
▪ Standardized, layered protocol
– architectural interface
– functional protocol
– PCIe signaling protocol
▪ Create workload-optimized innovative solutions
– Faster time to market
– Lower bar to entry
– Variety of implementation options
• FPGAs, ASICs
[Figure: CAPI attach showing the POWER8 coherence bus, the proxy, and the PSL (Power Service Layer)]
Heterogeneous System Challenges
▪ The 4 ‘P’s of System Design
– Programmer Productivity
– Realize accelerator Performance benefits
– Portability: Investment protection for applications
– Partitioning for multi-user systems: processes, partitions
Application Acceleration
▪ Fine-grained data sharing
→ coherent, shared memory
▪ Accelerator-initiated data accesses/transfers
→ coherent, shared memory
▪ Pointer identity
→ shared addressing
▪ Flexible synchronization
→ symmetric, programmable interfaces
CAPI Acceleration overcomes Device Driver Deceleration
Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Interrupt Completion → Copy or Unpin Result Data → Return from DD Completion
– Stage costs shown: 300, 10,000, 3,000, 1,000 and 1,000 instructions
– Acceleration time: application dependent, but equal to below
– 7.9 µs + 4.9 µs: total ~13 µs for data prep
Flow with Coherent Model:
Shared Mem Notify Accelerator → Acceleration → Shared Memory Completion
– Stage costs shown: 400 and 100 instructions
– Acceleration time: application dependent, but equal to above
– 0.3 µs + 0.06 µs: total ~0.36 µs
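The two totals on this slide can be compared with a few lines of arithmetic. The sketch below uses only the microsecond figures quoted above; the helper `overhead_fraction` and the sample kernel durations are illustrative assumptions, not numbers from the talk.

```python
# Timings read off the slide, in microseconds.
IO_MODEL_US = 7.9 + 4.9    # device-driver flow, "~13 µs for data prep"
COHERENT_US = 0.3 + 0.06   # shared-memory flow, "~0.36 µs"


def overhead_fraction(kernel_us: float, overhead_us: float) -> float:
    """Share of end-to-end time spent on invocation overhead (hypothetical helper)."""
    return overhead_us / (kernel_us + overhead_us)


# How many times larger the driver-based overhead is than the coherent one.
overhead_ratio = IO_MODEL_US / COHERENT_US

if __name__ == "__main__":
    print(f"overhead ratio: {overhead_ratio:.1f}x")
    # Assumed kernel durations, chosen only to show the trend.
    for kernel_us in (1.0, 10.0, 100.0):
        io_share = overhead_fraction(kernel_us, IO_MODEL_US)
        coherent_share = overhead_fraction(kernel_us, COHERENT_US)
        print(f"kernel {kernel_us:6.1f} µs: I/O model {io_share:.1%}, coherent {coherent_share:.1%}")
```

The shorter the offloaded kernel, the more the fixed invocation cost dominates, which is why fine-grained acceleration benefits most from the coherent model.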
Power GPU acceleration
▪ CUDA programming environment supported under little-endian (LE) Linux
– GPU as compute accelerator
– Offload regular compute-intensive application portions to GPU
▪ Advances in GPU Performance and Programmability
– UVA – Unified Virtual Addressing
– UM – Unified Memory
▪ Ongoing collaboration to co-optimize systems
– Next generation hardware enhancement
Relating content via concept graphs
▪ @joe I wonder how do I use bitcoins
▪ Apple’s digital wallet, if widely adopted, could usher in a new era of ease and convenience.
▪ Icahn, who months ago called on eBay to spin off the lucrative online and mobile payment service, continues to believe that the payments field must be consolidated, either by PayPal buying up smaller rivals or by merging with another major player.
▪ Job ad: Lead Front-end Developer - Virtual Currency Exchange
Conceptual reasoning allows us to relate content that is hard to connect otherwise
Watson Concept Insights
Watson Concept Insights Workload Pipeline
▪ BASIC INGESTION (only once per life of document)
– Documents → constituency parse tree → Wikifier (graph linker)
▪ CONCEPTUAL INDEXING (currently once per life of document, maybe 3-5 times in future)
– Retrieve concept vectors from cache (assumes static graph!)
– Merge concept vectors to form document vector
– Reverse conceptual index (Cassandra)
▪ QUERY RUNTIME (hopefully millions of queries!)
– User interface → compute related concepts kernel → retrieve related documents
[Figure labels: external storage, one CPU socket (twice), CPU]
Watson Concept Insights: Compute Performance Comparison (CPU vs. GPU)
Inputs: N-element vector, M concepts, N*N sparse matrix (loaded only once)
– M: Concepts under consideration (28 for the test)
– N: Total number of concepts in corpus (4.7M for Wikipedia)
Current CPU Execution:
Page Rank Calculations, 5 iterations (55 sec) → Pareto Normalization (3 sec) → Scoring Combiner (1 sec)
Parallel Execution on GPU (batched execution with batch size of 64):
HOST → Init → Page Rank Calculations, 5 iterations (2.21 sec) → Pareto Normalization (0.032 sec) → Scoring Combiner (0.0048 sec) → HOST
[Further stage timings shown in the figure: 0.027 s and 0.016 s]
CPU: 58 sec vs. GPU: 2.35 sec (25X)
* CPU: Ivy Bridge; GPU: Nvidia K40
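As a rough illustration of the dominant stage, the sketch below runs a fixed five steps of power iteration over a sparse link structure. The slide only names "Page Rank Calculations, 5 Iterations" over an N*N sparse matrix; the restart-at-seed (personalized) formulation, the damping factor of 0.85, the adjacency-list representation, and the toy graph are all assumptions made for this example.

```python
from typing import Dict, List


def personalized_pagerank(
    out_links: Dict[str, List[str]],
    seed: str,
    iterations: int = 5,      # the slide states 5 iterations
    damping: float = 0.85,    # assumed value, not given in the talk
) -> Dict[str, float]:
    """Fixed-iteration PageRank restarting at one seed concept.

    Every link target must also appear as a key of `out_links`.
    """
    nodes = list(out_links)
    restart = {n: (1.0 if n == seed else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # Restart mass goes to the seed; the rest follows outgoing links.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Dangling node: hand its mass back to the seed.
                nxt[seed] += damping * rank[src]
        rank = nxt
    return rank


# Toy concept graph, invented for the example.
graph = {
    "bitcoin": ["currency", "payment"],
    "currency": ["payment"],
    "payment": ["bitcoin"],
    "wallet": ["payment"],
}
scores = personalized_pagerank(graph, "bitcoin")
```

Each seed concept produces an independent iteration over the same sparse structure, so a batch of seeds can be processed in parallel once the matrix is resident on the accelerator.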
Adverse Drug Reactions pose a serious challenge to the healthcare system
▪ Over 2 million serious adverse drug reactions (ADRs) yearly: 100,000 deaths
▪ $136 billion ADR associated cost yearly (> diabetic & cardiovascular care)
▪ Clinical Trials often do not reveal rare toxicity of some drugs, and they are not personalized
▪ 3–5% of in-hospital medication errors caused by unforeseen drug-drug interactions
Insight as a Service for Personalized and Detailed Adverse Drug Reactions Prediction
Leverage large amount of data for personalized prediction of nature, cause, and severity of adverse drug reactions
[Figure label: EMRs]
Personalized Medicine, Adverse Drug Reaction Prediction w/ ML
Ingest → Learn → Predict

Known Interactions of type 1 to …
Drug1       Drug2
Aspirin     Probenecid
Aspirin     Azilsartan

Drug1       Drug2
Aspirin     Gliclazide
Aspirin     Dicoumarol

Features: Chemical Similarity 1 to …
Drug1       Drug2       Sim
Salsalate   Aspirin     .9
Dicoumarol  Warfarin    .76

Drug1       Drug2       Sim
Salsalate   Aspirin     .7
Dicoumarol  Warfarin    .6

Candidate Interactions of type i
Drug1       Drug2       Best Sim1*Sim1   Best SimN*SimN
Salsalate   Gliclazide  .9*1             .7*1
Salsalate   Warfarin    .9*.76           .7*.6

Machine Learning Model

Interactions of type 1 Prediction
Drug1       Drug2       Prediction
Salsalate   Gliclazide  0.85
Salsalate   Warfarin    0.7
…
Interactions of type M Prediction
Drug1       Drug2       Prediction
Salsalate   Gliclazide  0.53
Salsalate   Warfarin    0.32

30X improvement in Learning performance
100s of TBs of data: 50 million patients, 2000 drugs, 2000 features
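The candidate-interaction features on this slide (for example .9*1 and .9*.76) can be reproduced with a short sketch. It assumes that each feature is the best product of two similarities linking a candidate pair to a known interacting pair, and that a drug has similarity 1 to itself; the function names `sim` and `best_feature` are hypothetical, and only the drug names and similarity values come from the slide.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Similarity = Dict[FrozenSet[str], float]
Pair = Tuple[str, str]


def sim(table: Similarity, a: str, b: str) -> float:
    """Symmetric similarity lookup; identical drugs score 1, unknown pairs 0."""
    if a == b:
        return 1.0
    return table.get(frozenset((a, b)), 0.0)


def best_feature(table: Similarity, known: Iterable[Pair], a: str, b: str) -> float:
    """Best sim(a, x) * sim(b, y) over known interacting pairs, in either orientation."""
    best = 0.0
    for x, y in known:
        best = max(
            best,
            sim(table, a, x) * sim(table, b, y),
            sim(table, a, y) * sim(table, b, x),
        )
    return best


# Values taken from the slide's example tables.
known_type_i = [("Aspirin", "Gliclazide"), ("Aspirin", "Dicoumarol")]
sim_1 = {
    frozenset(("Salsalate", "Aspirin")): 0.9,
    frozenset(("Dicoumarol", "Warfarin")): 0.76,
}
sim_n = {
    frozenset(("Salsalate", "Aspirin")): 0.7,
    frozenset(("Dicoumarol", "Warfarin")): 0.6,
}

f1_gliclazide = best_feature(sim_1, known_type_i, "Salsalate", "Gliclazide")  # slide: .9*1
f1_warfarin = best_feature(sim_1, known_type_i, "Salsalate", "Warfarin")      # slide: .9*.76
fn_gliclazide = best_feature(sim_n, known_type_i, "Salsalate", "Gliclazide")  # slide: .7*1
fn_warfarin = best_feature(sim_n, known_type_i, "Salsalate", "Warfarin")      # slide: .7*.6
```

One such feature per similarity measure forms the input row for a candidate pair, which the machine learning model then maps to a prediction per interaction type.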
Personalized Medicine – Adverse Drug Reaction Workload
Personalization will result in a massive increase in computational complexity
Real time prediction requirements for operational needs (< 1 minute for emergency situations)
• Computational pattern:
- Sparse cube to dense cube with patient as additional dimension
• Training:
- Number of patients above 50 Million
- Number of features around 1800
- Additional samples for training O(#patients)
- Number of cross-validation stages and #models per stage increases dramatically
- 100X increase in training complexity with ~100 TBs of Data
• Prediction:
- Input Model (#features) and dataset (# patients in the hospital)
- 1800 features and 500,000 patients
- Real time
Programming Heterogeneous Systems
OpenCL?
SystemC?
VHDL?
C++?
Java?
CUDA?
Portability and Optimization in Heterogeneous Systems
[Figure: layered software stack]
▪ Applications
▪ Cognitive Middleware
▪ Library Layer
▪ CPU enablement, GPU enablement, FPGA interface & configuration, Accelerator X enablement
▪ Accelerator X
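One way to picture the library layer in this stack is as a single API that dispatches to whichever enablement backend is present. The sketch below is an illustration only: `LibraryLayer`, its methods, the backend names, and the dot-product kernel are hypothetical, and the "GPU" backend is a plain Python stand-in rather than real accelerator code.

```python
from typing import Callable, Dict, List, Sequence

DotKernel = Callable[[Sequence[float], Sequence[float]], float]


class LibraryLayer:
    """Hypothetical library layer: one API, several enablement backends."""

    def __init__(self, preference: List[str]) -> None:
        self._preference = preference            # most preferred backend first
        self._backends: Dict[str, DotKernel] = {}

    def enable(self, name: str, kernel: DotKernel) -> None:
        """Register the kernel provided by one enablement component."""
        self._backends[name] = kernel

    def selected_backend(self) -> str:
        """Return the most preferred backend that has been enabled."""
        for name in self._preference:
            if name in self._backends:
                return name
        raise RuntimeError("no backend enabled")

    def dot(self, a: Sequence[float], b: Sequence[float]) -> float:
        """Application-facing call; the caller never names the hardware."""
        return self._backends[self.selected_backend()](a, b)


def cpu_dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


library = LibraryLayer(preference=["accelerator_x", "fpga", "gpu", "cpu"])
library.enable("cpu", cpu_dot)
result = library.dot([1.0, 2.0], [3.0, 4.0])
```

Application code written against `library.dot` stays unchanged when a new backend is enabled, which is the portability argument the stack makes.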
Accelerate Processing in a Connected World
▪ Enable Compute-Intensive Cognitive Workloads
▪ Exploit Best-of-Breed Accelerators
▪ Provide Abstraction of Hardware Function
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 

Gschwind - Software and System Co-Optimization in the Era of Heterogeneous Computing

  • 1. M. Gschwind. Software and System Co-Optimization in the Era of Heterogeneous Computing. 21st Asia and South Pacific Design Automation Conference (ASP-DAC 2016), Macao, January 2016. © 2015 International Business Machines
      Software and System Co-Optimization in the Era of Heterogeneous Computing
      Dr. Michael Gschwind, IBM TJ Watson Research Center, Yorktown Heights, NY
  • 2. Recent Power History
                        POWER5     POWER6     POWER7          POWER7+         POWER8
                        (2004)     (2007)     (2010)          (2012)
      Technology        130nm SOI  65nm SOI   45nm SOI eDRAM  32nm SOI eDRAM  22nm SOI eDRAM
      Compute: cores    2          2          8               8               12
      Compute: threads  SMT2       SMT2       SMT4            SMT4            SMT8
      Caching: on-chip  1.9MB      8MB        2 + 32MB        2 + 80MB        6 + 96MB
      Caching: off-chip 36MB       32MB       None            None            128MB
      BW: sust. mem.    15GB/s     30GB/s     100GB/s         100GB/s         230GB/s
      BW: peak I/O      6GB/s      20GB/s     40GB/s          40GB/s          64GB/s
  • 3. POWER8 Chip Overview
      ▪ Up to 2.5x socket performance vs. P7+
      ▪ 649mm² die size, 4.2B transistors
      ▪ 12 high-performance cores
      ▪ Large caches
        – L2: 512KB private SRAM per core
        – L3: 96MB shared eDRAM with an 8MB “fast access” partition per core
        – L4: up to 128MB, located on the memory buffer chip
      ▪ 4 high-speed I/O interfaces
        – Memory, on-node SMP, off-node SMP, PCIe Gen3
      [Die floorplan figure: four L3 quadrants, each with three cores and their L2 caches; memory controllers (MC) for Mem0-3 and Mem4-7; on-node SMP, off-node SMP, and PCI interfaces; Acc; fabric and pervasive logic]
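The per-core cache figures on this slide can be cross-checked against the chip-level totals with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are made up here, and it assumes every one of the 12 cores contributes one 512KB L2 and one 8MB L3 partition.

```python
# Figures quoted on the POWER8 chip overview slide.
CORES = 12
L2_PER_CORE_KB = 512        # private SRAM L2 per core
L3_PARTITION_MB = 8         # "fast access" eDRAM L3 partition per core


def total_l2_mb(cores=CORES, l2_kb=L2_PER_CORE_KB):
    """Aggregate L2 capacity in MB, assuming one private L2 per core."""
    return cores * l2_kb / 1024


def total_l3_mb(cores=CORES, partition_mb=L3_PARTITION_MB):
    """Aggregate L3 capacity in MB, assuming one partition per core."""
    return cores * partition_mb


if __name__ == "__main__":
    print(f"L2 total: {total_l2_mb()} MB, L3 total: {total_l3_mb()} MB")
```

Under those assumptions the totals come to 6MB of L2 and 96MB of L3, which agrees with the 96MB shared L3 stated on the slide.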
  • 4. POWER8 Technology
      ▪ 22nm SOI
      ▪ 15-layer BEOL: 5-1x, 2-2x, 3-4x, 3-8x, 2-UTM
      ▪ 3-Vt thin-oxide logic transistors for power optimization
      ▪ Multiple thick-oxide transistors (for I/O and analog support)
      ▪ 3 application-optimized SRAM cells:
        – 0.160µm² 6T, performance-oriented
        – 0.144µm² 6T, performance-density balance for directories / L2
        – 0.192µm² 8T, multi-port
      ▪ Technology eDRAM cell: 0.026µm²
      [Figure: metal stack cross-section with layers labeled 5-1x, 2-2x, 3-4x, 3-8x, UTM]
  • 5. Large Block Structured Synthesis
      ▪ Enhanced process which included:
        – Structured dataflow
        – Congestion-aware stdcell placement
        – Embedded “hard” IP (e.g. arrays, regfiles, complex custom cells)
      ▪ 30% fewer unique blocks vs. POWER7
      ▪ Improvements in block power and total design area
        – 15% area reduction
      [Figure: IFU and VSU block layouts]
  • 6. POWER8 Core: Backbone of big data computing systems
      ▪ Enhanced micro-architecture
      ▪ Increased execution bandwidth
      ▪ SMT8
      ▪ Transactional memory
      ▪ Vector/scalar unit
      ▪ High-performance integer & FP vector processor
      ▪ Increased performance for data-rich applications
      [Core floorplan figure: VSU, FXU, IFU, DFU, ISU, PC, PC, LSU]
  • 7. Big Bandwidth for Big Data
      ▪ Combined I/O bandwidth = 7.6Tb/s
      ▪ Putting it all together: the memory links, the on- and off-node SMP links, and PCIe add up to 7.6Tb/s of chip I/O bandwidth
      [System figure: POWER8 processors connected by on-node SMP and node-to-node links; each processor attaches to memory buffers over DMI links and to PCI]
  • 8. Big Data in a Connected World
  • 9. Tectonic Shifts in Nature of Workloads
      [Figure labels: Graph Analytics; Security, Fraud Detection; Genome Analysis; Social Network Analytics; Knowledge Graphs; Machine Learning; Watson Health; Watson Analytics; Robotics; Education; Video, Speech Analytics; Multimodal Analytics (object recognition, complex video analytics, correlation and stitching); Ingest, Learn, Predict; Understanding the World; Automating the World]
  • 10. General-Purpose CPU Design
      ▪ Many competing requirements
        – Branchy control-flow dominated code
        – Code with unpredictable data access patterns
        – Operating system code
        – Multiple separate applications
        – Multiple virtual machines at a time
      ▪ Result in low efficiency for any one metric
        – Flops / area
        – Integer ops / area
        – Predictions / area
        – …
      [Figure labels: core floorplan (VSU, FXU, IFU, DFU, ISU, PC, PC, LSU); out-of-order execution; register renaming; branch prediction & prefetch; robust virtual memory support; decode, I$, RF, int, D$, SIMD]
  • 11. Heterogeneous, Workload-optimized Acceleration
      ▪ On-chip integrated accelerators (SoC design)
        – Compute accelerator (Cell BE)
        – Compression (P7+)
        – Encryption (P7+)
        – Random number generation (P7+)
        – …
      ▪ SoC design offers highest integration, but…
        – Requires new chip design for accelerator
        – Long time to market
        – Requires very high volumes
      [Figures: Cell BE and POWER7+ block diagrams; legible labels include decode, local store, MMU, SIMD]
  • 12. CAPI – Coherent Accelerator Processor Interface
      ▪ Open infrastructure for off-chip, memory-coherent accelerators
        – Modular interface
        – Third-party high value-add components
      ▪ Standardized, layered protocol
        – Architectural interface
        – Functional protocol
        – PCIe signaling protocol
      ▪ Create workload-optimized innovative solutions
        – Faster time to market
        – Lower bar to entry
        – Variety of implementation options
          • FPGAs, ASICs
      [Figure labels: POWER8; coherence bus; proxy; PSL* (* Power Service Layer)]
  • 13. Heterogeneous System Challenges
      ▪ The 4 ‘P’s of System Design
      ▪ Programmer Productivity
      ▪ Realize accelerator Performance benefits
      ▪ Portability: investment protection for applications
      ▪ Partitioning for multi-user systems: processes, partitions
  • 14. Application Acceleration
      ▪ Fine-grained data sharing → coherent, shared memory
      ▪ Accelerator-initiated data accesses/transfers → coherent, shared memory
      ▪ Pointer identity → shared addressing
      ▪ Flexible synchronization → symmetric, programmable interfaces
  • 15. CAPI Acceleration overcomes Device Driver Deceleration
      ▪ Typical I/O model flow
        – Steps: DD call; copy or pin source data; MMIO notify accelerator; acceleration; poll / interrupt completion; copy or unpin result data; return from DD completion
        – Instruction counts shown: 300, 10,000, 3,000, 1,000, and 1,000 instructions; acceleration itself is application dependent, but equal to the coherent flow below
        – Latency: 7.9 µs and 4.9 µs, total ~13 µs for data prep
      ▪ Flow with coherent model
        – Steps: shared memory notify accelerator; acceleration; shared memory completion
        – Instruction counts shown: 400 and 100 instructions; acceleration itself is application dependent, but equal to the flow above
        – Latency: 0.3 µs and 0.06 µs, total ~0.36 µs
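The overhead gap between the two flows can be totaled directly from the slide's figures. The following is a minimal sketch, not a measurement: the list and function names are invented for illustration, and the acceleration step is left out because the slide describes it as application dependent and equal in both flows.

```python
# Software overhead figures as listed on the slide (acceleration excluded,
# since the slide states it is the same in both flows).
DRIVER_FLOW_INSTRUCTIONS = [300, 10_000, 3_000, 1_000, 1_000]
COHERENT_FLOW_INSTRUCTIONS = [400, 100]

DRIVER_FLOW_US = [7.9, 4.9]      # the slide rounds the sum to ~13 µs
COHERENT_FLOW_US = [0.3, 0.06]   # the slide gives ~0.36 µs


def overhead_ratio(baseline, improved):
    """Ratio of total baseline overhead to total improved overhead."""
    return sum(baseline) / sum(improved)


instr_ratio = overhead_ratio(DRIVER_FLOW_INSTRUCTIONS, COHERENT_FLOW_INSTRUCTIONS)
time_ratio = overhead_ratio(DRIVER_FLOW_US, COHERENT_FLOW_US)

if __name__ == "__main__":
    print(f"instructions: {sum(DRIVER_FLOW_INSTRUCTIONS)} vs "
          f"{sum(COHERENT_FLOW_INSTRUCTIONS)} ({instr_ratio:.1f}x)")
    print(f"latency: {sum(DRIVER_FLOW_US):.1f} µs vs "
          f"{sum(COHERENT_FLOW_US):.2f} µs ({time_ratio:.1f}x)")
```

With these inputs the device-driver path costs 15,300 instructions against 500 for the coherent path, roughly a 30x reduction, and the latency figures give a ratio of about 35x.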
  • 16. Power GPU acceleration
      ▪ CUDA programming environment supported under LE Linux
        – GPU as compute accelerator
        – Offload regular compute-intensive application portions to GPU
      ▪ Advances in GPU performance and programmability
        – UVA – Unified Virtual Addressing
        – UM – Unified Memory
      ▪ Ongoing collaboration to co-optimize systems
        – Next-generation hardware enhancements
  • 17. Relating content via concept graphs
      – @joe I wonder how do I use bitcoins
      – Apple’s digital wallet, if widely adopted, could usher in a new era of ease and convenience.
      – Icahn, who months ago called on eBay to spin off the lucrative online and mobile payment service, continues to believe that the payments field must be consolidated, either by PayPal buying up smaller rivals or by merging with another major player.
      – Job ad: Lead Front-end Developer - Virtual Currency Exchange
      Conceptual reasoning allows us to relate content that is hard to connect otherwise
      Watson Concept Insights
  • 18. Watson Concept Insights Workload Pipeline
      ▪ Phases
        – BASIC INGESTION (only once per life of document)
        – CONCEPTUAL INDEXING (currently once per life of document, maybe 3-5 times in future)
        – QUERY RUNTIME (hopefully millions of queries!!!)
      ▪ Steps shown
        – Constituency parse tree
        – Wikifier (graph linker)
        – Retrieve concept vectors from cache (assumes static graph!!!)
        – Merge concept vectors to form document vector
        – Reverse conceptual index (Cassandra)
        – Compute related concepts kernel
        – Retrieve related documents
      [Figure labels: documents; external storage; one CPU socket (twice); CPU; user interface]
  • 19. Watson Concept Insights: Compute Performance Comparison (CPU vs. GPU)
      ▪ Stages: Init; Page Rank Calculations (5 iterations); Pareto Normalization; Scoring; Combiner
      ▪ Current CPU execution (host: Ivy Bridge): stage times 55 sec, 3 sec, 1 sec
      ▪ Parallel execution on GPU (Nvidia K40), batched execution with batch size of 64: stage times 0.027 s, 2.21 sec, 0.032 sec, 0.0048 sec, 0.016 s
      ▪ CPU: 58 sec vs. GPU: 2.35 sec (25X)
      ▪ M: concepts under consideration (28 for the test)
      ▪ N: total number of concepts in corpus (4.7M for Wikipedia)
      ▪ Inputs: N-element vector; N*N sparse matrix, loaded only once
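The 25X figure follows from the headline totals on the slide. The sketch below recomputes it and, for comparison, also sums the itemized stage times; the itemized sums (59 s and about 2.29 s) differ slightly from the headline totals, and the slide does not say why, so treat the second ratio as a rough cross-check only. All names are illustrative.

```python
# Headline totals quoted on the slide.
CPU_TOTAL_S = 58.0
GPU_TOTAL_S = 2.35

# Itemized stage times as they appear on the slide.
CPU_STAGE_S = [55.0, 3.0, 1.0]
GPU_STAGE_S = [0.027, 2.21, 0.032, 0.0048, 0.016]


def speedup(baseline_s, accelerated_s):
    """Speedup of the accelerated run relative to the baseline."""
    return baseline_s / accelerated_s


headline = speedup(CPU_TOTAL_S, GPU_TOTAL_S)
itemized = speedup(sum(CPU_STAGE_S), sum(GPU_STAGE_S))

if __name__ == "__main__":
    print(f"headline speedup: {headline:.1f}x")
    print(f"itemized speedup: {itemized:.1f}x")
```

Both ratios land in the neighborhood of the 25X quoted on the slide.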
  • 21. Adverse Drug Reactions pose a serious challenge to the healthcare system
      ▪ Over 2 million serious adverse drug reactions (ADRs) yearly: 100,000 deaths
      ▪ $136 billion ADR-associated cost yearly (> diabetic & cardiovascular care)
      ▪ Clinical trials often do not reveal rare toxicity of some drugs, and they are not personalized
      ▪ 3–5% of in-hospital medication errors caused by unforeseen drug-drug interactions
  • 22. Insight as a Service for Personalized and Detailed Adverse Drug Reactions Prediction
      ▪ Leverage large amounts of data for personalized prediction of nature, cause, and severity of adverse drug reactions
      [Figure label: EMRs]
  • 23. Personalized Medicine, Adverse Drug Reaction Prediction w/ ML (Ingest, Learn, Predict)
      ▪ Known interactions of type 1 to … (Drug1, Drug2)
        – Aspirin, Probenecid
        – Aspirin, Azilsartan
        – Aspirin, Gliclazide
        – Aspirin, Dicoumarol
      ▪ Features: chemical similarity 1 to … (Drug1, Drug2, Sim)
        – Salsalate, Aspirin: .9 in one table, .7 in another
        – Dicoumarol, Warfarin: .76 in one table, .6 in another
      ▪ Candidate interactions of type i (Drug1, Drug2, Best Sim1*Sim1, …, Best SimN*SimN)
        – Salsalate, Gliclazide: .9*1, .7*1
        – Salsalate, Warfarin: .9*.76, .7*.6
      ▪ Machine learning model predictions (Drug1, Drug2, Prediction)
        – Interactions of type 1: Salsalate, Gliclazide 0.85; Salsalate, Warfarin 0.7
        – …
        – Interactions of type M: Salsalate, Gliclazide 0.53; Salsalate, Warfarin 0.32
      ▪ 30X improvement in learning performance
      ▪ 100s of TBs of data; 50 million patients, 2000 drugs; 2000 features
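The "Best Sim*Sim" columns on this slide suggest a similarity-based feature: a candidate pair is scored by the most similar known interacting pair, multiplying the two drug-to-drug similarities. The sketch below is one possible reading of that table, not the authors' implementation. The function names are hypothetical, a drug is assumed to have similarity 1 to itself, and missing similarities are assumed to be 0.

```python
def make_sim(table):
    """Build a symmetric similarity lookup from a {(drug_a, drug_b): score} table.

    Assumptions: sim(d, d) == 1.0 and unknown pairs score 0.0.
    """
    def sim(a, b):
        if a == b:
            return 1.0
        return table.get((a, b), table.get((b, a), 0.0))
    return sim


def best_pair_feature(candidate, known_interactions, sim):
    """Best product of similarities between a candidate pair and any known pair."""
    a, b = candidate
    best = 0.0
    for x, y in known_interactions:
        # Try both orientations of the known interacting pair.
        score = max(sim(a, x) * sim(b, y), sim(a, y) * sim(b, x))
        best = max(best, score)
    return best


# Values taken from the slide's example tables.
known = [("Aspirin", "Gliclazide"), ("Aspirin", "Dicoumarol")]
sim_a = make_sim({("Salsalate", "Aspirin"): 0.9, ("Dicoumarol", "Warfarin"): 0.76})
sim_b = make_sim({("Salsalate", "Aspirin"): 0.7, ("Dicoumarol", "Warfarin"): 0.6})

if __name__ == "__main__":
    for pair in [("Salsalate", "Gliclazide"), ("Salsalate", "Warfarin")]:
        print(pair, best_pair_feature(pair, known, sim_a),
              best_pair_feature(pair, known, sim_b))
```

For Salsalate and Gliclazide this reproduces the slide's .9*1 and .7*1 entries, and for Salsalate and Warfarin it reproduces .9*.76 and .7*.6. Each similarity measure yields one such feature per candidate pair, which the slide then feeds to the machine learning model.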
  • 24. Personalized Medicine – Adverse Drug Reaction Workload
      ▪ Personalization will result in a massive increase in computational complexity
      ▪ Real-time prediction requirements for operational needs (< 1 minute for emergency situations)
      ▪ Computational pattern:
        – Sparse cube to dense cube with patient as additional dimension
      ▪ Training:
        – Number of patients above 50 million
        – Number of features around 1800
        – Additional samples for training O(#patients)
        – Number of cross-validation stages and #models per stage increases dramatically
        – 100X increase in training complexity with ~100 TBs of data
      ▪ Prediction:
        – Input: model (#features) and dataset (#patients in the hospital)
        – 1800 features and 500,000 patients
        – Real time
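To get a feel for the scale these numbers imply, the sketch below counts the cells of a dense patient-by-feature matrix for training and prediction, and the throughput needed to touch every prediction cell once inside the one-minute limit. It is a back-of-the-envelope illustration only: the names are invented, 50 million is used although the slide says "above 50 million", and a single pass over the data is an assumption, not something the slide states.

```python
# Workload sizes quoted on the slide.
TRAIN_PATIENTS = 50_000_000     # slide: "above 50 million"
PREDICT_PATIENTS = 500_000
FEATURES = 1_800                # slide: "around 1800"
DEADLINE_S = 60                 # slide: "< 1 minute for emergency situations"


def dense_cells(patients, features=FEATURES):
    """Number of entries in a dense patients x features matrix."""
    return patients * features


def required_cells_per_second(patients, deadline_s=DEADLINE_S):
    """Throughput needed to process every cell once within the deadline."""
    return dense_cells(patients) / deadline_s


if __name__ == "__main__":
    print(f"training cells:   {dense_cells(TRAIN_PATIENTS):.1e}")
    print(f"prediction cells: {dense_cells(PREDICT_PATIENTS):.1e}")
    print(f"prediction rate:  {required_cells_per_second(PREDICT_PATIENTS):.1e} cells/s")
```

Under these assumptions training touches about 9e10 cells and prediction about 9e8, so the real-time case needs at least 1.5e7 cells per second before any per-cell model cost is counted.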
  • 25. Programming Heterogeneous Systems
      OpenCL? SystemC? VHDL? C++? Java? CUDA?
  • 26. Portability and Optimization in Heterogeneous Systems
      [Layer stack figure, top to bottom: Application (three shown); Cognitive Middleware; Library Layer; CPU enablement, GPU enablement, FPGA interface & configuration, Accelerator X enablement; Accelerator X]
  • 27. Accelerate Processing in a Connected World
      ▪ Enable compute-intensive cognitive workloads
      ▪ Exploit best-of-breed accelerators
      ▪ Provide abstraction of hardware function