Gschwind - Software and System Co-Optimization in the Era of Heterogeneous Computing

M. Gschwind. Software and System Co-Optimization in the Era of Heterogeneous Computing
21st Asia and South Pacific Design Automation Conference (ASP-DAC 2016), Macao, January 2016
© 2015 International Business Machines
Software and System Co-Optimization in the Era of Heterogeneous Computing
Dr. Michael Gschwind
IBM TJ Watson Research Center
Yorktown Heights, NY

2
Recent Power History
                POWER5      POWER6      POWER7            POWER7+           POWER8
                2004        2007        2010              2012
Technology      130nm SOI   65nm SOI    45nm SOI eDRAM    32nm SOI eDRAM    22nm SOI eDRAM
Compute
  Cores         2           2           8                 8                 12
  Threads       SMT2        SMT2        SMT4              SMT4              SMT8
Caching
  On-chip       1.9MB       8MB         2 + 32MB          2 + 80MB          6 + 96MB
  Off-chip      36MB        32MB        None              None              128MB
Bandwidth
  Sust. Mem.    15GB/s      30GB/s      100GB/s           100GB/s           230GB/s
  Peak I/O      6GB/s       20GB/s      40GB/s            40GB/s            64GB/s
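As a rough cross-check of the history above, hardware threads per chip follow from cores times SMT ways. The sketch below is illustrative only: the names `chips`, `threads`, and `mem_bw_per_thread` are made up for this example, and bandwidth per thread is a derived figure, not one quoted in the talk.

```python
# (cores, SMT ways, sustained memory bandwidth in GB/s) as listed in the history
chips = {
    "POWER5": (2, 2, 15),
    "POWER6": (2, 2, 30),
    "POWER7": (8, 4, 100),
    "POWER7+": (8, 4, 100),
    "POWER8": (12, 8, 230),
}

# Hardware threads per chip = cores * SMT ways
threads = {name: cores * smt for name, (cores, smt, _) in chips.items()}

# Derived metric (an assumption of this sketch): sustained memory GB/s per thread
mem_bw_per_thread = {
    name: bw / (cores * smt) for name, (cores, smt, bw) in chips.items()
}

for name in chips:
    print(f"{name}: {threads[name]} threads, "
          f"{mem_bw_per_thread[name]:.2f} GB/s per thread")
```

The thread count grows much faster than sustained memory bandwidth across generations, which is one way to read the emphasis on caches and I/O in the slides that follow.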

3
POWER8 Chip Overview
▪ Up to 2.5x socket perf vs. P7+
▪ 649mm² die size, 4.2B transistors
▪ 12 high-performance cores
▪ Large Caches
– L2: 512KB private SRAM per core
– L3: 96MB shared eDRAM w/ 8MB “fast access” partition per core
– L4: Up to 128MB, located on memory buffer chip
▪ 4 High Speed I/O interfaces
– Memory, On-Node SMP, Off-Node SMP, PCIe Gen3
[Figure: POWER8 chip floorplan. Twelve cores, each with a private L2, grouped around four L3 quadrants; two memory controllers (MC) serving Mem0-3 and Mem4-7; on-node SMP, off-node SMP, PCI, and accelerator (Acc) interfaces; fabric and pervasive logic]
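The per-core cache figures on this slide can be checked against the chip totals with simple arithmetic. This is a minimal sketch; the variable names are chosen for illustration.

```python
cores = 12
l2_per_core_kb = 512          # private SRAM L2 per core
l3_partition_per_core_mb = 8  # "fast access" eDRAM L3 partition per core

l2_total_mb = cores * l2_per_core_kb / 1024
l3_total_mb = cores * l3_partition_per_core_mb

print(f"L2 total: {l2_total_mb:.0f}MB, L3 total: {l3_total_mb}MB")
```

The totals agree with the 6 + 96MB on-chip caching listed for POWER8 in the history slide.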

4
POWER8 Technology
▪ 22nm SOI
▪ 15 layer BEOL:
5-1x, 2-2x, 3-4x, 3-8x, 2-UTM
▪ 3-Vt thin-oxide logic transistors for power optimization
▪ Multiple thick-oxide transistors (for I/O and analog support)
▪ 3 app-optimized SRAM cells:
– 0.160µm² 6T perf-oriented
– 0.144µm² 6T perf-density balance for directories / L2
– 0.192µm² 8T multi-port
▪ Technology eDRAM cell: 0.026µm²
[Figure: BEOL metal stack cross-sections with the 5-1x, 2-2x, 3-4x, 3-8x, and UTM layers labeled]

5
Large Block Structured Synthesis
▪ Enhanced process which included:
– Structured dataflow
– Congestion-aware stdcell placement
– Embedded “hard” IP (e.g. arrays, regfiles, complex custom cells)
▪ 30% fewer unique blocks vs. POWER7
▪ Improvements in block power and total design area
– 15% area reduction
[Figure: layout examples labeled IFU and VSU]

6
POWER8 Core: Backbone of big data computing systems
▪ Enhanced Micro-Architecture
▪ Increased Execution Bandwidth
▪ SMT 8
▪ Transactional Memory
▪ Vector/Scalar Unit
▪ High-performance Integer & FP Vector Processor
▪ Increased Performance for Data Rich Applications
[Figure: POWER8 core floorplan with the VSU, FXU, IFU, DFU, ISU, LSU, and PC units labeled]

7
Combined I/O Bandwidth = 7.6Tb/s
[Figure: POWER8 processors connected to memory buffers over DMI links, to each other over on-node SMP links, to other nodes over node-to-node links, and to I/O over PCI]
Big Bandwidth for Big Data
Putting it all together: the memory links, the on- and off-node SMP links, and PCIe add up to 7.6Tb/s of chip I/O bandwidth

8
Big Data in a Connected World

9
Tectonic Shifts in Nature of Workloads
▪ Graph Analytics
– Security, Fraud Detection
– Genome Analysis
– Social Network Analytics
– Knowledge Graphs
▪ Machine Learning
– Watson Health
– Watson Analytics
– Robotics
– Education
▪ Video, Speech Analytics
– Multimodal Analytics
– Object recognition
– Complex video analytics
– Correlation and stitching
▪ Ingest, Learn, Predict
▪ Understanding the World, Automating the World

10
General-Purpose CPU Design
▪ Many competing requirements
– Branchy control-flow dominated code
– Code with unpredictable data access patterns
– Operating system code
– Multiple separate applications
– Multiple virtual machines at a time
▪ Result in low efficiency for any one metric
– Flops / area
– Integer ops / area
– Predictions / area
– …
[Figure: core floorplan (VSU, FXU, IFU, DFU, ISU, LSU, PC) annotated with out-of-order execution, register renaming, branch prediction & prefetch, and robust virtual memory support; further block labels: decode, I$, RF, int, D$, SIMD]

11
Heterogeneous, Workload-optimized Acceleration
▪ On-chip integrated accelerators (SoC design)
– Compute accelerator (Cell BE)
– Compression (P7+)
– Encryption (P7+)
– Random number generation (P7+)
– …
▪ SoC design offers highest integration, but…
– Requires new chip design for accelerator
– Long time to market
– Requires very high volumes
[Figure: Cell BE and POWER7+ chips; accelerator block diagram with decode, local store, MMU, and SIMD labels]

12
CAPI – Coherent Accelerator Processor Interface
▪ Open infrastructure for off-chip, memory-coherent accelerators
– Modular interface
– Third-party high value-add components
▪ Standardized, layered protocol
– architectural interface
– functional protocol
– PCIe signaling protocol
▪ Create workload-optimized innovative solutions
– Faster time to market
– Lower bar to entry
– Variety of implementation options
• FPGAs, ASICs
[Figure: POWER8 coherence bus, proxy, and PSL*]
* Power Service Layer

13
Heterogeneous System Challenges
▪ The 4 ‘P’s of System Design
– Programmer Productivity
– Realize accelerator Performance benefits
– Portability: Investment protection for applications
– Partitioning for multi-user systems: processes, partitions

14
Application Acceleration
▪ Fine-grained data sharing
→ coherent, shared memory
▪ Accelerator-initiated data accesses/transfers
→ coherent, shared memory
▪ Pointer identity
→ shared addressing
▪ Flexible synchronization
→ symmetric, programmable interfaces

15
CAPI Acceleration overcomes Device Driver Deceleration
Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Interrupt Completion → Copy or Unpin Result Data → Return from DD Completion
– Instruction counts shown for the driver steps: 300, 10,000, 3,000, 1,000, and 1,000 instructions
– Acceleration: application dependent, but equal to below
– 7.9 µs + 4.9 µs: total ~13 µs for data prep
Flow with Coherent Model:
Shared Mem Notify Accelerator → Acceleration → Shared Memory Completion
– Instruction counts shown: 400 and 100 instructions
– Acceleration: application dependent, but equal to above
– 0.3 µs + 0.06 µs: total ~0.36 µs
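The overhead figures on this slide can be compared directly. The sketch below only adds up the numbers quoted above and takes their ratio; the variable names are illustrative, and the resulting ratios are derived here rather than stated in the talk.

```python
# Overhead figures quoted on the slide
io_model_us = [7.9, 4.9]                        # device-driver flow, microseconds
coherent_us = [0.3, 0.06]                       # coherent flow, microseconds
io_model_instr = [300, 10_000, 3_000, 1_000, 1_000]
coherent_instr = [400, 100]

io_total_us = sum(io_model_us)                  # ~12.8, shown as ~13 on the slide
coherent_total_us = sum(coherent_us)            # ~0.36
latency_ratio = io_total_us / coherent_total_us

io_total_instr = sum(io_model_instr)
coherent_total_instr = sum(coherent_instr)
instr_ratio = io_total_instr / coherent_total_instr

print(f"overhead: {io_total_us:.1f} us vs {coherent_total_us:.2f} us "
      f"({latency_ratio:.1f}x)")
print(f"instructions: {io_total_instr} vs {coherent_total_instr} "
      f"({instr_ratio:.1f}x)")
```

Both ratios exclude the acceleration time itself, which the slide treats as equal in the two flows.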

16
Power GPU acceleration
▪ CUDA programming environment supported under little-endian (LE) Linux
– GPU as compute accelerator
– Offload regular compute-intensive application portions to GPU
▪ Advances in GPU Performance and Programmability
– UVA – Unified Virtual Addressing
– UM – Unified Memory
▪ Ongoing collaboration to co-optimize systems
– Next generation hardware enhancement
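Why a shared or unified address space matters for programmability can be shown with a toy model. The sketch below is a plain Python simulation, not CUDA or CAPI code: a `bytearray` stands in for memory, integer offsets stand in for pointers, and `build_list`, `accel_sum`, and `rebase` are hypothetical helpers invented for this illustration. With a shared address space the "accelerator" follows the host's pointers as they are; with a copy-based model every embedded pointer has to be translated first.

```python
import struct

# One list node: (value, address of next node); -1 marks the end of the list
NODE = struct.Struct("<qq")

def build_list(mem, base, values):
    """Host side: lay out a linked list in mem starting at address base."""
    addr = base
    for i, value in enumerate(values):
        nxt = addr + NODE.size if i + 1 < len(values) else -1
        NODE.pack_into(mem, addr, value, nxt)
        addr += NODE.size
    return base  # head pointer

def accel_sum(mem, head):
    """'Accelerator' side: chase pointers through whatever memory it sees."""
    total, addr = 0, head
    while addr != -1:
        value, addr = NODE.unpack_from(mem, addr)
        total += value
    return total

def rebase(mem, count, delta):
    """Copy model only: rewrite every embedded pointer for the new base."""
    for i in range(count):
        value, nxt = NODE.unpack_from(mem, i * NODE.size)
        if nxt != -1:
            NODE.pack_into(mem, i * NODE.size, value, nxt - delta)

values = [3, 5, 7]
host_mem = bytearray(256)
head = build_list(host_mem, 64, values)

# Shared addressing: same memory, same pointer value, no preparation step
shared_result = accel_sum(host_mem, head)

# Copy model: the device buffer starts at address 0, so pointers must be fixed up
device_mem = bytearray(host_mem[64:64 + len(values) * NODE.size])
rebase(device_mem, len(values), 64)
copy_result = accel_sum(device_mem, 0)

print(shared_result, copy_result)
```

The copy path needs a traversal of the whole data structure before the accelerator can start, which is the kind of preparation work that pointer identity avoids.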

17
Relating content via concept graphs
▪ “@joe I wonder how do I use bitcoins”
▪ “Apple’s digital wallet, if widely adopted, could usher in a new era of ease and convenience.”
▪ “Icahn, who months ago called on eBay to spin off the lucrative online and mobile payment service, continues to believe that the payments field must be consolidated, either by PayPal buying up smaller rivals or by merging with another major player.”
▪ “Job ad: Lead Front-end Developer - Virtual Currency Exchange”
Conceptual reasoning allows us to relate content that is hard to connect otherwise
Watson Concept Insights

Watson Concept Insights Workload Pipeline
▪ Processing steps: documents → Constituency parse tree → Wikifier (graph linker) → Retrieve concept vectors from cache (assumes static graph!!!) → Merge concept vectors to form document vector → Reverse conceptual index (Cassandra)
▪ Query steps: Compute related concepts kernel → Retrieve related documents
▪ Stages:
– BASIC INGESTION (only once per life of document)
– CONCEPTUAL INDEXING (currently once per life of document, maybe 3-5 times in future)
– USER INTERFACE
– QUERY RUNTIME (hopefully millions of queries!!!)
▪ Hardware labels: External storage, one CPU socket (shown twice), CPU
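The indexing steps named in this pipeline (retrieve cached concept vectors, merge them into a document vector, build a reverse index from concepts to documents) can be sketched as follows. This is an illustration under stated assumptions, not the Watson implementation: the merge is assumed to be a normalized sum, an in-memory dict stands in for both the cache and the Cassandra index, and all concept names and weights are made up.

```python
from collections import defaultdict

# Hypothetical cache: concept -> sparse vector of related concepts (weights invented)
concept_cache = {
    "Bitcoin": {"Bitcoin": 0.6, "Digital currency": 0.4},
    "PayPal": {"PayPal": 0.5, "Digital currency": 0.3, "E-commerce": 0.2},
}

def merge_concept_vectors(vectors):
    """Sum sparse concept vectors and normalize the weights to total 1.0."""
    merged = defaultdict(float)
    for vec in vectors:
        for concept, weight in vec.items():
            merged[concept] += weight
    total = sum(merged.values()) or 1.0
    return {concept: weight / total for concept, weight in merged.items()}

def build_reverse_index(doc_vectors):
    """Map each concept to the (document, weight) pairs that mention it."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for concept, weight in vec.items():
            index[concept].append((doc_id, weight))
    return index

# Concepts linked to each document by an upstream wikifier step (assumed input)
doc_concepts = {"tweet": ["Bitcoin"], "news": ["PayPal"]}

doc_vectors = {
    doc_id: merge_concept_vectors([concept_cache[c] for c in concepts])
    for doc_id, concepts in doc_concepts.items()
}
reverse_index = build_reverse_index(doc_vectors)

related = sorted(doc_id for doc_id, _ in reverse_index["Digital currency"])
print(related)
```

Neither document mentions "Digital currency" literally, yet both are reachable through that concept, which is the effect the concept-graph slide describes.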

19
Watson Concept Insights: Compute Performance Comparison (CPU vs. GPU)
▪ Kernel: N-element Vector → Page Rank Calculations (5 Iterations) → Pareto Normalization → Scoring Combiner, for M Concepts
▪ N*N Sparse Matrix, loaded only once
▪ Current CPU Execution*: Page Rank Calculations (55 sec), Pareto Normalization (3 sec), Scoring Combiner (1 sec)
▪ Parallel Execution on GPU*, Batched Execution with batch size of 64: Init (0.027 s), Page Rank Calculations (2.21 sec), Pareto Normalization (0.032 sec), Scoring Combiner (0.0048 sec), final HOST step (0.016 s)
▪ CPU: 58 sec vs. GPU: 2.35 sec (25X)
M: Concepts under consideration (28 for the test)
N: Total number of concepts in Corpus (4.7M for Wikipedia)
* CPU: Ivy Bridge; GPU: Nvidia K40
20

21
Adverse Drug Reactions pose a serious challenge to the healthcare system
▪ Over 2 million serious adverse drug reactions (ADRs) yearly: 100,000 deaths
▪ $136 billion ADR associated cost yearly (> diabetic & cardiovascular care)
▪ Clinical Trials often do not reveal rare toxicity of some drugs, and they are not personalized
▪ 3–5% of in-hospital medication errors caused by unforeseen drug-drug interactions

Insight as a Service for Personalized and Detailed Adverse Drug Reactions Prediction
Leverage large amounts of data for personalized prediction of the nature, cause, and severity of adverse drug reactions
[Figure: data sources, including EMRs]

23
Personalized Medicine, Adverse Drug Reaction Prediction w/ ML
Ingest → Learn → Predict

Known Interactions of type 1 to …
Drug1       Drug2
Aspirin     Gliclazide
Aspirin     Dicoumarol

Drug1       Drug2
Aspirin     Probenecid
Aspirin     Azilsartan

Chemical Similarity 1 to …
Drug1       Drug2       Sim
Salsalate   Aspirin     .9
Dicoumarol  Warfarin    .76

Drug1       Drug2       Sim
Salsalate   Aspirin     .7
Dicoumarol  Warfarin    .6

Features: Candidate Interactions of type i
Drug1       Drug2       Best Sim1*Sim1   Best SimN*SimN
Salsalate   Gliclazide  .9*1             .7*1
Salsalate   Warfarin    .9*.76           .7*.6

Machine Learning Model

Interactions of type 1 Prediction
Drug1       Drug2       Prediction
Salsalate   Gliclazide  0.85
Salsalate   Warfarin    0.7
…
Interactions of type M Prediction
Drug1       Drug2       Prediction
Salsalate   Gliclazide  0.53
Salsalate   Warfarin    0.32

30X improvement in Learning performance
100s of TBs of data: 50 million patients, 2000 drugs, 2000 features
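The feature values on this slide (for example .9*.76 for Salsalate and Warfarin) can be reproduced with a small sketch. It assumes that "Best" means the maximum, over all known interacting pairs, of the product of the two drug-to-drug similarities, that a drug has similarity 1.0 with itself, and that missing similarities count as 0.0. The function names are hypothetical and the real feature pipeline may differ.

```python
known_interactions = [("Aspirin", "Gliclazide"), ("Aspirin", "Dicoumarol")]

# The two similarity tables shown on the slide
similarity_1 = {("Salsalate", "Aspirin"): 0.9, ("Dicoumarol", "Warfarin"): 0.76}
similarity_n = {("Salsalate", "Aspirin"): 0.7, ("Dicoumarol", "Warfarin"): 0.6}

def sim(table, a, b):
    """Symmetric lookup; identical drugs are fully similar (assumption)."""
    if a == b:
        return 1.0
    return table.get((a, b), table.get((b, a), 0.0))

def best_feature(table, candidate, known):
    """Best similarity product between a candidate pair and any known pair."""
    d1, d2 = candidate
    return max(sim(table, d1, k1) * sim(table, d2, k2) for k1, k2 in known)

candidates = [("Salsalate", "Gliclazide"), ("Salsalate", "Warfarin")]
features = {
    pair: (
        best_feature(similarity_1, pair, known_interactions),
        best_feature(similarity_n, pair, known_interactions),
    )
    for pair in candidates
}

for pair, (f1, fn) in features.items():
    print(pair, round(f1, 3), round(fn, 3))
```

These per-pair feature vectors are what the slide feeds into the machine learning model, one model per interaction type.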

Personalized Medicine – Adverse Drug Reaction Workload
Personalization will result in a massive increase in computational complexity
Real-time prediction requirements for operational needs (< 1 minute for emergency situations)
• Computational pattern:
- Sparse cube to dense cube with patient as additional dimension
• Training:
- Number of patients above 50 Million
- Number of features around 1800
- Additional samples for training O(#patients)
- Number of cross-validation stages and #models per stage increases dramatically
- 100X increase in training complexity with ~100 TBs of Data
• Prediction:
- Input Model (#features) and dataset (# patients in the hospital)
- 1800 features and 500,000 patients
- Real time

25
Programming Heterogeneous Systems
OpenCL?
SystemC?
VHDL?
C++?
Java?
CUDA?

26
Portability and Optimization in Heterogeneous Systems
Layered stack shown in the figure, from application down to hardware:
▪ Applications (three shown)
▪ Cognitive Middleware
▪ Library Layer
▪ CPU enablement, GPU enablement, FPGA interface & configuration, Accelerator X enablement
▪ Accelerator X
27
Accelerate Processing in a Connected World
Enable Compute-Intensive Cognitive Workloads
Exploit Best-of-Breed Accelerators
Provide Abstraction of Hardware Function
