PMaC
Performance Modeling and Characterization
1/35
AsHES 2012
Modeling and Predicting Application Performance on Hardware Accelerators
Authors: Mitesh Meswani*, Laura Carrington*, Didem Unat, Allan Snavely, Scott Baden, and Steve Poole
*San Diego Supercomputer Center, Performance Modeling and Characterization Lab (PMaC)
PMaC Lab
•  Goal: understand the factors that affect the runtime and, more recently, energy performance of HPC apps on current and future HPC systems
•  The PMaC framework provides fast and accurate predictions
–  Input: software characteristics, input data, hardware parameters
–  Output: a model that predicts expected performance
•  Tools: PEBIL, PMaCInst, PSiNSTracer, ETracer, IOTracer, ShmemTracer, PIR
•  Simulation: PSiNS, PSaPP
Prediction framework
Outline
•  Introduction
•  Methodology – developing models for FPGAs and GPUs
•  Results – workload predictions on accelerators
•  References
Why Accelerators?
•  Traditional processors
–  Solve the common case
–  Limited performance for specialized functions
•  Solution: use special-purpose co-processors, i.e., hardware accelerators
–  Examples: FPGA, GPU
Application porting is time consuming
•  HPC apps can exceed 100,000 lines of code
•  The best choice of accelerator is not apparent
•  It is prudent to evaluate the benefit prior to porting
•  Solution: performance prediction models
–  Allow fast evaluations without porting or running
–  Accuracy has to be high to be valuable
Methodology
•  First identify code sections that may benefit from accelerators
•  HPC applications can be expressed as a small set of commonly occurring compute and data-access patterns, also called idioms (e.g., transpose, reduction)
•  Predict the performance of idiom instances on accelerators
•  Port only the instances that are predicted to run faster
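The last step is just a comparison of predicted times. A minimal sketch in C, assuming one predicted time per device (in the real framework these would come from the idiom models; the type and function names here are illustrative, not PMaC's API):

```c
/* Predicted execution times of one idiom instance on each device.
 * The fpga/gpu entries are assumed to already include any data-migration cost. */
typedef struct {
    double cpu, fpga, gpu;
} predicted_times;

typedef enum { DEV_CPU, DEV_FPGA, DEV_GPU } device;

/* Port an instance only if some accelerator is predicted to beat the CPU. */
device choose_device(predicted_times t) {
    device best = DEV_CPU;
    double best_t = t.cpu;
    if (t.fpga < best_t) { best = DEV_FPGA; best_t = t.fpga; }
    if (t.gpu  < best_t) { best = DEV_GPU;  best_t = t.gpu;  }
    return best;
}
```

An instance whose best accelerator time is still above the CPU time simply stays on the CPU, which is what "port only instances that are predicted to run faster" means in practice.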
Our Study
•  Accelerators: Convey HC-1 FPGA system and NVIDIA Fermi GPU (Tesla C2070)
•  Characterize the accelerators for 8 common HPC idioms
•  Develop and validate idiom models on two real-world benchmarks
•  Present a case study of a hypothetical supercomputer with FPGAs and GPUs: for two popular HPC apps we predict speedups of up to 20%
What are Idioms?
•  An idiom is a pattern of computation and memory access.
•  Example: stream copy

for (int i = 0; i < n; i++)
    A[i] = B[i];
Idioms Used
•  Stream: A[i] = B[i] + C[i]
•  Gather: A[i] = B[C[i]]
•  Scatter: A[C[i]] = B[i]
•  Transpose: A[i][j] = B[j][i]
•  Reduction: s = s + A[i]
•  Stencil: A[i] = A[i-1] + A[i+1]
•  Matrix-Vector Multiply: C[i] += A[i][j]*B[j]
•  Matrix-Matrix Multiply: C[i][j] += A[i][k]*B[k][j]
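Written out as full loop nests, a few of the idioms above look as follows (illustrative C; the matrix is stored row-major in a flat array, which is an assumption of this sketch, not something the slide specifies):

```c
/* Reduction: s = s + A[i], accumulated over the whole array */
double reduction(const double *A, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += A[i];
    return s;
}

/* Gather: A[i] = B[C[i]] -- indirect loads through index array C */
void gather(double *A, const double *B, const int *C, int n) {
    for (int i = 0; i < n; i++)
        A[i] = B[C[i]];
}

/* Matrix-vector multiply: C[i] = sum over j of A[i][j] * B[j],
 * with A stored row-major as a flat n*n array */
void matvec(double *C, const double *A, const double *B, int n) {
    for (int i = 0; i < n; i++) {
        C[i] = 0.0;
        for (int j = 0; j < n; j++)
            C[i] += A[i * n + j] * B[j];
    }
}
```

The memory-access pattern is what defines each idiom: reduction and stream are contiguous, gather/scatter are indirect, and the matrix kernels add reuse across the inner loop.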
Hardware Accelerator #1 – Convey HC-1 FPGA
[Diagram: commodity Intel server paired with the Convey FPGA-based co-processor]
Hardware Accelerator #2 – NVIDIA Tesla C2070 GPU
[Diagram: x86 host with host memory attached to the GPU; the GPU has 16 streaming multiprocessors SM0…SM15 (each with 32 cores and 64 KB of L1 cache/shared memory), a shared L2 cache, and device memory]
Accelerator Characterizations
•  Simple benchmarks profile the capability of the GPU, FPGA, and CPU to perform each idiom operation
•  Each benchmark is run over a range of data sizes
Stream, Stencil
[Charts: memory bandwidth (GB/s) vs. data size (approx. 4 KB–70 MB) for BW_CPU, BW_FPGA, and BW_GPU; left panel Stream: A[i]=B[i], right panel Stencil: A[i]=B[i-1]+B[i+1]]
Transpose, Reduction
[Charts: memory bandwidth (GB/s) vs. data size (approx. 20 KB–4 GB) for BW_CPU, BW_FPGA, and BW_GPU; left panel Transpose: A[i,j]=A[j,i], right panel Reduction: sum+=A[i]]
Gather, Scatter
[Charts: memory bandwidth (GB/s) vs. data size (approx. 20 KB–200 MB) for BW_CPU, BW_FPGA, and BW_GPU; left panel Gather: A[i] = B[C[j]], right panel Scatter: A[B[i]] = C[j]]
Cost of Data Migration
[Chart: host-to-device memory bandwidth (GB/s, roughly 0–2.5) vs. data size (approx. 2 KB–1 GB) for BW_FPGA and BW_GPU]
Combining the idiom plots with the data-migration costs illustrates how hard it is to determine the best achievable performance from the GPU or FPGA for a given data size. The space is complex: there is no clear winner among CPU, FPGA, and GPU; it depends on the idiom and the dataset size.
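One way to see why there is no clear winner: for an offloaded idiom the relevant quantity is transfer time plus compute time, so a device with higher idiom bandwidth can still lose once migration is included. A minimal sketch, treating every time as data size divided by a bandwidth (all bandwidths in GB/s; a deliberate simplification of the real models):

```c
/* Total offload time in seconds: move the data over, then run the idiom.
 * size_bytes / (bw * 1e9) converts a GB/s bandwidth into seconds. */
double offload_time(double size_bytes, double transfer_bw, double idiom_bw) {
    return size_bytes / (transfer_bw * 1e9) + size_bytes / (idiom_bw * 1e9);
}

/* The accelerator wins only if offloading beats running in place on the host. */
int accelerator_wins(double size_bytes, double transfer_bw,
                     double accel_bw, double host_bw) {
    return offload_time(size_bytes, transfer_bw, accel_bw)
         < size_bytes / (host_bw * 1e9);
}
```

For example, a 60 GB/s accelerator behind a 2 GB/s link loses to a 10 GB/s host on a 1 MB working set, but wins once the link is fast enough; this is exactly the crossover behavior the migration chart implies.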
Application Characterizations – finding idioms
•  PMaC Idiom Recognizer (PIR):
–  A GCC plugin that recognizes idioms during compilation using IR tree analysis
–  Users can specify additional idioms using PIR’s idiom-expression syntax

File    Line#   Function   Idiom    Code
foo.c   623     Func1      gather   a[i] = b[d[j]]
tmp.c   992     Func2      stream   x[j] = c[i]
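PIR works on GCC's IR, but the flavor of the classification can be sketched at the source level. This toy classifier only inspects the index expressions of a C-style assignment and is in no way PIR's actual analysis; it just shows the distinction the table above encodes (indirect load = gather, indirect store = scatter, otherwise stream):

```c
#include <string.h>

/* Toy idiom classifier for an assignment "lhs = rhs".
 * A nested '[' on the right (e.g. "b[c[j]]") means an indirect load -> gather;
 * a nested '[' on the left means an indirect store -> scatter; else stream. */
const char *classify_stmt(const char *lhs, const char *rhs) {
    int lhs_indirect = strstr(lhs, "[") && strstr(strstr(lhs, "[") + 1, "[");
    int rhs_indirect = strstr(rhs, "[") && strstr(strstr(rhs, "[") + 1, "[");
    if (rhs_indirect) return "gather";
    if (lhs_indirect) return "scatter";
    return "stream";
}
```

The real tool does this robustly on the compiler's tree representation, which is why it can attribute each match to a file, line, and function as in the table.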
Application Characterizations – finding data size per idiom
•  PEBIL – binary instrumentation tool
–  To find the data size for an idiom:
•  Determine the basic blocks belonging to the idiom
•  Instrument those basic blocks to capture their data range
–  Run the instrumented binary and generate traces
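The data-range capture can be sketched as a tiny runtime that the instrumented basic blocks call on every memory access; the idiom's data size is then the span of touched addresses. This mirrors the idea, not PEBIL's actual interface:

```c
#include <stddef.h>
#include <stdint.h>

/* Per-idiom address-range tracker, updated from instrumented basic blocks. */
typedef struct {
    uintptr_t lo, hi;   /* lowest and one-past-highest byte address touched */
    int seen;           /* whether any access has been recorded yet */
} addr_range;

void record_access(addr_range *r, const void *addr, size_t bytes) {
    uintptr_t a = (uintptr_t)addr;
    if (!r->seen || a < r->lo) r->lo = a;
    if (!r->seen || a + bytes > r->hi) r->hi = a + bytes;
    r->seen = 1;
}

/* Data size fed into the idiom model: extent of the touched range. */
size_t data_size(const addr_range *r) {
    return r->seen ? (size_t)(r->hi - r->lo) : 0;
}
```

The resulting size is what selects the operating point on the bandwidth-vs-size curves from the characterization slides.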
Prediction Models
Model Validation – Fine-grained
•  Hmmer: protein-sequence code, run with 8 tasks on the GPU and FPGA systems
•  Flash: astrophysics code, sequential version run on the FPGA system

Application   Idiom                   Measured   Predicted   % Error
Hmmer         Stream (FPGA)           384.7      337.0       12.3%
Hmmer         Stream (GPU)            18.4       18.5        0.3%
Hmmer         Gather/Scatter (GPU)    0.074      0.087       17.3%
Flash         Gather/Scatter (FPGA)   69         68          1.4%
Model Validation – Graph500
•  FPGA validated:
–  We ran a scale-24 problem, 13 MTEPS
–  PIR analysis identifies the scatter and stream idioms in make_bfs
–  make_bfs was ported by Convey to the FPGA; the rest runs on the CPU
–  We use the CPU and FPGA models to predict speedups

            G500 (CPU)   G500 (Ported)    BFS speedup
Actual      5980         4686 (21.64%)    98x
Predicted   5847         4757 (18.65%)    96x
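The whole-application numbers in the table are essentially an Amdahl-style combination: the predicted CPU time of the ported kernel is swapped for its predicted FPGA time while the rest of the application is unchanged. A minimal sketch (all times are model outputs in the same units; the function names are ours, not the framework's):

```c
/* Predicted application runtime after porting one kernel: subtract the
 * kernel's CPU time, add its accelerator time, keep everything else. */
double predicted_ported_time(double app_cpu_time,
                             double kernel_cpu_time,
                             double kernel_accel_time) {
    return app_cpu_time - kernel_cpu_time + kernel_accel_time;
}

/* Kernel-level speedup, e.g. the ~96x predicted for make_bfs above. */
double kernel_speedup(double kernel_cpu_time, double kernel_accel_time) {
    return kernel_cpu_time / kernel_accel_time;
}
```

This is why a 96x kernel speedup only translates into an ~19% whole-application improvement: the unported portion of Graph500 dominates the total.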
Projection Study
•  Study a production HPC system – Jaguar, a Cray XT5
–  224,256 AMD cores, 300 TB memory
•  Applications:
–  Hycom – 8- and 256-cpu runs
–  Milc – 8- and 256-cpu runs
•  Q: What would be the projected speedup for an application running on a machine like Jaguar, but with an FPGA and a GPU on each node?
Results – CPU predictions

Application       Measured   Predicted   % Error
Milc (8cpu)       278        277         0.4%
Milc (256cpu)     1,345      1,350       0.4%
HYCOM (8cpu)      262        246         6.1%
HYCOM (256cpu)    809        663         18.1%
Idiom instances and runtime %

Contribution of idioms to runtime:
                  HYCOM (8cpu)   HYCOM (256cpu)   Milc (8cpu)   Milc (256cpu)
Gather/scatter    14.2%          4.6%             1.2%          0.7%
Stream            21.1%          16.9%            5.6%          3.0%

Idiom instances in source code:
Idiom             HYCOM   Milc
Gather/scatter    1,797   156
Stream            1,300   105
Run times of all idiom instances on one device

HYCOM 256cpu     CPU      FPGA    GPU
Gather/Scatter   7,768    495     638
Stream           28,459   2,302   44,166
Total            36,556   2,798   44,803

MILC 256cpu      CPU      FPGA    GPU
Gather/Scatter   2,376    334     399
Stream           10,452   771     1,087
Total            12,827   1,104   1,487
Optimal mapping of device

HYCOM 256cpu     CPU vs. FPGA   CPU vs. GPU   Optimal of CPU, GPU, FPGA   CPU      FPGA    GPU
Gather/Scatter   495            638           448                         7,768    495     638
Stream           2,297          6,096         2,149                       28,459   2,302   44,166
Total            2,792          6,734         2,596                       36,556   2,798   44,803

MILC 256cpu      CPU vs. FPGA   CPU vs. GPU   Optimal of CPU, GPU, FPGA   CPU      FPGA    GPU
Gather/Scatter   334            399           334                         2,376    334     399
Stream           770            1,087         765                         10,452   771     1,087
Total            1,104          1,486         1,099                       12,827   1,104   1,487
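The "optimal" totals come from letting every idiom instance independently run on whichever device is predicted fastest for its data size, which is why they can beat any single device's total. A sketch under that reading (one predicted-time triple per instance; the types are ours, for illustration):

```c
/* Predicted times for one idiom instance on each device, same units. */
typedef struct {
    double cpu, fpga, gpu;
} instance_times;

double min3(double a, double b, double c) {
    double m = a < b ? a : b;
    return m < c ? m : c;
}

/* Optimal mapping: each instance picks its fastest device, so the total
 * can be lower than running everything on any one device. */
double optimal_total(const instance_times *inst, int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += min3(inst[i].cpu, inst[i].fpga, inst[i].gpu);
    return total;
}
```

With two instances where the accelerator wins one and the CPU wins the other, the per-instance choice already beats either fixed assignment, mirroring how the optimal column can fall below both the CPU-vs-FPGA and CPU-vs-GPU columns.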
Summary of Results of porting
Porting the idioms yields overall improvements of 3.4% for Milc and 20% for HYCOM.
Conclusions and Future Work
•  The device choice (CPU, GPU, FPGA) depends on the idiom and the data size
•  Continue extending to other idioms and modeling data-transfer costs
•  Adapt the approach to modeling energy consumption
References
•  M. R. Meswani, L. Carrington, et al., “Modeling and Predicting Performance on Hardware Accelerators,” to appear as a poster at the International Symposium on Workload Characterization (IISWC), 2011.
•  C. Olschanowsky, M. R. Meswani, et al., “PIR: PMaC's Idiom Recognizer,” in Proceedings of Parallel Software Tools and Tool Infrastructures (PSTI 2010), held in conjunction with ICPP 2010, San Diego, CA, Sep 2010.
•  L. Carrington, M. Tikir, et al., “An Idiom-finding Tool for Increasing Productivity of Accelerators,” in Proceedings of the Int’l Conference on Supercomputing (ICS), 2011.
Questions?
•  Thank you for your attention!
•  PMaC URL: www.sdsc.edu/pmac
•  Email: mitesh.meswani@gmail.com
Extra Slides
Model Validation – Graph500
•  GPU validation is underway:
–  Our CPU predictions are within 13%
–  A CUDA version has been implemented; the measured GPU version speeds up make_bfs by 3x–5x
–  Optimization is ongoing work
Model Validation – Entire Application Run
•  Graph500: a data-intensive benchmark
–  Two main kernels: make_bfs and verify_bfs
–  Briefly, the algorithm proceeds in two steps:
–  First, it generates the graph based on the scale factor.
–  Second, it samples 64 random keys and, for each key:
•  executes make_bfs to generate the parent array, and then
•  executes the validate routine on the parent array.
–  The performance metric, TEPS (traversed edges per second), is calculated using the time for make_bfs
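The metric itself is straightforward; the Graph500 benchmark reports a harmonic mean of the per-search TEPS over the sampled keys, which the second helper sketches (a minimal reading of the convention, not the reference implementation):

```c
/* Traversed edges per second for one make_bfs run. */
double teps(double edges_traversed, double bfs_seconds) {
    return edges_traversed / bfs_seconds;
}

/* Harmonic mean of per-search TEPS values, as Graph500 reports it;
 * this weights each search's elapsed time equally. */
double harmonic_mean_teps(const double *teps_vals, int n) {
    double inv_sum = 0.0;
    for (int i = 0; i < n; i++)
        inv_sum += 1.0 / teps_vals[i];
    return n / inv_sum;
}
```

Under this convention, the "13 MTEPS" on the FPGA-validation slide is the aggregate rate over the 64 sampled BFS runs.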
Idiom Code Coverage

                            HYCOM    Milc
Gather/scatter              1797     156
Reduction                   110      22
Stream                      1300     105
Stencil                     132      0
Transpose                   3986     286
Mat-Mat Mult                2161     6
Mat-Vec Mult                115      2
Fraction of loops covered   67.45%   37.44%
Idiom runtime coverage

                 HYCOM (8cpu)   HYCOM (256cpu)   Milc (8cpu)   Milc (256cpu)
Gather/scatter   14.2%          4.6%             1.2%          0.7%
Reduction        0.0%           0.1%             15.7%         13.9%
Stream           21.1%          16.9%            5.6%          3.0%
Stencil          4.7%           11.1%            0.0%          0.0%
Transpose        0.9%           2.0%             0.0%          0.0%
Mat-Mat Mult     23.7%          8.6%             61.2%         58.6%
Mat-Vec Mult     0.0%           0.1%             10.5%         16.7%
All Idioms       64.6%          43.4%            94.2%         93.2%
Memory Time