SOFA - Basic Usage
SOFA Architecture
The average profiling overhead for frameworks (e.g. TensorFlow, MXNet, PyTorch) is less than 5%!
Distributed DL Frameworks
TensorFlow
PyTorch
MXNet
Caffe2
CNTK
Performance Counter Monitoring
Function and Event Tracing
SOFA Record
--------------------
● perf
● strace
● /proc/* filesystem
● vmstat
● blktrace
● tcpdump
● nvprof
● nvidia-smi
SOFA Preprocess
--------------------------
1. Consistent representation for performance data
2. Timestamp synchronization among heterogeneous traces
3. Function/Event/Counter filtering and grouping
SOFA Analyze
--------------------
● Summary-based analysis
● Spatial Trace Pattern Analysis
○ swarm generation
○ swarm captioning
○ swarm diff
● Temporal Trace Pattern Analysis
○ per-iteration timing breakdown
○ per-iteration performance statistics
traces
Spark
SOFA Visualization
---------------------------
Web-based dashboard built with Highcharts & D3.js
Coloring filtered functions/events/counters
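Steps 1 and 2 of SOFA Preprocess can be pictured with a small sketch. This is not SOFA's code: the `TraceEvent` fields and the `to_unix_time` and `merge_traces` helpers are hypothetical names, and the sketch assumes each tool's clock differs from Unix time by a single constant offset.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TraceEvent:
    """One record in a common format (hypothetical field names)."""
    timestamp: float   # seconds
    duration: float    # seconds
    name: str          # function, event, or counter name
    source: str        # tool that produced the record, e.g. "perf"


def to_unix_time(events, offset_s):
    """Shift events from a tool-local clock onto the Unix time axis.

    Assumes the tool's clock differs from Unix time by a constant offset.
    """
    return [replace(e, timestamp=e.timestamp + offset_s) for e in events]


def merge_traces(*traces):
    """Merge already-synchronized traces into one time-ordered list."""
    return sorted((e for t in traces for e in t), key=lambda e: e.timestamp)


# Illustrative values only: one trace counted from an arbitrary start,
# the other already in Unix time.
start_time = 1550681000.0
perf = [TraceEvent(154.2, 0.01, "sys_write", "perf")]
net = [TraceEvent(1550681154.0, 0.0, "packet", "tcpdump")]
merged = merge_traces(to_unix_time(perf, start_time), net)
```

Once every trace shares one time axis and one record layout, filtering and grouping (step 3) can treat all sources uniformly.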
Quick Start: Download, Install, and Run
● git clone https://github.com/cyliustack/sofa
● cd sofa
● ./tools/prepare.sh
● ./tools/empower-tcpdump.sh $(whoami)
○ Log out and log back in for the change to take effect; then cd sofa again
● ./install.sh /opt/sofa
● source /opt/sofa/tools/activate.sh
● [optional] ./tools/enable_strace_perf_pcm.py
● sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100"
● sofa report
Heterogeneous Traces Visualization in SOFA
X-axis = Unix timestamps (seconds); Y-axis = metrics with different units (log10 scale)
Trace                                Y-axis metric
CPU                                  CPU time (seconds)
NET                                  Payload of each packet (bytes)
VMSTAT_CS / VMSTAT_BI / VMSTAT_BO    counts/second
STRACE                               Duration (seconds)
MPSTAT_USR                           Seconds per 10 ms
GPU Kernel                           Duration (seconds)
CUDA_COPY_H2D (Host-to-Device)       Duration (seconds)
CUDA_COPY_D2H (Device-to-Host)       Duration (seconds)
SOFA vs. Deep Learning
[Figure: SOFA timeline annotated with GPU H2D memcpy, GPU D2H memcpy, GPU DNN backward propagation, GPU DNN forward propagation, CPU utilization, and network bandwidth]
Assignment 1 (Due Date: 04/02)
1. Finish all exercises on the previous pages and write a report on them.
2. Install SOFA and use it to profile “dd” with different configurations (block size, read/write counts, etc.), then write a report on the experiment.
[Regulation]
● Reports for Problems 1 and 2 are accepted only in PDF format.
Case Study: Storage
Commands:
sudo sysctl -w vm.drop_caches=3
sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
Case Study: Storage (cont.)
Commands:
sudo sysctl -w vm.drop_caches=3
sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
sofa report
[1] https://blog.csdn.net/hustyangju/article/details/40512467
[2] http://sylab-srv.cs.fiu.edu/lib/exe/fetch.php?media=paperclub:lkd3ch16.pdf
ls -lah /dev/mapper/
...
lrwxrwxrwx. 1 root root 7 Jan 30 14:16 cl-home -> ../dm-2
lrwxrwxrwx. 1 root root 7 Jan 30 14:16 cl-root -> ../dm-0
lrwxrwxrwx. 1 root root 7 Jan 30 14:16 cl-swap -> ../dm-1
COMMAND:
sofa record "dd if=/dev/zero of=dummy.out bs=1M count=1000"
sofa report
10 Hz diskstat monitoring,
unit: read/write sectors.
diskstats:
http://ykrocku.github.io/blog/2014/04/11/diskstats/
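As a rough illustration of what 10 Hz diskstat monitoring measures, the sketch below reads the sector counters from two /proc/diskstats samples and converts the difference into a write rate. The helper names and both sample lines are made up for illustration; only the field positions (sectors read in the sixth field, sectors written in the tenth) and the 512-byte sector unit follow the kernel's diskstats format.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def parse_diskstats(text):
    """Return {device: (sectors_read, sectors_written)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats


def write_throughput(before, after, device, interval_s):
    """Bytes written per second on `device` between two samples."""
    delta = after[device][1] - before[device][1]
    return delta * SECTOR_BYTES / interval_s


# Two made-up samples taken 0.1 s apart (10 Hz); only dm-0 is shown.
sample_a = " 253 0 dm-0 1200 0 96000 800 5000 0 400000 9000 0 700 9800"
sample_b = " 253 0 dm-0 1200 0 96000 800 5100 0 420480 9100 0 710 9900"
rate = write_throughput(
    parse_diskstats(sample_a), parse_diskstats(sample_b), "dm-0", 0.1
)
print(f"dm-0 write rate: {rate / 1e6:.1f} MB/s")
```

The device name dm-0 matches the cl-root mapping in the listing above; on another machine, pick the device that backs the file dd writes to.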
Case Study: Storage (cont.)
The spatial relation between functions, i.e., their virtual addresses, is another useful piece of information for abstracting a program's runtime behavior.
Hierarchical clustering makes application semantics more noticeable:
1. Page-fault-related function calls in the kernel module
2. In TensorFlow, image adjustment functions, gRPC core functions, and many other functions are within the same TensorFlow module (i.e., pywrap_tensorflow_internal.so)
Hierarchical Clustering
Case Study: Storage (cont.)
MPSTAT Profiling:
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 0 0 97 0 0
1 0 9 75 14 0
2 0 0 99 0 0
3 1 3 88 6 0
4 1 7 90 0 0
5 0 54 32 12 0
6 0 6 46 46 0
7 0 0 95 3 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 0.03 0.03 3.11 0.02 0.00
1 0.00 0.32 2.41 0.46 0.00
2 0.01 0.02 3.17 0.00 0.00
3 0.06 0.10 2.84 0.22 0.00
4 0.04 0.25 2.91 0.00 0.00
5 0.00 1.72 1.02 0.40 0.00
6 0.03 0.20 1.48 1.48 0.00
7 0.00 0.03 3.06 0.11 0.00
Active CPU Time (s): 5.510
Active CPU ratio (%): 22
Definition: Active CPU ratio = total non-idle time / (elapsed time * number of CPU cores)
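The definition can be checked against the CPU Time table with a few lines of Python. This is a sketch under two assumptions that the slide does not state: I/O wait (IOW) is counted as non-idle, and the sum of all columns over all cores stands in for elapsed time * number of CPU cores. The function name is hypothetical.

```python
# Per-core CPU time in seconds: (USR, SYS, IDL, IOW, IRQ), copied from the table above.
cpu_time = [
    (0.03, 0.03, 3.11, 0.02, 0.00),
    (0.00, 0.32, 2.41, 0.46, 0.00),
    (0.01, 0.02, 3.17, 0.00, 0.00),
    (0.06, 0.10, 2.84, 0.22, 0.00),
    (0.04, 0.25, 2.91, 0.00, 0.00),
    (0.00, 1.72, 1.02, 0.40, 0.00),
    (0.03, 0.20, 1.48, 1.48, 0.00),
    (0.00, 0.03, 3.06, 0.11, 0.00),
]


def active_cpu_ratio(rows):
    """Return (non-idle seconds, non-idle share of total time over all cores).

    Assumption: the sum of all columns approximates elapsed time * cores,
    and I/O wait (IOW) counts as non-idle.
    """
    non_idle = sum(usr + sys_ + iow + irq for usr, sys_, _idl, iow, irq in rows)
    total = sum(sum(row) for row in rows)
    return non_idle, non_idle / total


active_s, ratio = active_cpu_ratio(cpu_time)
print(f"Active CPU time: {active_s:.2f} s, active CPU ratio: {ratio:.1%}")
```

Because the table is rounded to two decimals, the result lands close to, but not exactly on, the reported 5.510 s and 22%.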
Case Study: Storage (cont.)
Exercise 1
● Drop the files cached in memory
○ sudo sysctl -w vm.drop_caches=3
● Write 500 blocks of zero bytes, 10 MB each, to a file on the local SSD
○ sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
○ sofa report
○ Note how the overhead changes from time = 1550681154.2 onward (page 37)
Exercise 2
● Mount a ramdisk on /mnt/tmpfs
○ sudo mkdir /mnt/tmpfs
○ sudo mount -t tmpfs -o size=10g none /mnt/tmpfs
● Write 500 blocks of zero bytes, 10 MB each, to a file on the ramdisk
○ sofa record "dd if=/dev/zero of=/mnt/tmpfs/dummy.out bs=10M count=500"
○ sofa report
Exercise 3
● Which block size gives the best I/O throughput (i.e., bytes/s) on your computer? Why? Can you explain the result with SOFA or another profiling tool?
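One way to approach Exercise 3 is to keep the total amount of data fixed and vary only the block size. The sketch below generates the corresponding `sofa record` command lines and converts a measured elapsed time into throughput. The helper names, the total size, the block-size list, and the `log_bs...` directory names are all illustrative choices; `--logdir` is the option shown in the SOFA usage text.

```python
import shlex

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}  # GNU dd binary suffixes


def dd_size_bytes(size):
    """Convert a dd size such as '10M' to bytes."""
    if size[-1] in UNITS:
        return int(size[:-1]) * UNITS[size[-1]]
    return int(size)


def sweep_commands(total, block_sizes, out="dummy.out"):
    """Yield (block_size, sofa command) pairs that all write `total` bytes."""
    total_bytes = dd_size_bytes(total)
    for bs in block_sizes:
        count = total_bytes // dd_size_bytes(bs)
        dd = f"dd if=/dev/zero of={out} bs={bs} count={count}"
        yield bs, f"sofa record {shlex.quote(dd)} --logdir log_bs{bs}"


def throughput(bs, count, elapsed_s):
    """I/O throughput in bytes per second."""
    return dd_size_bytes(bs) * count / elapsed_s


for bs, cmd in sweep_commands("1000M", ["64K", "1M", "10M", "100M"]):
    print(cmd)
```

Dropping the page cache before each run, as in Exercise 1, keeps the runs comparable.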
Case Study: Multi-threading of PI Calculation
Exercise 1
● $ git clone https://github.com/cyliustack/benchmark
● $ make -C benchmark/thread/exercise-04/
● $ ./benchmark/thread/exercise-04/exercise 123457890 8
Exercise 2
● $ git clone https://github.com/cyliustack/benchmark
● Change the random function from erand48_r() to rand()
● $ make -C benchmark/thread/exercise-04/
● $ ./benchmark/thread/exercise-04/exercise 123457890 8
Exercise 3
When erand48_r() is replaced by rand(), does the execution time increase or decrease? Why?
Prove your answer with a suitable performance profiling tool.
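For readers who have not opened the benchmark source, the sketch below shows the general shape of a multi-threaded Monte Carlo PI estimate. It assumes the exercise uses this method, which the slides do not confirm, and it is written in Python rather than the benchmark's C. Each worker owns its generator state, much as erand48_r() works on a caller-supplied buffer. Python threads do not reproduce the C program's scaling behavior, so the sketch is for reading, not for measuring.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(seed, samples):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # generator state owned by this worker only
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(total_samples, workers):
    """Split the samples across workers and combine their hit counts."""
    per_worker = total_samples // workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(count_hits, range(workers), [per_worker] * workers))
    return 4.0 * hits / (per_worker * workers)


print(estimate_pi(200_000, 8))
```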
Case Study: Multi-threading of PI Calculation (cont.)
MPSTAT Profiling (# of threads: 2):
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 97 2 0 0 0
1 99 0 0 0 0
2 0 0 100 0 0
3 0 0 100 0 0
4 0 0 99 0 0
5 0 0 100 0 0
6 0 0 100 0 0
7 0 0 99 0 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 11.10 0.24 0.04 0.00 0.00
1 11.28 0.07 0.04 0.00 0.00
2 0.00 0.00 11.40 0.00 0.00
3 0.00 0.00 11.40 0.00 0.00
4 0.00 0.03 11.38 0.00 0.00
5 0.00 0.00 11.40 0.00 0.00
6 0.00 0.00 11.40 0.00 0.00
7 0.00 0.01 11.39 0.00 0.00
Active CPU Time (s): 22.726
Active CPU ratio (%): 25
Final Performance Features
name value
0 elapsed_time 11.344319
1 active_cpu_ratio 25.000000
MPSTAT Profiling (# of threads: 8):
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 95 2 1 0 0
1 97 0 2 0 0
2 97 0 2 0 0
3 96 1 2 0 0
4 97 0 1 0 0
5 94 3 2 0 0
6 97 0 1 0 0
7 96 0 3 0 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 4.79 0.13 0.08 0.00 0.00
1 4.85 0.02 0.13 0.00 0.00
2 4.89 0.00 0.11 0.00 0.00
3 4.83 0.06 0.11 0.00 0.00
4 4.88 0.04 0.08 0.00 0.00
5 4.71 0.17 0.13 0.00 0.00
6 4.89 0.01 0.09 0.00 0.00
7 4.81 0.02 0.17 0.00 0.00
Active CPU Time (s): 39.090
Active CPU ratio (%): 99
Final Performance Features
name value
0 elapsed_time 4.919824
1 active_cpu_ratio 99.000000
MPSTAT Profiling (# of threads: 4):
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 97 1 0 0 0
1 97 0 1 0 0
2 98 0 1 0 0
3 94 2 3 0 0
4 0 0 100 0 0
5 0 0 100 0 0
6 0 0 100 0 0
7 0 0 99 0 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 5.73 0.11 0.05 0.00 0.00
1 5.76 0.02 0.11 0.00 0.00
2 5.80 0.04 0.06 0.00 0.00
3 5.54 0.15 0.20 0.00 0.00
4 0.00 0.00 5.90 0.00 0.00
5 0.00 0.00 5.90 0.00 0.00
6 0.00 0.00 5.90 0.00 0.00
7 0.00 0.01 5.89 0.00 0.00
Active CPU Time (s): 23.159
Active CPU ratio (%): 49
Final Performance Features
name value
0 elapsed_time 5.84204
1 active_cpu_ratio 49.00000
Case Study: Multi-threading of PI Calculation (cont.)
# of threads: 2
Active CPU Time (s): 22.726
Active CPU ratio (%): 25
Final Performance Features
name value
0 elapsed_time 11.344319
1 active_cpu_ratio 25.000000
# of threads: 8
Active CPU Time (s): 39.090
Active CPU ratio (%): 99
Final Performance Features
name value
0 elapsed_time 4.919824
1 active_cpu_ratio 99.000000
# of threads: 4
Active CPU Time (s): 23.159
Active CPU ratio (%): 49
Final Performance Features
name value
0 elapsed_time 5.84204
1 active_cpu_ratio 49.00000
Observations:
● Active CPU time should stay the same as the number of threads increases
● Hyper-threading makes CPU resources overprovisioned -> non-linear scale-up
[Figure: sequential execution time, with segments labeled t]
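The non-linear scale-up can be quantified from the three elapsed_time values reported above. The sketch below computes speedup and parallel efficiency relative to the 2-thread run; the function name is hypothetical, and the numbers are the ones on this slide.

```python
# elapsed_time (s) per thread count, taken from the summaries above
elapsed = {2: 11.344319, 4: 5.84204, 8: 4.919824}


def scaling(elapsed_by_threads):
    """Speedup and parallel efficiency relative to the smallest thread count."""
    base_n = min(elapsed_by_threads)
    base_t = elapsed_by_threads[base_n]
    rows = {}
    for n, t in sorted(elapsed_by_threads.items()):
        speedup = base_t / t
        rows[n] = (speedup, speedup / (n / base_n))
    return rows


for n, (speedup, eff) in scaling(elapsed).items():
    print(f"{n} threads: speedup {speedup:.2f}x, efficiency {eff:.0%}")
```

Going from 2 to 4 threads gives close to a 2x speedup, whereas 8 threads give only about 2.3x instead of the ideal 4x.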
Case Study: CUDA Memory Copy
Command:
sofa record ~/NVIDIA_CUDA-9.1_Samples/1_Utilities/bandwidthTest/bandwidthTest
What causes the network traces (i.e., tcpdump traces)?
SOFA - Advanced Usage
SOFA Advanced Usage
usage: sofa [-h] [--logdir /path/to/logdir/]
[--gpu_filters "keyword1:color1,keyword2:color2"]
[--cpu_filters "keyword1:color1,keyword2:color2"] [--cpu_top_k N]
[--num_iterations N] [--num_swarms N] [--cpu_time_offset_ms N]
[--plot_ratio N] [--viz_port N] [--profile_all_cpus] [--verbose]
[--enable_aisi] [--display_swarms] [--base_logdir BASE_LOGDIR]
[--match_logdir MATCH_LOGDIR] [--hsg_multifeatures]
[--enable_vmstat] [--skip_preprocess]
[--network_filters "ip1,ip2,ip3"] [--enable_pcm]
[--cuda_api_tracing]
[--perf_events "cycles,instructions,cache-misses"]
[--potato_server "ip:port"]
<stat|record|report|preprocess|analyze|diff|viz|clean>
[<PROFILED_COMMAND>]
SOFA Advanced Usage (cont.)
More performance metrics (hardware performance counters):
sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500" --perf_events="cycles,instructions,cache-misses,branch-misses"
More performance metrics (CUDA API tracing):
sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --cuda_api_tracing
More performance metrics (strace):
sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --enable_strace
More performance metrics (tcpdump):
sofa record "sleep 5" --enable_tcpdump
Background recording for a daemon or a multi-command bash script:
sofa record "sleep 20" --profile_all_cpus
Then execute the target command
SOFA Advanced Usage (cont.)
Verbose mode shows more information, such as report-generation progress and detailed reports (e.g., total system call time)
sofa report --verbose
Automatically identify iterative swarms and report a per-iteration performance summary
sofa report --enable_aisi --num_iterations 20
Display the top-10 hotspot swarms, highlighted in different colors
sofa report --verbose --display_swarms
Reduce the number of points shown in the visualization
sofa report --plot_ratio 10
Absolute or relative (default) timestamps
sofa report
sofa report --absoluate_timestamp
SOFA Advanced Usage (cont.)
Apply filters to highlight traces of interest
sofa report --cpu_filters='tensorflow:orange' --gpu_filters='fw:blue' --gpu_filters='bw:red' --gpu_filters='nccl:purple'
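The filter options take `keyword:color` pairs, and the usage text also shows a comma-separated form ("keyword1:color1,keyword2:color2"). The sketch below shows one plausible way such a specification could be parsed and applied. It is not SOFA's implementation: the helper names, the first-match rule, the gray default, and the kernel name in the example are all assumptions.

```python
def parse_filters(spec):
    """Parse 'keyword1:color1,keyword2:color2' into (keyword, color) pairs."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        keyword, sep, color = item.partition(":")
        if not sep or not keyword or not color:
            raise ValueError(f"expected keyword:color, got {item!r}")
        pairs.append((keyword, color))
    return pairs


def color_for(name, filters, default="gray"):
    """Return the color of the first filter whose keyword occurs in `name`."""
    for keyword, color in filters:
        if keyword in name:
            return color
    return default


gpu_filters = parse_filters("fw:blue,bw:red,nccl:purple")
print(color_for("ncclAllReduceKernel", gpu_filters))
```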
Compare two-run traces swarm-by-swarm to find the affected swarms due to
hardware/software/system changes:
sofa record "dd if=/dev/zero of=dummy.out bs=100M count=10" --logdir log1
sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100" --logdir log2
sofa diff --base_logdir log1 --match_logdir log2
Get performance tuning suggestions from the POTATO server (Foxconn In-house Performance Optimization & Auto-tuning Orchestration)
sofa report --potato_server "192.168.0.100:5000"
Automatic Iterative Swarm Identification (AISI)
Command:
sofa record "dd if=/dev/zero of=dummy.out bs=100M count=10"
sofa report --enable_aisi --num_iterations 10
AISI: Automatic Iterative Swarm Identification
Automatic Iterative Swarm Identification (AISI) (cont.)
Command:
● cp -r /usr/local/cuda/samples ~
● Edit ~/samples/1_Utilities/bandwidthTest/bandwidthTest.cu so that main() calls runTest() three times in a loop.
● make -C ~/samples/1_Utilities/bandwidthTest/
● sofa record ~/samples/1_Utilities/bandwidthTest/bandwidthTest
● sofa report --enable_aisi --num_iterations 3
int main(int argc, char **argv)
{
    // …
    int iRetVal = 0;
    for (int k = 0; k < 3; k++) {
        iRetVal = runTest(argc, (const char **)argv);
    }
    // …
}
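With the three-iteration loop in place, the recorded trace should contain the same event pattern three times, which is what --num_iterations 3 asks the report step to find. The toy sketch below illustrates that idea on a list of event names. It is not SOFA's AISI algorithm: it only accepts a sequence that is an exact repetition, whereas real traces are noisier. The function name and the example sequence are made up; the event names are borrowed from the trace legend earlier in the deck.

```python
def split_iterations(events, num_iterations):
    """Split an event-name sequence into equal repeats, or return None.

    Toy check: succeeds only if the sequence is exactly `num_iterations`
    copies of the same block.
    """
    if num_iterations <= 0 or len(events) % num_iterations:
        return None
    size = len(events) // num_iterations
    chunks = [events[i * size:(i + 1) * size] for i in range(num_iterations)]
    return chunks if all(c == chunks[0] for c in chunks) else None


# Made-up sequence: one host-to-device and one device-to-host copy per iteration.
trace = ["CUDA_COPY_H2D", "CUDA_COPY_D2H"] * 3
iterations = split_iterations(trace, 3)
print(len(iterations), "iterations of", iterations[0])
```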
Automatic Iterative Swarm Identification (AISI) (cont.)
Command (same setup and three-iteration loop as on the previous slide, with --aisi_via_strace added to the report step):
● sofa record ~/samples/1_Utilities/bandwidthTest/bandwidthTest
● sofa report --enable_aisi --num_iterations 3 --aisi_via_strace
Absolute or Relative (default) Timestamp
Command:
● sofa record ~/samples/1_Utilities/bandwidthTest/bandwidthTest
● sofa report OR sofa report --absoluate_timestamp

 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

SOFA Tutorial

  • 2. SOFA Architecture
       The average profiling overhead for frameworks (e.g. TensorFlow, MXNet, PyTorch) is less than 5%!
       Distributed DL Frameworks: TensorFlow, PyTorch, MXNet, Caffe2, CNTK
       SOFA Record (Performance Counter Monitoring; Function and Event Tracing)
         ● perf
         ● strace
         ● /proc/* filesystem
         ● vmstat
         ● blktrace
         ● tcpdump
         ● nvprof
         ● nvidia-smi
       SOFA Preprocess
         1. Consistent representation for performance data
         2. Timestamp synchronization among heterogeneous traces
         3. Function/Event/Counter filtering and grouping
       SOFA Analyze
         ● Summary-based analysis
         ● Spatial Trace Pattern Analysis
           ○ swarm generation
           ○ swarm captioning
           ○ swarm diff
         ● Temporal Trace Pattern Analysis
           ○ per-iteration timing breakdown
           ○ per-iteration performance statistics
       SOFA Visualization
         Web-based dashboard by Highchart & D3.js
         Coloring filtered functions/events/counters
       Other diagram labels: traces; Spark
  • 3. Quick Start: Download, Install, and Run
       ● git clone https://github.com/cyliustack/sofa
       ● cd sofa
       ● ./tools/prepare.sh
       ● ./tools/empower-tcpdump.sh $(whoami)
         ○ Log out and log back in for the change to take effect; then cd sofa
       ● ./install.sh /opt/sofa
       ● source /opt/sofa/tools/activate.sh
       ● [optional] ./tools/enable_strace_perf_pcm.py
       ● sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100"
       ● sofa report
  • 4. X-axis = Unix timestamps (seconds); Y-axis = metrics with different units (log10 scale)
       Trace                                       Y-axis metric
       CPU                                         CPU time (seconds)
       NET                                         Payload of each packet (bytes)
       VMSTAT_CS/VMSTAT_BI/VMSTAT_BO               counts/second
       STRACE                                      Duration (seconds)
       MPSTAT_USR                                  Seconds per 10 ms
       GPU Kernel,
       CUDA_COPY_H2D (Host-to-Device),
       CUDA_COPY_D2H (Device-to-Host)              Duration (seconds)
  • 5. Heterogeneous Traces Visualization in SOFA
       Figure labels: GPU H2D memcpy; GPU D2H memcpy; GPU DNN Backward Propagation; GPU DNN Forward Propagation; CPU Utilization; Network Bandwidth
  • 6. SOFA vs. Deep Learning
  • 7. Assignment 1 (Due Date: 04/02)
       1. Finish all exercises in the previous pages. Write a report for these exercises.
       2. Install SOFA, and use SOFA to profile "dd" with different configurations such as block size and read/write count. Write a report for this experiment.
       [Regulation]
       ● Reports for Problems 1 and 2 are accepted only in PDF format.
  • 8. Case Study: Storage
       Commands:
         sudo sysctl -w vm.drop_caches=3
         sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
  • 9. Case Study: Storage (cont.)
       Commands:
         sudo sysctl -w vm.drop_caches=3
         sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
  • 10. Case Study: Storage (cont.)
       Commands:
         sudo sysctl -w vm.drop_caches=3
         sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
         sofa report
       [1] https://blog.csdn.net/hustyangju/article/details/40512467
       [2] http://sylab-srv.cs.fiu.edu/lib/exe/fetch.php?media=paperclub:lkd3ch16.pdf
  • 11. ls -lah /dev/mapper/
       ...
       lrwxrwxrwx. 1 root root 7 Jan 30 14:16 cl-home -> ../dm-2
       lrwxrwxrwx. 1 root root 7 Jan 30 14:16 cl-root -> ../dm-0
       lrwxrwxrwx. 1 root root 7 Jan 30 14:16 cl-swap -> ../dm-1
       Commands:
         sofa record "dd if=/dev/zero of=dummy.out bs=1M count=1000"
         sofa report
       diskstat monitoring runs at 10 Hz; the unit is sectors read/written.
       diskstats: http://ykrocku.github.io/blog/2014/04/11/diskstats/
  • 12. Case Study: Storage (cont.): Hierarchical Clustering
       The spatial relation, i.e., the virtual addresses of the functions, is another useful piece of information when one wants to abstract the runtime behavior of a program.
       Hierarchical clustering makes application semantics more noticeable:
       1. Page-fault-related function calls are grouped within the kernel module.
       2. In TensorFlow, image adjustment functions, gRPC core functions, and many other functions fall within the same TensorFlow module (i.e., pywrap_tensorflow_internal.so).
  • 13. Case Study: Storage (cont.)
       MPSTAT Profiling:
       CPU Utilization (%):
         core  USR  SYS  IDL  IOW  IRQ
         0       0    0   97    0    0
         1       0    9   75   14    0
         2       0    0   99    0    0
         3       1    3   88    6    0
         4       1    7   90    0    0
         5       0   54   32   12    0
         6       0    6   46   46    0
         7       0    0   95    3    0
       CPU Time (s):
         core  USR   SYS   IDL   IOW   IRQ
         0     0.03  0.03  3.11  0.02  0.00
         1     0.00  0.32  2.41  0.46  0.00
         2     0.01  0.02  3.17  0.00  0.00
         3     0.06  0.10  2.84  0.22  0.00
         4     0.04  0.25  2.91  0.00  0.00
         5     0.00  1.72  1.02  0.40  0.00
         6     0.03  0.20  1.48  1.48  0.00
         7     0.00  0.03  3.06  0.11  0.00
       Active CPU Time (s): 5.510
       Active CPU ratio (%): 22
       Definition: Active CPU ratio = total non-idle time / (elapsed time * CPU cores)
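The definition of the active CPU ratio can be checked against the reported numbers. The following is a minimal sketch, not SOFA's own code: the function name `active_cpu_ratio` is hypothetical, and the elapsed time of 3.19 s is an assumption inferred from the per-core row sums of the CPU Time table (e.g. core 0: 0.03 + 0.03 + 3.11 + 0.02 = 3.19).

```python
def active_cpu_ratio(active_cpu_time_s, elapsed_time_s, num_cores):
    """Return the active CPU ratio in percent.

    Follows the slide's definition:
    total non-idle time / (elapsed time * CPU cores).
    """
    if elapsed_time_s <= 0 or num_cores <= 0:
        raise ValueError("elapsed time and core count must be positive")
    return 100.0 * active_cpu_time_s / (elapsed_time_s * num_cores)


# Active CPU time and core count come from the MPSTAT report above.
# The 3.19 s elapsed time is an assumption (per-core row sum), not a
# value printed by SOFA.
ratio = active_cpu_ratio(5.510, 3.19, 8)
print(f"Active CPU ratio: {ratio:.1f}%")
```

Under that assumption the result lands close to the 22% that the report prints; the exact rounding SOFA applies is not stated on the slide.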
  • 14. Case Study: Storage (cont.)
       Exercise 1
       ● Clean up files cached in memory
         ○ sudo sysctl -w vm.drop_caches=3
       ● Write zero bytes into a file on the local SSD 500 times with a block size of 10 MB
         ○ sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
         ○ sofa report
         ○ Note how the overhead changes from time = 1550681154.2 onward (page 37)
       Exercise 2
       ● Mount a ramdisk onto /mnt/tmpfs
         ○ mkdir /mnt/tmpfs
         ○ mount -t tmpfs -o size=10g none /mnt/tmpfs
       ● Write zero bytes into a file on the ramdisk 500 times with a block size of 10 MB
         ○ sofa record "dd if=/dev/zero of=/mnt/tmpfs/dummy.out bs=10M count=500"
         ○ sofa report
       Exercise 3
       ● Which block size gives the best I/O throughput (i.e. bytes/s) on your computer? Why? Can you use SOFA or other profiling tools to explain the result?
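For Exercise 3, a block-size sweep can be prototyped before running the real `dd` experiments under SOFA. The sketch below is only an illustration: the helper names `measure_write_throughput` and `sweep_block_sizes`, the file names, and the chosen sizes are all hypothetical, and Python's own per-call overhead will skew results for very small blocks.

```python
import os
import tempfile
import time


def measure_write_throughput(path, block_size, count):
    """Write `count` zero-filled blocks of `block_size` bytes to `path`
    and return the observed throughput in bytes per second."""
    block = bytes(block_size)  # zero-filled buffer, similar to /dev/zero
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the time needed to reach the device
    elapsed = time.perf_counter() - start
    return block_size * count / elapsed


def sweep_block_sizes(directory, total_bytes, block_sizes):
    """Write the same total volume with each block size.

    Returns a dict mapping block size to throughput in bytes/s.
    """
    results = {}
    for bs in block_sizes:
        path = os.path.join(directory, f"dummy_{bs}.out")
        results[bs] = measure_write_throughput(path, bs, total_bytes // bs)
        os.remove(path)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        sizes = [4 * 1024, 64 * 1024, 1024 * 1024]
        for bs, bps in sweep_block_sizes(d, 16 * 1024 * 1024, sizes).items():
            print(f"bs={bs:>8} B  throughput={bps / 1e6:.1f} MB/s")
```

Pointing `directory` at the SSD and then at /mnt/tmpfs gives a rough comparison of the two targets from Exercises 1 and 2; the SOFA traces are still needed to explain why a given block size wins.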
  • 15. Case Study: Multi-threading of PI Calculation
       Exercise 1
       ● $ git clone https://github.com/cyliustack/benchmark
       ● $ make -C benchmark/thread/exercise-04/
       ● $ ./benchmark/thread/exercise-04/exercise 123457890 8
       Exercise 2
       ● $ git clone https://github.com/cyliustack/benchmark
       ● Change the random function from erand48_r() to rand()
       ● $ make -C benchmark/thread/exercise-04/
       ● $ ./benchmark/thread/exercise-04/exercise 123457890 8
       Exercise 3
       ● When erand48_r() is replaced by rand(), does the execution time increase or decrease? Why? Support your answer with measurements from suitable performance profiling tools.
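The structure of such a benchmark can be sketched as follows, assuming the exercise estimates π by Monte Carlo sampling (which the use of erand48_r() suggests, although the slide does not say so). The names `count_hits` and `estimate_pi` are hypothetical. Each worker owns its random-number generator, which mirrors the per-thread state that erand48_r() takes as an argument; rand(), by contrast, uses hidden state shared by all threads.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(seed, samples):
    """Count random points in the unit square that fall inside the
    quarter circle, using an RNG private to this worker."""
    rng = random.Random(seed)  # per-worker state, nothing shared between threads
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(total_samples, num_threads):
    """Split the samples evenly across workers and combine their counts."""
    per_thread = total_samples // num_threads
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        hits = sum(pool.map(count_hits, range(num_threads),
                            [per_thread] * num_threads))
    return 4.0 * hits / (per_thread * num_threads)


print(estimate_pi(200_000, 8))
```

This sketch only shows how the work and the generator state are partitioned. CPython's global interpreter lock keeps these threads from running in parallel, so it is not suitable for reproducing the timing results; the C benchmark is needed for that.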
  • 16. Case Study: Multi-threading of PI Calculation (cont.)

       MPSTAT Profiling:
       CPU Utilization (%):
         core  USR  SYS  IDL  IOW  IRQ
         0      97    2    0    0    0
         1      99    0    0    0    0
         2       0    0  100    0    0
         3       0    0  100    0    0
         4       0    0   99    0    0
         5       0    0  100    0    0
         6       0    0  100    0    0
         7       0    0   99    0    0
       CPU Time (s):
         core  USR    SYS   IDL    IOW   IRQ
         0     11.10  0.24   0.04  0.00  0.00
         1     11.28  0.07   0.04  0.00  0.00
         2      0.00  0.00  11.40  0.00  0.00
         3      0.00  0.00  11.40  0.00  0.00
         4      0.00  0.03  11.38  0.00  0.00
         5      0.00  0.00  11.40  0.00  0.00
         6      0.00  0.00  11.40  0.00  0.00
         7      0.00  0.01  11.39  0.00  0.00
       Active CPU Time (s): 22.726
       Active CPU ratio (%): 25
       Final Performance Features
            name              value
         0  elapsed_time      11.344319
         1  active_cpu_ratio  25.000000

       MPSTAT Profiling:
       CPU Utilization (%):
         core  USR  SYS  IDL  IOW  IRQ
         0      95    2    1    0    0
         1      97    0    2    0    0
         2      97    0    2    0    0
         3      96    1    2    0    0
         4      97    0    1    0    0
         5      94    3    2    0    0
         6      97    0    1    0    0
         7      96    0    3    0    0
       CPU Time (s):
         core  USR   SYS   IDL   IOW   IRQ
         0     4.79  0.13  0.08  0.00  0.00
         1     4.85  0.02  0.13  0.00  0.00
         2     4.89  0.00  0.11  0.00  0.00
         3     4.83  0.06  0.11  0.00  0.00
         4     4.88  0.04  0.08  0.00  0.00
         5     4.71  0.17  0.13  0.00  0.00
         6     4.89  0.01  0.09  0.00  0.00
         7     4.81  0.02  0.17  0.00  0.00
       Active CPU Time (s): 39.090
       Active CPU ratio (%): 99
       Final Performance Features
            name              value
         0  elapsed_time      4.919824
         1  active_cpu_ratio  99.000000

       MPSTAT Profiling:
       CPU Utilization (%):
         core  USR  SYS  IDL  IOW  IRQ
         0      97    1    0    0    0
         1      97    0    1    0    0
         2      98    0    1    0    0
         3      94    2    3    0    0
         4       0    0  100    0    0
         5       0    0  100    0    0
         6       0    0  100    0    0
         7       0    0   99    0    0
       CPU Time (s):
         core  USR   SYS   IDL   IOW   IRQ
         0     5.73  0.11  0.05  0.00  0.00
         1     5.76  0.02  0.11  0.00  0.00
         2     5.80  0.04  0.06  0.00  0.00
         3     5.54  0.15  0.20  0.00  0.00
         4     0.00  0.00  5.90  0.00  0.00
         5     0.00  0.00  5.90  0.00  0.00
         6     0.00  0.00  5.90  0.00  0.00
         7     0.00  0.01  5.89  0.00  0.00
       Active CPU Time (s): 23.159
       Active CPU ratio (%): 49
       Final Performance Features
            name              value
         0  elapsed_time      5.84204
         1  active_cpu_ratio  49.00000
  • 17. Case Study: Multi-threading of PI Calculation (cont.)
       # of threads  Active CPU Time (s)  Active CPU ratio (%)  elapsed_time (s)
       2             22.726               25                    11.344319
       4             23.159               49                    5.84204
       8             39.090               99                    4.919824
       Lessons learned:
       ● Active CPU time should stay the same as the number of threads increases.
       ● Hyper-threading makes CPU resources overprovisioned, so the speedup is non-linear.
       Figure labels: Sequential execution time; t
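The non-linear scale-up can be quantified directly from the elapsed times reported above. This is a small sketch using the 2-thread run as the baseline (a choice made here, since the slide gives no single-thread measurement); the helper name `scaling` is hypothetical.

```python
# Elapsed times (seconds) reported on the slide, keyed by thread count.
elapsed = {2: 11.344319, 4: 5.84204, 8: 4.919824}

baseline_threads = 2  # assumption: no 1-thread run is reported
baseline_time = elapsed[baseline_threads]


def scaling(threads):
    """Return (speedup, efficiency) relative to the 2-thread baseline.

    Efficiency is the measured speedup divided by the ideal speedup
    (threads / baseline_threads).
    """
    speedup = baseline_time / elapsed[threads]
    ideal = threads / baseline_threads
    return speedup, speedup / ideal


for n in sorted(elapsed):
    speedup, efficiency = scaling(n)
    print(f"{n} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Going from 2 to 4 threads nearly halves the elapsed time, while going from 4 to 8 threads gains much less, which is consistent with the hyper-threading remark on the slide.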
  • 18. Case Study: CUDA Memory Copy
       What causes the network traces (i.e. tcpdump traces) to appear?
       Command:
         sofa record ~/NVIDIA_CUDA-9.1_Samples/1_Utilities/bandwidthTest/bandwidthTest
  • 20. SOFA Advanced Usage
       usage: sofa [-h] [--logdir /path/to/logdir/]
                   [--gpu_filters "keyword1:color1,keyword2:color2"]
                   [--cpu_filters "keyword1:color1,keyword2:color2"]
                   [--cpu_top_k N] [--num_iterations N] [--num_swarms N]
                   [--cpu_time_offset_ms N] [--plot_ratio N] [--viz_port N]
                   [--profile_all_cpus] [--verbose] [--enable_aisi]
                   [--display_swarms] [--base_logdir BASE_LOGDIR]
                   [--match_logdir MATCH_LOGDIR] [--hsg_multifeatures]
                   [--enable_vmstat] [--skip_preprocess]
                   [--network_filters "ip1,ip2,ip3"] [--enable_pcm]
                   [--cuda_api_tracing]
                   [--perf_events "cycles,instructions,cache-misses"]
                   [--potato_server "ip:port"]
                   <stat|record|report|preprocess|analyze|diff|viz|clean>
                   [<PROFILED_COMMAND>]
  • 21. SOFA Advanced Usage (cont.)
       More performance metrics (hardware performance events):
         sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500" --perf_events="cycles,instructions,cache-misses,branch-misses"
       More performance metrics (CUDA API tracing):
         sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --cuda_api_tracing
       More performance metrics (strace):
         sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --enable_strace
       More performance metrics (tcpdump):
         sofa record "sleep 5" --enable_tcpdump
       Background recording for a daemon or a multi-command bash file:
         sofa record "sleep 20" --profile_all_cpus
         Then execute the target command.
  • 22. SOFA Advanced Usage (cont.)
       Verbose mode shows more information, such as report-generation progress and detailed reports (e.g., total system call time):
         sofa report --verbose
       Automatically identify iterative swarms and then expose a per-iteration performance summary:
         sofa report --enable_aisi --num_iterations 20
       Display the top-10 hotspot swarms, highlighted in different colors:
         sofa report --verbose --display_swarms
       Reduce the number of points shown in the visualization:
         sofa report --plot_ratio 10
       Absolute or relative (default) timestamps:
         sofa report
         sofa report --absoluate_timestamp
  • 23. SOFA Advanced Usage (cont.)
       Apply filters to highlight traces of interest:
         sofa report --cpu_filters='tensorflow:orange' --gpu_filters='fw:blue' --gpu_filters='bw:red' --gpu_filters='nccl:purple'
       Compare the traces of two runs swarm by swarm to find the swarms affected by hardware/software/system changes:
         sofa record "dd if=/dev/zero of=dummy.out bs=100M count=10" --logdir log1
         sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100" --logdir log2
         sofa diff --base_logdir log1 --match_logdir log2
       Get performance tuning suggestions from the POTATO Server (Foxconn In-house Performance Optimization & Auto-tuning Orchestration):
         sofa report --potato_server "192.168.0.100:5000"
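The "keyword:color" filter format shown in the usage text can be illustrated with a short sketch. This is not SOFA's implementation: the helper names `parse_filters` and `color_for`, the substring-matching rule, and the default color are all assumptions made for illustration.

```python
def parse_filters(spec):
    """Parse "keyword1:color1,keyword2:color2" into (keyword, color) pairs."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        keyword, sep, color = item.partition(":")
        if not sep or not keyword or not color:
            raise ValueError(f"expected keyword:color, got {item!r}")
        pairs.append((keyword, color))
    return pairs


def color_for(name, filters, default="gray"):
    """Return the color of the first filter whose keyword occurs in `name`.

    Substring matching is an assumption; SOFA may match differently.
    """
    for keyword, color in filters:
        if keyword in name:
            return color
    return default


filters = parse_filters("fw:blue,bw:red,nccl:purple")
print(color_for("nccl_allreduce_kernel", filters))
```

With a rule like this, any trace whose function name contains "nccl" would be drawn in purple and unmatched traces would keep the default color.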
  • 24. Automatic Iterative Swarm Identification (AISI)
       Command:
         sofa record "dd if=/dev/zero of=dummy.out bs=100M count=10"
         sofa report --enable_aisi --num_iterations 10
       Figure caption: AISI: Automatic Iterative Swarm Analysis
  • 25. Automatic Iterative Swarm Identification (AISI) (cont.)
       Command:
       ● cp -r /usr/local/cuda/samples ~
       ● Edit the bandwidthTest source under ~/samples/1_Utilities/bandwidthTest/ so that the test runs in a three-iteration loop (see the code below).
       ● make -C ~/samples/1_Utilities/bandwidthTest/
       ● sofa record ~/samples/1_Utilities/bandwidthTest/bandwidthTest
       ● sofa report --enable_aisi --num_iterations 3

         int main(int argc, char **argv)
         {
             …
             int iRetVal = 0;
             /* Run the bandwidth test three times so AISI sees three iterations. */
             for (int k = 0; k < 3; k++) {
                 iRetVal = runTest(argc, (const char **)argv);
             }
             ....
         }
  • 26. Automatic Iterative Swarm Identification (AISI) (cont.)
       Command:
       ● cp -r /usr/local/cuda/samples ~
       ● Edit the bandwidthTest source under ~/samples/1_Utilities/bandwidthTest/ so that the test runs in a three-iteration loop (see the code below).
       ● make -C ~/samples/1_Utilities/bandwidthTest/
       ● sofa record ~/samples/1_Utilities/bandwidthTest/bandwidthTest
       ● sofa report --enable_aisi --num_iterations 3 --aisi_via_strace

         int main(int argc, char **argv)
         {
             …
             int iRetVal = 0;
             /* Run the bandwidth test three times so AISI sees three iterations. */
             for (int k = 0; k < 3; k++) {
                 iRetVal = runTest(argc, (const char **)argv);
             }
             ....
         }
  • 27. Absolute or Relative (default) Timestamp
       Command:
       ● sofa record ~/samples/1_Utilities/bandwidthTest/bandwidthTest
       ● sofa report
         OR
         sofa report --absoluate_timestamp