These slides show how to use SOFA for performance analysis of CPU/GPU cooperative programs, especially programs running on deep software stacks such as TensorFlow and PyTorch.
Source code:
https://github.com/cyliustack/sofa
2. SOFA Architecture
The average profiling overhead for
frameworks (e.g. TensorFlow, MXNet,
PyTorch) is less than 5%!
Distributed DL frameworks: TensorFlow, PyTorch, MXNet, Caffe2, CNTK
Performance counter monitoring
Function and event tracing
SOFA Record
--------------------
● perf
● strace
● /proc/* filesystem
● vmstat
● blktrace
● tcpdump
● nvprof
● nvidia-smi
SOFA Preprocess
--------------------------
1. Consistent representation for performance data
2. Timestamp synchronization among heterogeneous traces
3. Function/event/counter filtering and grouping
SOFA Analyze
--------------------
● Summary-based analysis
● Spatial Trace Pattern Analysis
○ swarm generation
○ swarm captioning
○ swarm diff
● Temporal Trace Pattern Analysis
○ per-iteration timing breakdown
○ per-iteration performance statistics
SOFA Visualization
---------------------------
Web-based dashboard built with Highcharts & D3.js
Coloring of filtered functions/events/counters
3. Quick Start: Download, Install, and Run
● git clone https://github.com/cyliustack/sofa
● cd sofa
● ./tools/prepare.sh
● ./tools/empower-tcpdump.sh $(whoami)
○ Log out and log back in for the changes to take effect; then cd sofa again
● ./install.sh /opt/sofa
● source /opt/sofa/tools/activate.sh
● [optional] ./tools/enable_strace_perf_pcm.py
● sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100"
● sofa report
4. X-axis: Unix timestamps (seconds); Y-axis: metrics with different units (log10 scale)
CPU: CPU time (seconds)
NET: payload of each packet (bytes)
VMSTAT_CS / VMSTAT_BI / VMSTAT_BO: counts per second
STRACE: duration (seconds)
MPSTAT_USR: seconds per 10-ms window
GPU Kernel, CUDA_COPY_H2D (host-to-device), CUDA_COPY_D2H (device-to-host): duration (seconds)
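The MPSTAT_USR unit above ("seconds per 10-ms window") converts to an ordinary utilization percentage by dividing by the window length. A minimal sketch (the 10-ms window comes from the table above; the helper name is ours):

```python
def mpstat_user_util(seconds_in_window, window_s=0.01):
    """Convert an MPSTAT_USR sample (user-mode seconds accumulated within
    one 10-ms sampling window) to a utilization percentage."""
    return 100.0 * seconds_in_window / window_s

# 0.008 s of user time inside a 10-ms window is ~80% user utilization
print(mpstat_user_util(0.008))
```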
7. Assignment 1 (Due Date: 04/02)
1. Finish all exercises on the previous pages and write a report on them.
2. Install SOFA, and use SOFA to profile "dd" under different configurations (block size, read/write counts, etc.). Write a report on this experiment.
[Regulation]
● Problems 1 and 2 are accepted only in PDF format.
12. Case Study: Storage (cont.)
The spatial relation, i.e., the virtual addresses of the functions, is another useful piece of information when one wants to abstract runtime program behavior.
Hierarchical clustering makes application semantics more noticeable:
1. Page-fault-related function calls in the kernel module
2. In TensorFlow, image-adjustment functions, gRPC core functions, and many other functions fall within the same TensorFlow module (i.e., pywrap_tensorflow_internal.so)
Hierarchical Clustering
13. Case Study: Storage (cont.)
MPSTAT Profiling:
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 0 0 97 0 0
1 0 9 75 14 0
2 0 0 99 0 0
3 1 3 88 6 0
4 1 7 90 0 0
5 0 54 32 12 0
6 0 6 46 46 0
7 0 0 95 3 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 0.03 0.03 3.11 0.02 0.00
1 0.00 0.32 2.41 0.46 0.00
2 0.01 0.02 3.17 0.00 0.00
3 0.06 0.10 2.84 0.22 0.00
4 0.04 0.25 2.91 0.00 0.00
5 0.00 1.72 1.02 0.40 0.00
6 0.03 0.20 1.48 1.48 0.00
7 0.00 0.03 3.06 0.11 0.00
Active CPU Time (s): 5.510
Active CPU ratio (%): 22
Definition: Active CPU ratio = total non-idle time / (elapsed time × number of CPU cores)
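The definition above can be checked against the CPU-time table. A quick sketch using the per-core values transcribed from that table (IRQ is all zeros; SOFA reports 5.510 s, which differs slightly from the recomputed sum because the displayed table is rounded to two decimals):

```python
# Per-core non-idle CPU time (s), transcribed from the table above
usr = [0.03, 0.00, 0.01, 0.06, 0.04, 0.00, 0.03, 0.00]
sys_ = [0.03, 0.32, 0.02, 0.10, 0.25, 1.72, 0.20, 0.03]
iow = [0.02, 0.46, 0.00, 0.22, 0.00, 0.40, 1.48, 0.11]

active = sum(usr) + sum(sys_) + sum(iow)   # total non-idle time (IRQ = 0)
elapsed = 3.19                             # wall-clock window, s (row sum of any core)
cores = 8
ratio = 100.0 * active / (elapsed * cores)
print(f"Active CPU time: {active:.2f} s, Active CPU ratio: {ratio:.0f}%")
```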
14. Case Study: Storage (cont.)
Exercise 1
● Clean up files cached in memory
○ sudo sysctl -w vm.drop_caches=3
● Write zero bytes to a file on the local SSD 500 times with a block size of 10 MB
○ sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
○ sofa report
○ Note how the overhead changes after time = 1550681154.2 (page 37)
Exercise 2
● Mount a ramdisk at /mnt/tmpfs
○ sudo mkdir /mnt/tmpfs
○ sudo mount -t tmpfs -o size=10g none /mnt/tmpfs
● Write zero bytes to a file on the ramdisk 500 times with a block size of 10 MB
○ sofa record "dd if=/dev/zero of=/mnt/tmpfs/dummy.out bs=10M count=500"
○ sofa report
Exercise 3
● Which block size is optimal on your computer in terms of I/O throughput (i.e., bytes/s)? Why? Can you use SOFA or other profiling tools to explain?
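One way to approach Exercise 3 is to sweep the block size programmatically and compare throughput. A sketch (assumes a Linux system with dd on the PATH; the block sizes and output path are our choices, and your numbers will vary with the storage device):

```python
import os
import subprocess
import tempfile
import time

def dd_throughput(bs, total=10 * 1024 * 1024):
    """Write `total` zero bytes with block size `bs` via dd; return bytes/s."""
    out = os.path.join(tempfile.mkdtemp(), "dummy.out")
    count = max(total // bs, 1)
    start = time.time()
    subprocess.run(
        ["dd", "if=/dev/zero", f"of={out}", f"bs={bs}", f"count={count}"],
        check=True, capture_output=True)
    elapsed = time.time() - start
    os.remove(out)
    return bs * count / elapsed

for bs in (512, 4096, 65536, 1024 * 1024):
    print(f"bs={bs:>8}: {dd_throughput(bs) / 1e6:8.1f} MB/s")
```

Very small block sizes pay a per-call overhead (one write syscall per block), so throughput usually rises with bs until it plateaus; SOFA's STRACE and blktrace views can show where that overhead goes.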
15. Case Study: Multi-threading of PI Calculation
Exercise 1
● $ git clone https://github.com/cyliustack/benchmark
● $ make -C benchmark/thread/exercise-04/
● $./benchmark/thread/exercise-04/exercise 123457890 8
Exercise 2
● $ git clone https://github.com/cyliustack/benchmark
● Change the random-number function from erand48_r() to rand()
● $ make -C benchmark/thread/exercise-04/
● $./benchmark/thread/exercise-04/exercise 123457890 8
Exercise 3
When erand48_r() is replaced by rand(), does the execution time increase or decrease? Why?
Please support your answer with proper performance profiling tools.
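A hypothesis worth testing: glibc's rand() protects one shared generator state with an internal lock, so all threads serialize on it, whereas erand48_r() keeps its state in a caller-supplied buffer and never contends. The shared-state-versus-private-state pattern can be sketched as follows (a Python analogy only; function names are ours, and Python's GIL means this does not reproduce the real timing difference):

```python
import threading
import random

def monte_carlo_pi(n, threads, shared_rng=False):
    """Monte Carlo pi estimate. With shared_rng=True every thread draws
    from one lock-protected generator (like glibc rand()'s single hidden
    state); otherwise each thread owns its generator (like erand48_r()
    with a per-thread state buffer)."""
    lock = threading.Lock()
    shared = random.Random(0)
    per_thread = n // threads
    hits = [0] * threads

    def worker(i):
        rng = shared if shared_rng else random.Random(i)
        h = 0
        for _ in range(per_thread):
            if shared_rng:
                with lock:                 # serialization point
                    x, y = rng.random(), rng.random()
            else:
                x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:
                h += 1
        hits[i] = h

    ts = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return 4.0 * sum(hits) / (per_thread * threads)

print(monte_carlo_pi(100_000, 8))
```

In the C exercise, perf or SOFA should show the rand() variant spending extra time in lock/futex paths while per-thread erand48_r() scales cleanly.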
16. Case Study: Multi-threading of PI Calculation (cont.)
# of threads: 2
MPSTAT Profiling:
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 97 2 0 0 0
1 99 0 0 0 0
2 0 0 100 0 0
3 0 0 100 0 0
4 0 0 99 0 0
5 0 0 100 0 0
6 0 0 100 0 0
7 0 0 99 0 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 11.10 0.24 0.04 0.00 0.00
1 11.28 0.07 0.04 0.00 0.00
2 0.00 0.00 11.40 0.00 0.00
3 0.00 0.00 11.40 0.00 0.00
4 0.00 0.03 11.38 0.00 0.00
5 0.00 0.00 11.40 0.00 0.00
6 0.00 0.00 11.40 0.00 0.00
7 0.00 0.01 11.39 0.00 0.00
Active CPU Time (s): 22.726
Active CPU ratio (%): 25
Final Performance Features
name value
0 elapsed_time 11.344319
1 active_cpu_ratio 25.000000
# of threads: 8
MPSTAT Profiling:
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 95 2 1 0 0
1 97 0 2 0 0
2 97 0 2 0 0
3 96 1 2 0 0
4 97 0 1 0 0
5 94 3 2 0 0
6 97 0 1 0 0
7 96 0 3 0 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 4.79 0.13 0.08 0.00 0.00
1 4.85 0.02 0.13 0.00 0.00
2 4.89 0.00 0.11 0.00 0.00
3 4.83 0.06 0.11 0.00 0.00
4 4.88 0.04 0.08 0.00 0.00
5 4.71 0.17 0.13 0.00 0.00
6 4.89 0.01 0.09 0.00 0.00
7 4.81 0.02 0.17 0.00 0.00
Active CPU Time (s): 39.090
Active CPU ratio (%): 99
Final Performance Features
name value
0 elapsed_time 4.919824
1 active_cpu_ratio 99.000000
# of threads: 4
MPSTAT Profiling:
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 97 1 0 0 0
1 97 0 1 0 0
2 98 0 1 0 0
3 94 2 3 0 0
4 0 0 100 0 0
5 0 0 100 0 0
6 0 0 100 0 0
7 0 0 99 0 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 5.73 0.11 0.05 0.00 0.00
1 5.76 0.02 0.11 0.00 0.00
2 5.80 0.04 0.06 0.00 0.00
3 5.54 0.15 0.20 0.00 0.00
4 0.00 0.00 5.90 0.00 0.00
5 0.00 0.00 5.90 0.00 0.00
6 0.00 0.00 5.90 0.00 0.00
7 0.00 0.01 5.89 0.00 0.00
Active CPU Time (s): 23.159
Active CPU ratio (%): 49
Final Performance Features
name value
0 elapsed_time 5.84204
1 active_cpu_ratio 49.00000
17. Case Study: Multi-threading of PI Calculation (cont.)
# of threads: 2
Active CPU Time (s): 22.726
Active CPU ratio (%): 25
Final Performance Features
name value
0 elapsed_time 11.344319
1 active_cpu_ratio 25.000000
# of threads: 8
Active CPU Time (s): 39.090
Active CPU ratio (%): 99
Final Performance Features
name value
0 elapsed_time 4.919824
1 active_cpu_ratio 99.000000
# of threads: 4
Active CPU Time (s): 23.159
Active CPU ratio (%): 49
Final Performance Features
name value
0 elapsed_time 5.84204
1 active_cpu_ratio 49.00000
Experiences:
● Active CPU time should stay roughly constant as the number of threads increases
● Hyper-threading overprovisions CPU resources, so scale-up beyond the physical core count is non-linear
(Figure: sequential execution time divided into per-thread portions t)
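The "Experiences" bullets can be quantified from the three runs above:

```python
# Elapsed time (s) and active CPU time (s) from the three runs above
elapsed = {2: 11.344319, 4: 5.84204, 8: 4.919824}
active = {2: 22.726, 4: 23.159, 8: 39.090}

for n in (2, 4, 8):
    speedup = elapsed[2] / elapsed[n]   # relative to the 2-thread run
    print(f"{n} threads: speedup x{speedup:.2f}, active CPU {active[n]:.1f} s")
```

Going from 2 to 4 threads nearly halves the elapsed time (x1.94), but 8 threads reach only x2.31 while active CPU time inflates from ~23 s to ~39 s: the hyper-threaded logical cores accumulate busy time without proportional progress.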
18. Case Study: CUDA Memory Copy
Question: what causes the network traces (i.e., tcpdump traces) to appear?
Command:
sofa record ~/NVIDIA_CUDA-9.1_Samples/1_Utilities/bandwidthTest/bandwidthTest
21. SOFA Advanced Usage (cont.)
More performance counters (perf events):
sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500" --perf_events="cycles,instructions,cache-misses,branch-misses"
CUDA API tracing:
sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --cuda_api_tracing
System-call tracing:
sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --enable_strace
Network-packet tracing:
sofa record "sleep 5" --enable_tcpdump
Background recording for daemons or multi-command bash scripts (record system-wide while the target runs):
sofa record "sleep 20" --profile_all_cpus
Then, execute the target command.
22. SOFA Advanced Usage (cont.)
Verbose mode shows more information, such as report-generation progress and detailed statistics (e.g., total system-call time):
sofa report --verbose
Automatically identify iterative swarms and expose a per-iteration performance summary:
sofa report --enable_aisi --num_iterations 20
Display the top-10 hotspot swarms, highlighted in different colors:
sofa report --verbose --display_swarms
Reduce the number of points shown on visualization interfaces
sofa report --plot_ratio 10
Absolute or relative (default) timestamps:
sofa report
sofa report --absoluate_timestamp
23. SOFA Advanced Usage (cont.)
Apply filters to highlight traces of interest:
sofa report --cpu_filters='tensorflow:orange' --gpu_filters='fw:blue' --gpu_filters='bw:red' --gpu_filters='nccl:purple'
Compare traces from two runs swarm-by-swarm to find swarms affected by hardware/software/system changes:
sofa record "dd if=/dev/zero of=dummy.out bs=100M count=10" --logdir log1
sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100" --logdir log2
sofa diff --base_logdir log1 --match_logdir log2
Get performance-tuning suggestions from a POTATO server (Foxconn's in-house Performance Optimization & Auto-tuning Orchestration):
sofa report --potato_server "192.168.0.100:5000"