These slides show how to use SOFA for performance analysis of CPU/GPU cooperative programs, especially programs running on deep software stacks such as TensorFlow and PyTorch.
Source code:
https://github.com/cyliustack/sofa
2. SOFA Architecture
The average profiling overhead for
frameworks (e.g. TensorFlow, MXNet,
PyTorch) is less than 5%!
Distributed DL frameworks: TensorFlow, PyTorch, MXNet, Caffe2, CNTK
Performance counter monitoring
Function and event tracing
SOFA Record
--------------------
● perf
● strace
● /proc/* filesystem
● vmstat
● blktrace
● tcpdump
● nvprof
● nvidia-smi
SOFA Preprocess
--------------------------
1. Consistent representation for performance data
2. Timestamp synchronization among heterogeneous traces
3. Function/event/counter filtering and grouping
SOFA Analyze
--------------------
● Summary-based analysis
● Spatial Trace Pattern Analysis
○ swarm generation
○ swarm captioning
○ swarm diff
● Temporal Trace Pattern Analysis
○ per-iteration timing breakdown
○ per-iteration performance statistics
SOFA Visualization
---------------------------
Web-based dashboard built with Highcharts & D3.js
Coloring of filtered functions/events/counters
3. Quick Start: Download, Install, and Run
● git clone https://github.com/cyliustack/sofa
● cd sofa
● ./tools/prepare.sh
● ./tools/empower-tcpdump.sh $(whoami)
○ Log out and log back in for the changes to take effect; then cd sofa again
● ./install.sh /opt/sofa
● source /opt/sofa/tools/activate.sh
● [optional] ./tools/enable_strace_perf_pcm.py
● sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100"
● sofa report
4. X-axis: Unix timestamps (seconds); Y-axis: metrics with different units (log10 scale)
CPU: CPU time (seconds)
NET: payload of each packet (bytes)
VMSTAT_CS / VMSTAT_BI / VMSTAT_BO: counts per second
STRACE: duration (seconds)
MPSTAT_USR: seconds per 10-ms window
GPU Kernel, CUDA_COPY_H2D (host-to-device), CUDA_COPY_D2H (device-to-host): duration (seconds)
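The MPSTAT_USR unit above ("seconds per 10-ms window") converts to an ordinary utilization percentage by dividing by the window length. A minimal sketch (the 10-ms window comes from the table above; the helper name is ours):

```python
def mpstat_user_util(seconds_in_window, window_s=0.01):
    """Convert an MPSTAT_USR sample (user-mode seconds accumulated within
    one 10-ms sampling window) to a utilization percentage."""
    return 100.0 * seconds_in_window / window_s

# 0.008 s of user time inside a 10-ms window is ~80% user utilization
print(mpstat_user_util(0.008))
```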
7. Assignment 1 (Due Date: 04/02)
1. Finish all exercises on the previous pages and write a report on them.
2. Install SOFA, and use SOFA to profile "dd" under different configurations (block size, read/write counts, etc.). Write a report on this experiment.
[Regulation]
● Problems 1 and 2 are accepted only in PDF format.
12. Case Study: Storage (cont.)
The spatial relation, i.e., the virtual addresses of the functions, is another useful piece of information when one wants to abstract runtime program behavior.
Hierarchical clustering makes application semantics more noticeable:
1. Page-fault-related function calls in the kernel module
2. In TensorFlow, image-adjustment functions, gRPC core functions, and many other functions fall within the same TensorFlow module (i.e., pywrap_tensorflow_internal.so)
Hierarchical Clustering
13. Case Study: Storage (cont.)
MPSTAT Profiling:
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 0 0 97 0 0
1 0 9 75 14 0
2 0 0 99 0 0
3 1 3 88 6 0
4 1 7 90 0 0
5 0 54 32 12 0
6 0 6 46 46 0
7 0 0 95 3 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 0.03 0.03 3.11 0.02 0.00
1 0.00 0.32 2.41 0.46 0.00
2 0.01 0.02 3.17 0.00 0.00
3 0.06 0.10 2.84 0.22 0.00
4 0.04 0.25 2.91 0.00 0.00
5 0.00 1.72 1.02 0.40 0.00
6 0.03 0.20 1.48 1.48 0.00
7 0.00 0.03 3.06 0.11 0.00
Active CPU Time (s): 5.510
Active CPU ratio (%): 22
Definition: Active CPU ratio = total non-idle time / (elapsed time × number of CPU cores)
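The definition above can be checked against the CPU-time table. A quick sketch using the per-core values transcribed from that table (IRQ is all zeros; SOFA reports 5.510 s, which differs slightly from the recomputed sum because the displayed table is rounded to two decimals):

```python
# Per-core non-idle CPU time (s), transcribed from the table above
usr = [0.03, 0.00, 0.01, 0.06, 0.04, 0.00, 0.03, 0.00]
sys_ = [0.03, 0.32, 0.02, 0.10, 0.25, 1.72, 0.20, 0.03]
iow = [0.02, 0.46, 0.00, 0.22, 0.00, 0.40, 1.48, 0.11]

active = sum(usr) + sum(sys_) + sum(iow)   # total non-idle time (IRQ = 0)
elapsed = 3.19                             # wall-clock window, s (row sum of any core)
cores = 8
ratio = 100.0 * active / (elapsed * cores)
print(f"Active CPU time: {active:.2f} s, Active CPU ratio: {ratio:.0f}%")
```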
14. Case Study: Storage (cont.)
Exercise 1
● Clean up files cached in memory
○ sudo sysctl -w vm.drop_caches=3
● Write zero bytes to a file on the local SSD 500 times with a block size of 10 MB
○ sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500"
○ sofa report
○ Note how the overhead changes after time = 1550681154.2 (page 37)
Exercise 2
● Mount a ramdisk at /mnt/tmpfs
○ sudo mkdir /mnt/tmpfs
○ sudo mount -t tmpfs -o size=10g none /mnt/tmpfs
● Write zero bytes to a file on the ramdisk 500 times with a block size of 10 MB
○ sofa record "dd if=/dev/zero of=/mnt/tmpfs/dummy.out bs=10M count=500"
○ sofa report
Exercise 3
● Which block size is optimal on your computer in terms of I/O throughput (i.e., bytes/s)? Why? Can you use SOFA or other profiling tools to explain?
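One way to approach Exercise 3 is to sweep the block size programmatically and compare throughput. A sketch (assumes a Linux system with dd on the PATH; the block sizes and output path are our choices, and your numbers will vary with the storage device):

```python
import os
import subprocess
import tempfile
import time

def dd_throughput(bs, total=10 * 1024 * 1024):
    """Write `total` zero bytes with block size `bs` via dd; return bytes/s."""
    out = os.path.join(tempfile.mkdtemp(), "dummy.out")
    count = max(total // bs, 1)
    start = time.time()
    subprocess.run(
        ["dd", "if=/dev/zero", f"of={out}", f"bs={bs}", f"count={count}"],
        check=True, capture_output=True)
    elapsed = time.time() - start
    os.remove(out)
    return bs * count / elapsed

for bs in (512, 4096, 65536, 1024 * 1024):
    print(f"bs={bs:>8}: {dd_throughput(bs) / 1e6:8.1f} MB/s")
```

Very small block sizes pay a per-call overhead (one write syscall per block), so throughput usually rises with bs until it plateaus; SOFA's STRACE and blktrace views can show where that overhead goes.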
15. Case Study: Multi-threading of PI Calculation
Exercise 1
● $ git clone https://github.com/cyliustack/benchmark
● $ make -C benchmark/thread/exercise-04/
● $./benchmark/thread/exercise-04/exercise 123457890 8
Exercise 2
● $ git clone https://github.com/cyliustack/benchmark
● Change the random-number function from erand48_r() to rand()
● $ make -C benchmark/thread/exercise-04/
● $./benchmark/thread/exercise-04/exercise 123457890 8
Exercise 3
When erand48_r() is replaced by rand(), does the execution time increase or decrease? Why?
Please support your answer with proper performance profiling tools.
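A hypothesis worth testing: glibc's rand() protects one shared generator state with an internal lock, so all threads serialize on it, whereas erand48_r() keeps its state in a caller-supplied buffer and never contends. The shared-state-versus-private-state pattern can be sketched as follows (a Python analogy only; function names are ours, and Python's GIL means this does not reproduce the real timing difference):

```python
import threading
import random

def monte_carlo_pi(n, threads, shared_rng=False):
    """Monte Carlo pi estimate. With shared_rng=True every thread draws
    from one lock-protected generator (like glibc rand()'s single hidden
    state); otherwise each thread owns its generator (like erand48_r()
    with a per-thread state buffer)."""
    lock = threading.Lock()
    shared = random.Random(0)
    per_thread = n // threads
    hits = [0] * threads

    def worker(i):
        rng = shared if shared_rng else random.Random(i)
        h = 0
        for _ in range(per_thread):
            if shared_rng:
                with lock:                 # serialization point
                    x, y = rng.random(), rng.random()
            else:
                x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:
                h += 1
        hits[i] = h

    ts = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return 4.0 * sum(hits) / (per_thread * threads)

print(monte_carlo_pi(100_000, 8))
```

In the C exercise, perf or SOFA should show the rand() variant spending extra time in lock/futex paths while per-thread erand48_r() scales cleanly.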
16. Case Study: Multi-threading of PI Calculation (cont.)
# of threads: 2
MPSTAT Profiling:
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 97 2 0 0 0
1 99 0 0 0 0
2 0 0 100 0 0
3 0 0 100 0 0
4 0 0 99 0 0
5 0 0 100 0 0
6 0 0 100 0 0
7 0 0 99 0 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 11.10 0.24 0.04 0.00 0.00
1 11.28 0.07 0.04 0.00 0.00
2 0.00 0.00 11.40 0.00 0.00
3 0.00 0.00 11.40 0.00 0.00
4 0.00 0.03 11.38 0.00 0.00
5 0.00 0.00 11.40 0.00 0.00
6 0.00 0.00 11.40 0.00 0.00
7 0.00 0.01 11.39 0.00 0.00
Active CPU Time (s): 22.726
Active CPU ratio (%): 25
Final Performance Features
name value
0 elapsed_time 11.344319
1 active_cpu_ratio 25.000000
# of threads: 8
MPSTAT Profiling:
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 95 2 1 0 0
1 97 0 2 0 0
2 97 0 2 0 0
3 96 1 2 0 0
4 97 0 1 0 0
5 94 3 2 0 0
6 97 0 1 0 0
7 96 0 3 0 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 4.79 0.13 0.08 0.00 0.00
1 4.85 0.02 0.13 0.00 0.00
2 4.89 0.00 0.11 0.00 0.00
3 4.83 0.06 0.11 0.00 0.00
4 4.88 0.04 0.08 0.00 0.00
5 4.71 0.17 0.13 0.00 0.00
6 4.89 0.01 0.09 0.00 0.00
7 4.81 0.02 0.17 0.00 0.00
Active CPU Time (s): 39.090
Active CPU ratio (%): 99
Final Performance Features
name value
0 elapsed_time 4.919824
1 active_cpu_ratio 99.000000
# of threads: 4
MPSTAT Profiling:
CPU Utilization (%):
core USR SYS IDL IOW IRQ
0 97 1 0 0 0
1 97 0 1 0 0
2 98 0 1 0 0
3 94 2 3 0 0
4 0 0 100 0 0
5 0 0 100 0 0
6 0 0 100 0 0
7 0 0 99 0 0
CPU Time (s):
core USR SYS IDL IOW IRQ
0 5.73 0.11 0.05 0.00 0.00
1 5.76 0.02 0.11 0.00 0.00
2 5.80 0.04 0.06 0.00 0.00
3 5.54 0.15 0.20 0.00 0.00
4 0.00 0.00 5.90 0.00 0.00
5 0.00 0.00 5.90 0.00 0.00
6 0.00 0.00 5.90 0.00 0.00
7 0.00 0.01 5.89 0.00 0.00
Active CPU Time (s): 23.159
Active CPU ratio (%): 49
Final Performance Features
name value
0 elapsed_time 5.84204
1 active_cpu_ratio 49.00000
17. Case Study: Multi-threading of PI Calculation (cont.)
# of threads: 2
Active CPU Time (s): 22.726
Active CPU ratio (%): 25
Final Performance Features
name value
0 elapsed_time 11.344319
1 active_cpu_ratio 25.000000
# of threads: 8
Active CPU Time (s): 39.090
Active CPU ratio (%): 99
Final Performance Features
name value
0 elapsed_time 4.919824
1 active_cpu_ratio 99.000000
# of threads: 4
Active CPU Time (s): 23.159
Active CPU ratio (%): 49
Final Performance Features
name value
0 elapsed_time 5.84204
1 active_cpu_ratio 49.00000
Experiences:
● Active CPU time should stay roughly constant as the number of threads increases
● Hyper-threading overprovisions CPU resources, so scale-up beyond the physical core count is non-linear
(Figure: sequential execution time divided into per-thread portions t)
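The "Experiences" bullets can be quantified from the three runs above:

```python
# Elapsed time (s) and active CPU time (s) from the three runs above
elapsed = {2: 11.344319, 4: 5.84204, 8: 4.919824}
active = {2: 22.726, 4: 23.159, 8: 39.090}

for n in (2, 4, 8):
    speedup = elapsed[2] / elapsed[n]   # relative to the 2-thread run
    print(f"{n} threads: speedup x{speedup:.2f}, active CPU {active[n]:.1f} s")
```

Going from 2 to 4 threads nearly halves the elapsed time (x1.94), but 8 threads reach only x2.31 while active CPU time inflates from ~23 s to ~39 s: the hyper-threaded logical cores accumulate busy time without proportional progress.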
18. Case Study: CUDA Memory Copy
Question: what causes the network traces (i.e., tcpdump traces) to appear?
Command:
sofa record ~/NVIDIA_CUDA-9.1_Samples/1_Utilities/bandwidthTest/bandwidthTest
21. SOFA Advanced Usage (cont.)
More performance counters (perf events):
sofa record "dd if=/dev/zero of=dummy.out bs=10M count=500" --perf_events="cycles,instructions,cache-misses,branch-misses"
CUDA API tracing:
sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --cuda_api_tracing
System-call tracing:
sofa record "~/samples/1_Utilities/bandwidthTest/bandwidthTest" --enable_strace
Network-packet tracing:
sofa record "sleep 5" --enable_tcpdump
Background recording for daemons or multi-command bash scripts (record system-wide while the target runs):
sofa record "sleep 20" --profile_all_cpus
Then, execute the target command.
22. SOFA Advanced Usage (cont.)
Verbose mode shows more information, such as report-generation progress and detailed statistics (e.g., total system-call time):
sofa report --verbose
Automatically identify iterative swarms and expose a per-iteration performance summary:
sofa report --enable_aisi --num_iterations 20
Display the top-10 hotspot swarms, highlighted in different colors:
sofa report --verbose --display_swarms
Reduce the number of points shown on visualization interfaces
sofa report --plot_ratio 10
Absolute or relative (default) timestamps:
sofa report
sofa report --absoluate_timestamp
23. SOFA Advanced Usage (cont.)
Apply filters to highlight traces of interest:
sofa report --cpu_filters='tensorflow:orange' --gpu_filters='fw:blue' --gpu_filters='bw:red' --gpu_filters='nccl:purple'
Compare traces from two runs swarm-by-swarm to find swarms affected by hardware/software/system changes:
sofa record "dd if=/dev/zero of=dummy.out bs=100M count=10" --logdir log1
sofa record "dd if=/dev/zero of=dummy.out bs=10M count=100" --logdir log2
sofa diff --base_logdir log1 --match_logdir log2
Get performance-tuning suggestions from a POTATO server (Foxconn's in-house Performance Optimization & Auto-tuning Orchestration):
sofa report --potato_server "192.168.0.100:5000"