The power of linux advanced tracer [POUG18]
1. THE POWER OF LINUX
ADVANCED TRACER
HATEM MAHMOUD
HTTPS://MAHMOUDHATEM.WORDPRESS.COM
HIGH FIVE POUG
2. 2
WHO AM I
Oracle DBA
Oracle experience: 7 years
Located in TUNISIA
Oracle Certified Master
Oracle geek
https://mahmoudhatem.wordpress.com
3. 3
TAKEAWAYS
• A better understanding of the Linux tracing landscape
• An idea of what can be done
As someone else said :
“Knowing what can be done is more important than knowing how to do it - you can
always google that”
4. 4
AGENDA
1. Linux tracing landscape
2. Static tracing
3. Dynamic tracing
4. Monkey patching
5. Deeper look at CPU utilization
8. 8
LINUX TRACING SYSTEMS
• Front-end tools: systemtap, perf, bcc, pmu-tools, etc.
• Mechanisms for extracting data: stap module, eBPF, perf_events (perf_event_open
syscall), ftrace (/sys/kernel/debug/tracing), etc.
• Event sources:
• kprobes and uprobes (dynamic tracing)
• tracepoints, software events and USDT (static tracing)
• PMCs (hardware counters)
• etc.
Breakdown as suggested by Brendan Gregg and Julia Evans:
https://jvns.ca/blog/2017/07/05/linux-tracing-systems/
13. 13
STATIC TRACING
Tracepoints:
• Trace probes predefined in the kernel
• Inserted by kernel developers at important locations in
the code (system calls, disk I/O, etc.)
User Statically-Defined Tracing (USDT):
• Trace probes predefined in the application
• Inserted by application developers at important
locations in the code
Software events:
• Kernel counters (CPU migrations, minor faults, major
faults, etc.)
http://www.brendangregg.com/perf.html
15. 15
BCC/TOOLS : BIOLATENCY SUMMARIZE BLOCK DEVICE I/O
LATENCY AS A HISTOGRAM
https://github.com/iovisor/bcc/blob/master/tools/biolatency_example.txt
• Traditional tools such as iostat and
sar show average latency, which
can be misleading (averages hide
latency outliers)
• Need to study the full distribution
• biolatency instruments the kernel
block layer (blk_start_request,
blk_account_io_completion, etc.)
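To illustrate why an average can hide outliers, here is a minimal sketch that buckets latencies into power-of-two ranges, the same shape of output that biolatency prints. It is plain Python for illustration only, not the bcc implementation; the function name log2_histogram and the sample latencies are hypothetical.

```python
from collections import Counter

def log2_histogram(latencies_us):
    """Count latencies per power-of-two bucket [2^k, 2^(k+1))."""
    hist = Counter()
    for lat in latencies_us:
        # For an integer n >= 1, n.bit_length() - 1 equals floor(log2(n)).
        k = max(int(lat), 1).bit_length() - 1
        hist[k] += 1
    return dict(sorted(hist.items()))

# Hypothetical sample: 98 fast I/Os and 2 slow outliers (microseconds).
samples = [100] * 98 + [50_000] * 2
mean_us = sum(samples) / len(samples)
hist = log2_histogram(samples)

print(f"mean = {mean_us:.0f} us")
for k, count in hist.items():
    print(f"{2**k:>6} -> {2**(k + 1) - 1:<6} : {count}")
```

With this sample the mean is 1098 us, a value that describes neither the 100 us majority nor the 50 ms outliers; the histogram shows both groups separately.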
16. 16
BCC/TOOLS : EXT4SLOWER TRACE SLOW EXT4 OPERATIONS.
https://github.com/iovisor/bcc/blob/master/tools/ext4slower_example.txt
• Better measure of the latency
suffered by applications reading
from the file system.
• The measured latency spans:
• block device I/O (disk I/O)
• file system CPU cycles
• file system locks
• run queue latency
• etc
Great CPU
saturation metric !
17. 17
BCC/TOOLS : RUNQLAT: RUN QUEUE (SCHEDULER)
LATENCY AS A HISTOGRAM
https://github.com/iovisor/bcc/blob/master/tools/runqlat_example.txt
• The best CPU saturation metrics
are measures of run queue (or
scheduler) latency.
• Run queue latency: the time a task
spends waiting on a run queue for
a turn on-CPU
• Better than the run queue length
metric for estimating the
magnitude of CPU saturation !
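As a rough sketch of the metric described above, run queue latency can be derived by pairing the moment a task becomes runnable with the moment it is switched on-CPU. The event kinds "wakeup" and "switch_in", the tuple layout and the sample timestamps are hypothetical; this simplification only considers wakeups and is not how runqlat itself is written.

```python
def run_queue_latencies(events):
    """Pair each task's wakeup with its next switch-in.

    events: iterable of (timestamp_us, kind, pid) where kind is
    "wakeup" (task became runnable) or "switch_in" (task got a CPU).
    Returns a list of (pid, latency_us) in switch-in order.
    """
    enqueued = {}   # pid -> timestamp at which the task became runnable
    latencies = []
    for ts, kind, pid in sorted(events):
        if kind == "wakeup":
            # Keep the first wakeup if several arrive before the switch-in.
            enqueued.setdefault(pid, ts)
        elif kind == "switch_in" and pid in enqueued:
            latencies.append((pid, ts - enqueued.pop(pid)))
    return latencies

# Hypothetical trace: two tasks competing for one CPU.
trace = [
    (0, "wakeup", 1),
    (5, "wakeup", 2),
    (30, "switch_in", 1),
    (80, "switch_in", 2),
]
print(run_queue_latencies(trace))
```

Feeding these latencies into a histogram gives the distribution view, which shows how long tasks wait rather than only how many are waiting.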
20. 20
SYSTEMTAP : SCHEDTIMES_WSI.STP : TRACK TIME
PROCESSES SPEND IN VARIOUS STATES
https://mahmoudhatem.wordpress.com/2017/02/06/extending-systemtap-scripts-with-oracle-session-info/
• Bring application context to your monitoring tools !
22. 22
BCC/TOOLS : DBSLOWER: TRACE MYSQL/POSTGRESQL
QUERIES SLOWER THAN A THRESHOLD
https://github.com/iovisor/bcc/blob/master/tools/dbslower_example.txt
• dbslower is based on USDT probes
(needs MySQL or PostgreSQL
built with USDT (DTrace) support)
25. 25
DYNAMIC TRACING
• Dynamically instrumenting (creating events
in) any software location.
• kprobes: kernel dynamic tracing
• uprobes: user-level dynamic tracing
• No need to modify the probed process's
binaries or restart the program.
26. 26
DYNAMIC TRACING (UPROBE)
• Function prologue of “kskthewt” (called at the end of an Oracle wait event) before inserting
the probe point:
• After inserting a probe point at the function entry, the original opcode is replaced with
int3 (breakpoint software interrupt).
https://mahmoudhatem.wordpress.com/2017/03/21/uprobes-issue-with-oracle-12c/
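The opcode replacement described on this slide can be mimicked on a byte buffer. This is only a sketch of the idea: the bytes below are a typical x86-64 prologue (push %rbp; mov %rsp,%rbp) chosen for illustration, not the actual kskthewt bytes, and the helper names are hypothetical. The real work is done by the kernel, which saves the original instruction so it can still be executed.

```python
INT3 = 0xCC  # x86 one-byte breakpoint instruction

def insert_breakpoint(code: bytearray, offset: int) -> int:
    """Overwrite one byte with int3 and return the original byte."""
    original = code[offset]
    code[offset] = INT3
    return original

def remove_breakpoint(code: bytearray, offset: int, original: int) -> None:
    """Restore the byte that was saved when the probe was inserted."""
    code[offset] = original

# Hypothetical prologue bytes: 55 = push %rbp, 48 89 e5 = mov %rsp,%rbp
text = bytearray.fromhex("554889e5")
saved = insert_breakpoint(text, 0)
probed = text.hex()
print(probed)        # cc4889e5
remove_breakpoint(text, 0, saved)
print(text.hex())    # 554889e5
```

Only the first byte changes, which is why the probed process needs neither modified binaries nor a restart.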
29. 29
SYSTEMTAP : AGGREGATIONS AND FILTERING OF
WAIT EVENT DATA
https://externaltable.blogspot.com/2014/09/systemtap-into-oracle-for-fun-and-profit.html
Collect and display microsecond-precision histograms for all Oracle versions (note: 12.1.0.2 has V$EVENT_HISTOGRAM_MICRO)
What are this wait event and the
other I/O wait events really
measuring?
30. 30
SYSTEMTAP : WHAT ARE THE I/O-RELATED WAIT EVENTS
REALLY MEASURING? [TRACING LOGICAL AND PHYSICAL I/O ]
https://externaltable.blogspot.com/2014/11/life-of-oracle-io-tracing-logical-and.html
The elapsed time for the wait event
"direct path read" does not
accurately reflect I/O latency
32. 32
SYSTEMTAP : A SIMPLE USER/PASSWORD SNIFFER
https://mahmoudhatem.wordpress.com/2018/03/23/systemtap-probe-at-specific-oracle-function-offset-bonus/
• Powerful and scary at the same time !
36. 36
SYSTEMTAP : FROM MEMORY REQUEST TO PL/SQL SOURCE LINE
https://mahmoudhatem.wordpress.com/2018/01/15/from-memory-request-to-pl-sql-source-line/
Based on v$process_memory_detail
38. 38
SYSTEMTAP : A MINI ORACLE DB FIREWALL [LIVE PATCHING]
https://mahmoudhatem.wordpress.com/2016/04/18/systemtap-a-mini-oracle-db-firewall/
https://externaltable.blogspot.com/2016/03/systemtap-guru-mode-and-oracle-sql.html
39. 39
SYSTEMTAP : PLAYING WITH ORACLE DB 18C ON-PREMISES BEFORE
OFFICIAL RELEASE
https://mahmoudhatem.wordpress.com/2018/03/01/playing-with-oracle-db-18c-on-premises-before-official-release/
40. 40
DEEPER LOOK AT CPU
UTILIZATION
• Which code paths are causing high CPU usage?
• What is my CPU bottleneck?
• How much are my CPUs stalled? On what resource?
41. 41
CPU PROFILING
• Linux advanced tracer tools are capable of lightweight CPU profiling by stack
sampling, for example:
• SystemTap
• perf
• bcc
• To quickly understand CPU usage, the collected profiling data can be visualized using
flame graphs.
http://www.brendangregg.com/flamegraphs.html
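Between sampling and drawing there is a folding step: identical stacks are collapsed into one "frame;frame;frame count" line, which is the input format the flame graph scripts consume. A minimal sketch, assuming stacks are already captured as lists of function names (root first); fold_stacks and the function names in the sample are hypothetical.

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse stack samples into 'root;...;leaf count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Hypothetical samples collected by a profiler (root frame first).
samples = [
    ["main", "run_query", "fetch_block"],
    ["main", "run_query", "fetch_block"],
    ["main", "run_query", "sort_rows"],
    ["main", "log_stats"],
]
for line in fold_stacks(samples):
    print(line)
```

In the resulting graph the width of each frame is proportional to its sample count, so the widest towers are the code paths consuming the most CPU.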
44. 44
EXTENDED FLAMEGRAPH : PL/SQL PROGRAM AND LINE NUMBER
https://mahmoudhatem.wordpress.com/2017/09/22/geeky-plsql-tracerprofiler-another-step/
45. 45
BUT WHAT WERE THOSE FUNCTIONS DOING WHEN
THEY WERE ON-CPU ? RUNNING OR STALLED ?
46. 46
CPU UTILIZATION IS WRONG
http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
47. 47
WHEN THE CPU UTILIZATION DOES NOT TELL YOU
THE UTILIZATION OF THE CPU
PERFORMANCE MONITORING COUNTERS - A BETTER WAY TO MEASURE CPU UTILIZATION
* The next sections cover Intel platforms only
48. 48
HARDWARE EVENTS (PMC)
• PMCs instrument low-level processor activity
• Can be used to understand how efficiently a workload uses processor resources (CPU caches,
MMU, memory buses, CPU interconnects, execution units, etc.)
• PMCs:
• Core: values measured on a single core
• Uncore: shared, socket-wide values
51. 51
HIGH-LEVEL METRICS (IPC A GENERAL EFFICIENCY METRIC )
• Events can be observed and combined to create useful high-level metrics such as Instructions per
Cycle (IPC)
* Modern superscalar processors can issue multiple instructions per cycle
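IPC is simply the ratio of two counters, retired instructions and unhalted cycles, and CPI is its inverse. A small sketch with hypothetical counter values (on Linux such counts could come from something like perf stat -e cycles,instructions):

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: higher means more work done per clock."""
    return instructions / cycles

def cpi(instructions: int, cycles: int) -> float:
    """Cycles per instruction: the inverse view, higher means more stalls."""
    return cycles / instructions

# Hypothetical counter readings for one measurement interval.
instructions = 8_000_000_000
cycles = 10_000_000_000

print(f"IPC = {ipc(instructions, cycles):.2f}")
print(f"CPI = {cpi(instructions, cycles):.2f}")
```

Here IPC is 0.80 and CPI is 1.25: the CPU looks 100% busy to the operating system, yet it retires less than one instruction per cycle on a core able to issue several.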
52. 52
CPI FLAME GRAPH
• The color now shows what that
function was doing when it was on-
CPU: running or stalled
• Highest CPI blue (slowest
instructions)
• Lowest CPI red (fastest
instructions)
• Visualization of CPU efficiency by
function.
https://mahmoudhatem.wordpress.com/2017/10/26/deeper-look-at-cpu-utilization-the-power-of-pmu-events/
Figure annotation: get consistent read
53. 53
IPC INTERPRETATION AND ACTIONABLE ITEMS
http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
• A good starting point for identifying what the CPU is really doing is IPC (instructions per cycle)
54. 54
WHERE ARE WE REALLY WASTING OUR PRECIOUS CPU CYCLES ?
False data sharing
Split Stores
Loads Blocked by Store Forwarding
4K Aliasing
DTLB miss
Microcode assists
Memory Bandwidth
Memory Latency
Bad speculation
Port Utilization
L1 miss
L2 miss
Vectorization
Remote DRAM
58. 58
MEASURING IPC IS A GOOD STARTING POINT BUT HOW
TO DRILL DOWN FURTHER ?
A specific microarchitecture may make hundreds of events available through its PMU!
Which events are useful for detecting the true bottleneck?
Answering this requires in-depth knowledge of both the microarchitecture design and the PMU specifications!
“Analysis without a methodology can become a fishing expedition, where
metrics are examined ad hoc, until the issue is found –if it is at all.”
Source: Brendan D. Gregg,
http://www.brendangregg.com/methodology.html
59. 59
TOP-DOWN MICRO-ARCHITECTURE ANALYSIS
METHOD [ TMAM ]
• Systematically finds the true bottleneck (eliminates guesswork)
• Provides a hierarchical breakdown of execution cycles (CPI breakdown)
• Avoids the steep µ-arch learning curve
• Correctly characterizes all workloads
• Frequent performance bottlenecks are organized in a hierarchical structure
https://software.intel.com/en-us/vtune-amplifier-help-tuning-applications-using-a-top-down-microarchitecture-analysis-method
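The first level of the hierarchy splits all pipeline issue slots into four categories. The sketch below follows the commonly published level-1 formulas for 4-wide Intel cores; the event names, the pipeline width of 4 and the counter values are assumptions that vary by microarchitecture, so treat it as an illustration rather than what VTune computes internally.

```python
from dataclasses import dataclass

PIPELINE_WIDTH = 4  # assumed issue slots per cycle

@dataclass
class Counters:
    clk_unhalted_thread: int          # CPU_CLK_UNHALTED.THREAD
    idq_uops_not_delivered_core: int  # IDQ_UOPS_NOT_DELIVERED.CORE
    uops_issued_any: int              # UOPS_ISSUED.ANY
    uops_retired_retire_slots: int    # UOPS_RETIRED.RETIRE_SLOTS
    int_misc_recovery_cycles: int     # INT_MISC.RECOVERY_CYCLES

def top_level(c: Counters) -> dict:
    """Return the four level-1 categories as fractions of all slots."""
    slots = PIPELINE_WIDTH * c.clk_unhalted_thread
    frontend = c.idq_uops_not_delivered_core / slots
    bad_spec = (c.uops_issued_any - c.uops_retired_retire_slots
                + PIPELINE_WIDTH * c.int_misc_recovery_cycles) / slots
    retiring = c.uops_retired_retire_slots / slots
    # Whatever is left was lost waiting on back-end resources.
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }

# Hypothetical counter values for one interval.
sample = Counters(1000, 400, 1300, 1200, 25)
for name, frac in top_level(sample).items():
    print(f"{name:16s} {frac:6.1%}")
```

The drill-down then continues only inside the dominant category (here back-end bound), which is what keeps the analysis from becoming a fishing expedition.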
63. 63
INTEL VTUNE : GENERAL EXPLORATION
https://software.intel.com/en-us/intel-vtune-amplifier-xe
64. 64
INTEL VTUNE : GROUPING BY FUNCTION/CALL STACK
https://software.intel.com/en-us/intel-vtune-amplifier-xe
Figure annotations: get consistent read; kernel data scan table full
65. 65
TMAM EXAMPLE
Test env: Oracle 12.2.0.1 / OEL 7.0 / kernel 3.10 / processor i5-6500 / 2*DDR3-1600 (4GB*2)
Testing the impact of huge pages with the SLOB LIO test and Intel VTune
67. 67
WITHOUT HUGEPAGES : LIOPS 3 099 420
DTLB overhead was measured using the following formula
68. 68
WITH HUGEPAGES : LIOPS 3 415 969 (about 10% improvement)
Workload Characterization
How much ??
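The "about 10%" figure follows directly from the two LIOPS values reported for the runs without and with huge pages. A quick check (the function name is hypothetical):

```python
def improvement_pct(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

liops_without_hugepages = 3_099_420
liops_with_hugepages = 3_415_969

gain = improvement_pct(liops_without_hugepages, liops_with_hugepages)
print(f"{gain:.1f}% more logical I/Os per second")
```

This gives roughly 10.2%.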
69. 69
MEASURING MEMORY THROUGHPUT
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Tools_Linux_Memory_Perf_Measure.md
• Other tools that can be used to measure memory throughput and many other metrics (QPI utilisation,
power consumption, local and remote memory bandwidth, etc.):
• Intel Processor Counter Monitor (PCM)
• Likwid
• pmu-tools
• perf (e.g. MEM_BW_READS = CAS_COUNT.RD * 64, where 64 bytes is the cache line size)
https://yunmingzhang.wordpress.com/2015/07/22/measure-memory-bandwidth-using-uncore-counters/
High memory bandwidth
utilization can have an impact
on main memory latency !
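The perf formula above (CAS_COUNT.RD multiplied by the 64-byte cache line size) turns a raw uncore counter into bytes read from memory. A minimal sketch with a hypothetical counter delta and sampling interval:

```python
CACHE_LINE_BYTES = 64  # each CAS read transfers one cache line

def read_bandwidth_gb_per_s(cas_count_rd: int, interval_s: float) -> float:
    """Memory read bandwidth in GB/s from a CAS_COUNT.RD delta."""
    bytes_read = cas_count_rd * CACHE_LINE_BYTES
    return bytes_read / interval_s / 1e9

# Hypothetical: 156 250 000 read CAS commands counted over 1 second.
bw = read_bandwidth_gb_per_s(156_250_000, 1.0)
print(f"read bandwidth = {bw:.1f} GB/s")
```

On multi-channel systems the counter has to be summed over all memory channels before applying the formula.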
70. 70
MEMORY BANDWIDTH VS LATENCY RESPONSE CURVE
• Even though these two concepts are often described independently, they are inherently interrelated.
• According to Bruce Jacob in “The memory system: you can’t avoid it, you can’t ignore it, you can’t
fake it”, the bandwidth vs latency response curve for a system has three regions:
• Constant region: the latency response is fairly constant for the first 40% of the sustained bandwidth.
• Linear region: between 40% and 80% of the sustained bandwidth, the latency response increases almost linearly with
the bandwidth demand of the system, due to contention overhead from numerous memory requests.
• Exponential region: between 80% and 100% of the sustained bandwidth, the memory latency is dominated by the
contention latency, which can be as much as twice the idle latency or more.
• Maximum sustained bandwidth: 65% to 75% of the theoretical maximum bandwidth.
https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/
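The three regions can be turned into a simple rule of thumb for a measured bandwidth. The sketch below uses the 40% and 80% thresholds and the 65% to 75% sustained range quoted above; the function names and the 25.6 GB/s theoretical peak (an assumed dual-channel DDR3-1600 configuration) are illustrative assumptions.

```python
def latency_region(used_bw: float, sustained_bw: float) -> str:
    """Classify a bandwidth demand into the curve's three regions."""
    utilization = used_bw / sustained_bw
    if utilization < 0.40:
        return "constant"
    if utilization < 0.80:
        return "linear"
    return "exponential"

def sustained_range(theoretical_bw: float) -> tuple:
    """Expected maximum sustained bandwidth: 65% to 75% of theoretical."""
    return 0.65 * theoretical_bw, 0.75 * theoretical_bw

# Assumed peak: 2 channels * 12.8 GB/s (DDR3-1600) = 25.6 GB/s.
low, high = sustained_range(25.6)
print(f"sustained bandwidth: {low:.1f} to {high:.1f} GB/s")
for used in (5.0, 10.0, 16.0):
    print(f"{used:5.1f} GB/s -> {latency_region(used, low)} region")
```

A workload sitting in the exponential region is likely paying a memory latency penalty even though no single counter reports "memory is saturated".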
71. 71
MEMORY BANDWIDTH VS LATENCY RESPONSE CURVE
• Visualization of how memory latency is affected by the increase of the memory bandwidth
consumption.
• Armed with Intel Memory Latency Checker (MLC) let’s check our current system !
https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/
72. 72
“PMCS ARE CRUCIAL FOR ANALYZING A (IF NOT THE)
MODERN SYSTEM BOTTLENECK: MEMORY I/O.”
http://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html
Brendan Gregg
73. 73
THANK YOU FOR YOUR
ATTENTION
https://mahmoudhatem.wordpress.com
@Hatem__Mahmoud
https://linkedin.com/in/mahmoudhatemoracle