Twan Koot - Beyond the % usage, an in-depth look into monitoring

Beyond the % usage, an in-depth look into monitoring.
Twan Koot

Introduction
• Twan Koot, 26 years old
• Senior Performance tester / Engineer @
• 5 years of IT experience.
• Loves: fast IT solutions on small hardware.
• Hates: unfounded decisions on IT architecture.
• Outside of work, I love photography and riding motorcycles.

Beyond the % usage, an in-depth look into monitoring.
• Topics:
• A method for analyzing resource monitoring data.
• Showcasing some monitoring metrics.
• An introduction to BCC tools.

Monitoring – the basics – Analyze
• So, you have run a performance test and even had monitoring running.
• Now we can do the basic three-step dance:
• Check CPU, RAM and I/O.
• Check whether any counter exceeds a static threshold, such as % usage.
• Check whether peaks in resource usage coincide with peaks in response times.
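
As a rough illustration of steps 2 and 3, the sketch below flags samples above a static threshold and checks how many response-time peaks line up with a resource peak. The function names, the 80% and 400 ms limits and the sample data are all assumptions made up for this example, not values from the talk.

```python
from typing import List


def over_threshold(samples: List[float], threshold: float) -> List[int]:
    """Return the indexes of samples that exceed a static threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


def peak_overlap(resource: List[float], response: List[float],
                 resource_limit: float, response_limit: float) -> float:
    """Fraction of response-time peaks that coincide with a resource peak."""
    resource_peaks = set(over_threshold(resource, resource_limit))
    response_peaks = over_threshold(response, response_limit)
    if not response_peaks:
        return 0.0
    hits = sum(1 for i in response_peaks if i in resource_peaks)
    return hits / len(response_peaks)


# Hypothetical per-interval samples: CPU % and response time in ms.
cpu_pct = [35.0, 42.0, 91.0, 95.0, 40.0, 38.0]
resp_ms = [120.0, 130.0, 480.0, 510.0, 450.0, 125.0]

print(over_threshold(cpu_pct, 80.0))
print(peak_overlap(cpu_pct, resp_ms, 80.0, 400.0))
```

In this made-up data one slow interval has no matching CPU peak, which is the kind of gap a static threshold alone cannot explain.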

Monitoring – the basics – Dashboard hype

Monitoring – USE – Brendan Gregg
• The USE method enables a methodical approach to analyzing resources.
• It was developed by Brendan Gregg: http://www.brendangregg.com
"Industry expert in computing performance and cloud computing. Solves
hard problems. Makes things faster."

Monitoring – USE – USE method
• Utilization – The average time a resource was busy servicing work.
• Saturation – The degree to which a resource has extra work it can't handle, which is usually queued.
• Errors – The number of error events.
Resource           | Utilization (easy)    | Saturation (moderate)                | Errors (hard)
CPU                | CPU utilization (%)   | Run-queue length / scheduler latency | Correctable CPU cache ECC events or faulted CPUs
Memory             | Available free memory | Anonymous paging or thread swapping  | Failed malloc()s
Storage device I/O | Device busy %         | Wait queue length                    | Device errors
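
One way to keep the table at hand during analysis is to store it as a small checklist. The sketch below is only an illustration: the `UseChecks` class, the `USE_CHECKLIST` dictionary and the `unchecked` helper are hypothetical names, and the metric descriptions are copied from the table above.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class UseChecks:
    utilization: str
    saturation: str
    errors: str


# Metric descriptions taken from the USE table above.
USE_CHECKLIST: Dict[str, UseChecks] = {
    "CPU": UseChecks(
        "CPU utilization (%)",
        "Run-queue length / scheduler latency",
        "Correctable CPU cache ECC events or faulted CPUs",
    ),
    "Memory": UseChecks(
        "Available free memory",
        "Anonymous paging or thread swapping",
        "Failed malloc()s",
    ),
    "Storage device I/O": UseChecks(
        "Device busy %",
        "Wait queue length",
        "Device errors",
    ),
}


def unchecked(done: Set[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """List the (resource, metric type) cells that were not reviewed yet."""
    cells = []
    for resource in USE_CHECKLIST:
        for kind in ("utilization", "saturation", "errors"):
            if (resource, kind) not in done:
                cells.append((resource, kind))
    return cells


# Example: only CPU utilization has been looked at so far.
for resource, kind in unchecked({("CPU", "utilization")}):
    print(resource, kind, "->", getattr(USE_CHECKLIST[resource], kind))
```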

Monitoring – USE – The flow
• How do we apply the USE method?
• We can use the following flow:

Monitoring – Let's go deeper
Lots of tools

Monitoring – Let's go deeper – CPU Utilisation
• One of the most measured metrics during a performance test.
• What does 80% utilization even mean?
• Overloaded?
One of the most misread metrics!

Monitoring – Let's go deeper – CPU Utilisation
• When checking the utilization counter we may observe: Busy | Idle
• What is actually happening: Busy | Waiting ("stalled") | Waiting ("idle")
• What are we waiting for?
• I/O
• Memory
• So, is CPU utilization wrong? It is still a good starting point for monitoring.
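
To make the "% usage" number less of a black box, here is a minimal sketch of how such a figure can be derived on Linux from two samples of the aggregate `cpu` line in `/proc/stat`. The function names and the sample lines are made up for illustration. Note that in this calculation cycles stalled on memory still count as busy, which is why the counter alone cannot tell "busy" from "stalled".

```python
from typing import List


def cpu_ticks(stat_line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into tick counters."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Order: user, nice, system, idle, iowait, irq, softirq, steal
    return [int(x) for x in fields[1:9]]


def utilization_pct(before: str, after: str) -> float:
    """CPU utilization between two samples, as a percentage."""
    deltas = [b - a for a, b in zip(cpu_ticks(before), cpu_ticks(after))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    not_busy = deltas[3] + deltas[4]  # idle + iowait
    return 100.0 * (total - not_busy) / total


# Hypothetical samples; on a real system, read the first line of /proc/stat twice.
sample_1 = "cpu  100 0 50 800 50 0 0 0 0 0"
sample_2 = "cpu  180 0 70 900 50 0 0 0 0 0"
print(utilization_pct(sample_1, sample_2))
```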

Monitoring – Let's go deeper – CPU Saturation
• CPU saturation -> run queue
• Using Nmon "k"
• We see a run queue of 9; this will cause latency.
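
The same kind of number can be read directly on Linux: `/proc/stat` exposes `procs_running` and `procs_blocked`. The sketch below parses those two fields from a made-up sample and compares the runnable count to an assumed CPU count of 4. The helper names and the "runnable tasks minus CPUs" heuristic are illustrative assumptions, not a rule from the talk.

```python
from typing import Dict


def parse_queue_counters(stat_text: str) -> Dict[str, int]:
    """Extract procs_running and procs_blocked from /proc/stat content."""
    wanted = ("procs_running", "procs_blocked")
    result = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in wanted:
            result[parts[0]] = int(parts[1])
    return result


def excess_runnable(procs_running: int, cpu_count: int) -> int:
    """Runnable tasks beyond the number of CPUs (rough saturation hint)."""
    return max(0, procs_running - cpu_count)


# Hypothetical excerpt; on a real system use open("/proc/stat").read().
SAMPLE_STAT = """cpu  180 0 70 900 50 0 0 0 0 0
ctxt 123456
procs_running 9
procs_blocked 1
"""

counters = parse_queue_counters(SAMPLE_STAT)
print(counters)
print(excess_runnable(counters["procs_running"], cpu_count=4))
```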

Monitoring – Let's go deeper – Memory Utilisation
• Using Nmon “m”
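
For scripted checks, a similar memory utilization figure can be derived from `/proc/meminfo` on Linux. This is a minimal sketch: the function names and the sample values are made up, and treating `MemTotal - MemAvailable` as "used" is one common convention rather than the only one.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into a name -> value (kB) mapping."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_used_pct(meminfo: Dict[str, int]) -> float:
    """Share of memory that is not available, as a percentage."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


# Hypothetical excerpt; on a real system use open("/proc/meminfo").read().
SAMPLE_MEMINFO = """MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB
SwapTotal:       2000000 kB
SwapFree:        1500000 kB
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print(memory_used_pct(info))
```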

Monitoring – Let's go deeper – Memory Saturation
• Using Nmon "m" and "V"
• We can see a lot of paging activity and heavy use of swap space.
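
Swap activity can also be tracked as a rate. On Linux, `/proc/vmstat` contains the cumulative counters `pswpin` and `pswpout` (pages swapped in and out since boot). The sketch below turns two samples into pages per second. The function names, the sample values and the 5-second interval are assumptions for illustration.

```python
from typing import Dict, Tuple


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse /proc/vmstat content into a name -> counter mapping."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rate(before: Dict[str, int], after: Dict[str, int],
              interval_s: float) -> Tuple[float, float]:
    """Pages swapped in and out per second between two samples."""
    swapped_in = (after["pswpin"] - before["pswpin"]) / interval_s
    swapped_out = (after["pswpout"] - before["pswpout"]) / interval_s
    return swapped_in, swapped_out


# Hypothetical samples taken 5 seconds apart.
first = parse_vmstat("pgfault 100000\npswpin 1000\npswpout 5000\n")
second = parse_vmstat("pgfault 180000\npswpin 1500\npswpout 9000\n")
print(swap_rate(first, second, interval_s=5.0))
```

A sustained non-zero swap-out rate during a test is a stronger saturation signal than the amount of swap space in use at one moment.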

Monitoring – Let's go deeper – IO Utilisation
• Using Nmon "d"
• We can see multiple counters for measuring I/O utilization.
• We can measure the amount of data read from and written to the disk and compare this to the device specs.
• Here we read 3726.2 transfers/sec.

Monitoring – Let's go deeper – IO Saturation
• Using iostat -d, filtered to a specific disk.
• We can see a queue of ~43 requests (1-second interval).
• Queueing means latency.
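
Why queueing means latency can be made concrete with Little's law (average queue length = throughput × average time in the system). The sketch below rearranges it into a rough time estimate. The queue length of 43 comes from the slide, while the throughput of 1000 requests per second is a purely hypothetical figure, and the estimate assumes a steady state.

```python
def estimated_time_ms(queue_length: float, completions_per_s: float) -> float:
    """Rough average time per request via Little's law: W = L / lambda."""
    if completions_per_s <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 * queue_length / completions_per_s


# 43 queued requests (from the slide) at an assumed 1000 completions per second.
print(estimated_time_ms(43, 1000))
```

Under these assumptions each request spends roughly 43 ms in the device queue, which adds directly to response time.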

Monitoring – Let's go deeper – eBPF
• "Extended" Berkeley Packet Filter.
• An in-kernel virtual machine for running small, sandboxed programs.
• Gives access to many new metrics about the kernel, performance, the scheduler and more.
• Some use cases:
• Deep performance analysis
• Network tracing
• DDoS mitigation/detection

Monitoring – Let's go deeper – BCC
BPF Compiler Collection (BCC)
“BCC is a toolkit for creating efficient kernel tracing and manipulation
programs and includes several useful tools and examples.”

Monitoring – BCC – Overview
Lots of tools

Monitoring – BCC – CPU saturation / Runqlat
• Runqlat is used to measure scheduler latency.
• We can even filter to a specific PID:
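
Runqlat reports its measurements as a power-of-two histogram of run-queue latencies. The sketch below only illustrates that bucketing in plain Python so the output is easier to read. It does not use BCC or eBPF, and the function names and latency samples are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def log2_bucket(value_us: int) -> Tuple[int, int]:
    """Return the (low, high) power-of-two bucket that contains value_us."""
    if value_us < 0:
        raise ValueError("latency cannot be negative")
    if value_us <= 1:
        return (0, 1)
    low = 1 << (value_us.bit_length() - 1)
    return (low, 2 * low - 1)


def histogram(latencies_us: Iterable[int]) -> Dict[Tuple[int, int], int]:
    """Count samples per power-of-two bucket, sorted by bucket."""
    counts = Counter(log2_bucket(v) for v in latencies_us)
    return dict(sorted(counts.items()))


# Hypothetical run-queue latencies in microseconds.
samples_us = [0, 1, 3, 5, 6, 12, 900]
for (low, high), count in histogram(samples_us).items():
    print(f"{low:>6} -> {high:<6} : {count}")
```

A long tail in the higher buckets means threads waited a long time for a CPU, even when average utilization looks acceptable.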

Monitoring – BCC – Cachestat / Cachetop
• With Cachestat we can observe cache stats.
• With Cachetop we can show the same stats with more information.

Monitoring – BCC – Biolatency / Filetop
• BCC contains powerful tools such as Biolatency, which measures disk I/O latency:
• Using Filetop we can observe metrics about file activity.

Monitoring – BCC – Fileslower / Filelife
• We can measure file reads and writes that are slower than a threshold:
• We can also measure the reads and writes to files using Filetop:

Monitoring – BCC – TCPlife
• Used for tracing TCP sessions that open and close.

Monitoring – Hopes for after this showcase
• More performance engineers start using a methodical approach for
analyzing resource monitoring data.
• Performance testers/engineers use this presentation as a starting point to
learn more about in-depth monitoring.
• We can gaze upon more dashboards with metrics from eBPF or following
the USE method.

Monitoring – Let's recap
• Use USE for analyzing resources.
• Begin by analyzing the Utilization of the resources.
• Go deeper by checking the Saturation metrics.
• When available, check the Error metrics for the resources.
• Use BCC tools to analyze even more metrics.
• When analyzing monitoring data, keep Yoda in mind.
Thank you!
