Twan Koot - Beyond the % usage, an in-depth look into monitoring
Since its beginning, the Performance Advisory Council has aimed to promote engagement between various experts from around the world and to create relevant, value-added content sharing between members, and, for Neotys, to strengthen our position as a thought leader in load and performance testing. During this event, 12 participants convened in Chamonix (France) to explore several topics on the minds of today's performance testers, such as DevOps, Shift Left/Right, Test Automation, Blockchain and Artificial Intelligence.
Welcome, my name is Twan Koot. I'm 26 years old and a senior performance tester/engineer at Sogeti Netherlands. I have 5 years of IT experience and started my career as a tester at a small IT firm. Quite quickly I became far more interested in the technical aspects of testing than in the functional side of software. Starting with test automation and security testing, I found my drive in discovering and learning new technologies and technical skills.
After 2 years I joined Sogeti, where I quickly encountered performance testing, and I have been hooked ever since. I am currently active for multiple clients, implementing CI/CD performance pipelines and coaching and training junior performance testers.
I really love fast IT solutions on small hardware and really hate unfounded decisions on IT architecture. Apart from work, I love photography and riding my motorcycle.
Now that we have the introduction out of the way, we can focus on what's interesting. This presentation will mainly be about monitoring hardware resources; secondly, I will highlight a really interesting and handy method for analyzing performance issues. We will finish off with a selection of tools from the BCC collection to showcase some in-depth monitoring. After the presentation there will be a moment for Q&A.
All right, now we can really start. First, I'm going to talk about my current view on how many performance testers approach monitoring and analyzing resource usage.
So we have run our performance test, and we even had some cool monitoring running during the test. Now we can do what I call the three-step dance. First, we check CPU, RAM and IO: we grab our graphs and check whether any of our counters exceeded a threshold during the test. Next, we explain peaks in response times and check whether we can correlate them with the monitoring data. This analysis approach isn't based on any method or best practice; it's a natural approach because of how our brains work. Studies have shown that we are constantly looking for patterns to make better and faster decisions. Taking this into account, it is only logical that we look at two graphs and try to find a correlation between CPU % and response times.
Then again, in my opinion this is only the basics and doesn't deserve to be called an analysis.
We all know these awesome dashboards full of charts, packed with information and data about the performance test or live production. But what are we monitoring? Do we have a clear idea why these metrics were selected for the dashboard? In recent years, many APM tools and other convenient tools have gained popularity, but with them has come a lack of in-depth knowledge about what your fancy tool is showing you in its even fancier dashboard. It reminds me of the episode of How I Met Your Mother in which Marshall presents his favorite charts using charts; I find that a quite accurate picture of the state of many performance testers using monitoring tools. It's all "look at that cool pie chart" or "we need a fancy graph to show our CPU usage".
Using one specific tool will shift your focus away from the metrics that aren't available in that particular tool, creating a sort of tunnel vision on monitoring metrics.
In this presentation we go beyond these fancy dashboard metrics and show some cool features that will help you in analyzing performance issues.
First, we need a convenient way to help us analyze performance counters, so let's introduce a method for analyzing performance issues and results in a structured way. Let's use USE!
Let's start with, in my opinion, one of the gurus of performance engineering: Brendan Gregg. His posts and talks have triggered me to explore, learn and talk about the great possibilities that lower-level monitoring has to offer. Brendan introduced the USE method; he has described it on his personal website and in his book Systems Performance. That book is filled to the brim with knowledge every performance engineer should know at least something about. His very short bio reads: "Industry expert in computing performance and cloud computing. Solves hard problems. Makes things faster."
So, USE; in caps, by the way, because it's an abbreviation. The U stands for Utilization: the amount of usage of a particular resource or, put better, the average time a resource was busy servicing work. The S stands for Saturation: the amount of work which can't be handled and which is being queued. To me this is the most important of the three, particularly the queued part, because we all know that standing in line is real-life latency: waiting for the thing you need to get done. Like yesterday morning, when I was waiting in line for a coffee; it's time which isn't spent on anything useful. And then we come to the last part of USE, the E. It stands for Errors and has a really easy explanation: it is the number of error events.
So now we have three new angles from which we can analyze resource counters. In the table on the slide you can see USE applied to CPU, memory and storage device IO. Let's start with the first column, Utilization, one of the easiest metrics to monitor: for CPU we can use CPU utilization (%), for memory we can measure free memory, and for IO we can use device busy (%). These metrics are quite easy to measure and, at first glance, easy to understand. The second column is Saturation: for CPU we can measure run-queue length/scheduler latency, for memory we can look at swapping and paging counters, and for IO we can check the wait queue length. For Errors we can look at CPU errors such as ECC events or even faulty-CPU errors; on the memory side we can check failed malloc()s, and for IO, device errors are a good metric. Each column is labelled, from easy to hard. Because error counters are rarely available in a cloud environment, the next slides will go into detail on the first two columns only.
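As a minimal sketch, assuming a standard Linux machine with the procps, sysstat and util-linux packages installed, the Utilization and Saturation columns can already be covered with stock command-line tools:

    vmstat 1 5          # "r" column: run-queue length (CPU saturation); "si"/"so": pages swapped in/out (memory saturation)
    mpstat -P ALL 1 5   # per-CPU utilization (%)
    free -m             # free and available memory (memory utilization)
    iostat -dx 1 5      # per-device %util (IO utilization) and average request queue size (IO saturation)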
We now have a good understanding of what USE is about, and then comes the time to apply it during an analysis. There are a couple of benefits to using USE: it guides you to methodically analyze resource counters and performance issues. To apply USE we can follow a pretty straightforward flow, which guides you through the process of resource monitoring analysis. The flow starts with selecting a resource; you can usually create your own top five to start with. Then you check the error counters first, because a faulty memory bank is an easy and quick performance win. Then you check utilization and, after that, saturation. As you can see, the flow is pretty self-explanatory.
Now I will go in depth into the metrics which were shown two slides back.
To check some of the metrics used in the USE method, this overview shows a functional diagram of a system with all the hardware components. Each component has one or more tools which can be used to gather metrics on it. As you can see, the number of tools is overwhelming. With all these tools we can measure a lot and go really deep into checking which component may be the bottleneck in a performance issue.
Again, as I have mentioned, queuing means latency. But how much saturation, or measured queueing, is a problem? When does queuing add a significant amount of time to the total transaction time of your web call? We can measure this too, with some handy tools and kernel features. In comes the next part of this presentation.
Runqlat is ideal to combine with the run queue: it translates the queue into real latency. We can even narrow it down to a specific program, so we can measure the exact run queue latency for that program. This way we can easily spot whether we have a bottleneck on the scheduler/CPU and measure it in an understandable metric which is easily compared.
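A minimal sketch, assuming the BCC tools are installed (on Debian/Ubuntu the package is bpfcc-tools and the tool is named runqlat-bpfcc):

    sudo runqlat 1 5   # five 1-second histograms of run queue (scheduler) latency, in microseconds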
The same applies to IO: we have measured the disk queue, but we don't have an idea how much latency a queue of 43 requests actually means. To measure the latency we can use Biolatency. This allows us to measure the number of calls and how long the latency was, and we can combine this with the disk queue to get a good idea of when a disk queue becomes a problem and how much it contributes to an increase in response time, for instance.
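A sketch, under the same BCC assumption:

    sudo biolatency 1 5      # histograms of block device I/O latency, printed every second
    sudo biolatency -Q 1 5   # -Q also counts time spent queued in the OS, making the queueing cost visible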
We can also use Filetop to check all file reads and writes issued by certain programs. This helps in characterizing the workload: we can see which files are touched, how many times, and by which program.
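Again a sketch, assuming BCC is installed:

    sudo filetop 1   # top-like view refreshed every second: reads/writes per file, with the owning process (COMM/PID)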
Beyond the % usage, an in-depth look into monitoring
• Twan Koot, 26 years old.
• Senior Performance Tester / Engineer @ Sogeti Netherlands.
• 5 years of IT experience.
• Loves: fast IT solutions on small hardware.
• Hates: unfounded decisions on IT architecture.
• Apart from work, I love photography and riding my motorcycle.
Beyond the % usage, an in-depth look into monitoring.
• A method for analyzing resource monitoring.
• Showcasing some monitoring metrics.
• An introduction to BCC tools.
Monitoring – The basics – Analyze
• So, you have run a performance test and even had monitoring running.
• Now we can do the basic three-step dance.
• Check CPU, RAM and IO.
• Check if any counter exceeds a static threshold like % usage.
• Check whether peaks in resource usage overlap peaks in response times.
Monitoring – USE – Brendan Gregg
• The USE method enables a methodical approach to analyzing resources.
• It was developed by Brendan Gregg: http://www.brendangregg.com
"Industry expert in computing performance and cloud computing. Solves
hard problems. Makes things faster."
Monitoring – USE – The USE method
• Utilization – Average time a resource was busy servicing work.
• Saturation – The degree of work which can't be handled, and which is being queued.
• Errors – The number of error events.
Resource           | Utilization (easy)    | Saturation (moderate)                | Errors (hard)
CPU                | CPU utilization (%)   | Run-queue length / scheduler latency | Cache ECC events or faulty CPU errors
Memory             | Available free memory | Anonymous paging or swapping         | Failed malloc()s
Storage device I/O | Device busy %         | Wait queue length                    | Device errors
Monitoring – USE – The flow
• How do we apply the USE method?
• We can use the following flow: select a resource, check its Error counters, then its Utilization, then its Saturation.
Monitoring – Let's go deeper – CPU Utilisation
• One of the most measured metrics during a performance test.
• What does 80% utilization even mean?
• Overloaded?
One of the most misread metrics!
Monitoring – Let's go deeper – CPU Utilisation
• When checking the utilization counter we may observe:
• What is happening:
• What are we waiting for?
• So, is CPU utilization wrong? It's still a good starting point to begin monitoring.
Busy | Waiting "stalled" | Waiting "idle"
Monitoring – Let's go deeper – CPU Saturation
• CPU saturation -> run queue
• Using Nmon's "k" view.
• We see a run queue of 9; this will cause latency.
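As a non-interactive alternative to nmon's "k" view (my suggestion, not from the original slide), vmstat exposes the same counter:

    vmstat 1 5   # first column "r": runnable processes; a sustained "r" above the CPU count suggests saturation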
Monitoring – Let's go deeper – Memory Utilisation
• Using Nmon's "m" view.
Monitoring – Let's go deeper – Memory Saturation
• Using Nmon's "m" and "V" views.
• We can see lots of page activity and heavy usage of swap space.
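A quick non-interactive check of the same signals (an alternative I am suggesting, not part of the slide):

    vmstat 1 5      # "si"/"so": pages swapped in/out per second; sustained non-zero values signal memory pressure
    swapon --show   # which swap devices exist and how much of each is in use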
Monitoring – Let's go deeper – IO Utilisation
• Using Nmon “d”
• We can see multiple counters for measuring IO utilization.
• We can measure the amount of data read from and written to the disk and compare this with what the device can handle.
• Reading: 3726.2 transfers/sec.
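A sketch of gathering these counters with iostat (sysstat assumed installed):

    iostat -dx 1 5   # per device: r/s and w/s (transfers/sec), rkB/s and wkB/s, and %util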
Monitoring – Let's go deeper – IO Saturation
• Using iostat -d to filter to a specific disk.
• We can see we had a queue of ~43 requests (1-second interval).
• Queueing means latency.
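A sketch of reading the queue length for one disk; "sda" is a placeholder device name:

    iostat -dx sda 1 5   # "aqu-sz" ("avgqu-sz" on older sysstat versions): average request queue length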
Monitoring – Let's go deeper – eBPF
• ‘Extended’ Berkeley Packet Filter.
• An in-kernel virtual machine that runs small, verified programs.
• Gives access to many new metrics about the kernel, performance and the scheduler.
• Some use-cases:
• Deep performance analysis
• Network tracing
• DDOS mitigation/detection
Monitoring – Let's go deeper – BCC
BPF Compiler Collection (BCC)
“BCC is a toolkit for creating efficient kernel tracing and manipulation
programs and includes several useful tools and examples.”
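For orientation, a sketch of where the tools typically live; exact package names and paths vary per distribution:

    ls /usr/share/bcc/tools        # common install location on many distributions
    sudo apt install bpfcc-tools   # Debian/Ubuntu package; tools get a -bpfcc suffix, e.g. runqlat-bpfcc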
Monitoring – BCC – CPU saturation / Runqlat
• Runqlat is used to measure scheduler latency.
• We can even filter to a specific PID:
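A sketch of the PID-filtered form; 1234 is a placeholder PID:

    sudo runqlat -p 1234 1 10   # ten 1-second run queue latency histograms for that process only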
Monitoring – BCC – Cachestat / Cachetop
• With Cachestat we can observe page cache statistics.
• Cachetop shows the same statistics with per-process information.
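A sketch, assuming BCC is installed:

    sudo cachestat 1   # page cache hits, misses and hit ratio, once per second
    sudo cachetop 1    # top-like, per-process view of the same cache statistics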
Monitoring – BCC – Biolatency / Filetop
• BCC contains powerful tools such as Biolatency, which measures disk I/O latency:
• Using Filetop we can observe metrics about file activity.
Monitoring – BCC – Fileslower / Filelife
• We can measure file reads and writes slower than a threshold using Fileslower:
• We can also trace files from creation to deletion, and thus their lifespan, using Filelife:
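A sketch of both tools, assuming BCC is installed:

    sudo fileslower 10   # print synchronous file reads/writes slower than 10 ms
    sudo filelife        # trace file creation and deletion, printing each file's lifespan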
Monitoring – BCC – TCPlife
• Used for tracing TCP sessions that open and close.
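A sketch, assuming BCC is installed:

    sudo tcplife   # one line per TCP session at close: PID, addresses, ports, bytes and duration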
Monitoring – Hopes for after this showcase
• More performance engineers start using a methodical approach for analyzing resource monitoring.
• Performance testers/engineers use this presentation as a starting point to learn more about in-depth monitoring.
• We can gaze upon more dashboards with metrics from eBPF or following the USE method.
• Use USE for analyzing resources.
• Begin by analyzing the Utilization of the resources.
• Go deeper by checking the Saturation metrics.
• When available, check the Error metrics for the resources.
• Use BCC tools to analyze even more metrics.
• When analyzing monitoring data keep Yoda in mind.