2. What is CPU Steal Time?
Steal time is the percentage of time a
virtual CPU waits for a real CPU
while the hypervisor is servicing
another virtual processor.
In a virtualized environment, your
virtual machine (VM) shares
resources with other instances on
a single host.
3. Did you know?
• Netflix tracks CPU Steal Time closely. In fact, if steal time
exceeds their chosen threshold, they shut down the virtual
machine and restart on a different physical server.
4. Where can I check my CPU steal time?
• When you run the Linux top command, you'll see a
real-time view of key performance metrics. One of the lines
is for the CPU:
5. Two metrics you might have some experience with already are %id
(percent idle) and %wa (percent I/O wait).
If %id is low, the CPU is working hard and doesn't have much excess
capacity.
If %wa is high, the CPU is ready to run, but is waiting on I/O access to
complete (like fetching rows from a database table stored on the disk).
%st, or percent steal time, is the last CPU metric displayed.
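The same figure can be derived without top. Here is a minimal Python sketch, assuming a Linux guest where the first line of /proc/stat holds the aggregate CPU counters in the order user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice. The function names are illustrative, and the result may differ slightly from what top prints.

```python
# Field order in the aggregate "cpu" line of /proc/stat (Linux):
# user nice system idle iowait irq softirq steal guest guest_nice
STEAL_INDEX = 7


def parse_cpu_line(line):
    """Return the numeric counters from a 'cpu ...' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def steal_percent(before, after):
    """Percent of elapsed CPU time reported as steal between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    if len(deltas) <= STEAL_INDEX:
        return 0.0  # kernel too old to report steal
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight counters are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[STEAL_INDEX] / total


def read_cpu_counters(path="/proc/stat"):
    """Read the aggregate CPU counters (Linux only)."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())
```

To use it, call read_cpu_counters() twice a few seconds apart and pass both samples to steal_percent(). The counters are cumulative since boot, so only the difference between two samples is meaningful.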
6. CPU Steal Time - the ticket booth analogy
• You've purchased tickets to the latest Hollywood
blockbuster. There are two lines and one ticket booth:
7. If we applied a CPU steal time-like metric to the ticketing process, it would look like this:
• 0% Steal Time - It's a Wednesday matinee: the ticket booth is picking a moviegoer
from line 1, then line 2, then line 1, then line 2, and so on. No one is waiting.
• 50% Steal Time - It's Friday night: instead of being able to purchase a ticket
immediately, half of the time a person in the line needs to wait for the person at the
booth to complete their purchase. Things are taking longer.
• 100% Steal Time - It's a Friday night and the cash register is broken: no one is
moving.
8. Why is high steal time particularly bad for web apps?
If you have a long-running background computational task on an underutilized physical server, it may get access to
more than its share of CPU cycles for a while.
Later on, the other VMs need their share of CPU cycles, so the long-running task will run slower.
This might not be a deal-breaker for a long-running task: it might take a bit longer or it might even finish faster (since it was
able to use more resources earlier).
However, for web apps, this can bring things to a halt. For tasks that need to be performed in real time, like rapidly serving
many web requests, a 4x decrease in performance can cause major backups in request queues, which can lead to outages.
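A back-of-the-envelope sketch of why queues back up: once requests arrive faster than the VM can serve them, the backlog grows every second. The traffic numbers below are hypothetical and chosen only for illustration.

```python
def backlog_after(seconds, arrival_rate, service_rate):
    """Requests left waiting after `seconds`, starting from an empty queue."""
    growth_per_second = max(0.0, arrival_rate - service_rate)
    return growth_per_second * seconds


# Hypothetical numbers: 100 requests/s arrive, the VM normally handles 150/s.
normal = backlog_after(60, arrival_rate=100, service_rate=150)

# Same traffic after a 4x slowdown: capacity falls to 37.5 requests/s.
degraded = backlog_after(60, arrival_rate=100, service_rate=150 / 4)

print(normal, degraded)
```

With spare capacity the queue stays empty. After the slowdown, 62.5 requests pile up each second, which is 3,750 waiting requests after one minute.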
9. What if steal time is well above zero?
There are two possible causes:
1. You need a larger VM with more CPU resources (you are the problem).
2. The physical server is oversold and the virtual machines are aggressively
competing for resources (you are not the problem).
10. The catch:
• You can't tell which case your situation falls under
by just watching the impacted instance's CPU
metrics.
• This is easiest to tell when you have multiple, identical
servers performing the same roles, each residing on a
different host:
11. • Has %st (CPU Steal Time Percentage) increased on every virtual
server? This means your virtual machines are using more CPU.
You need to increase the CPU resources for your VMs.
• Has %st (CPU Steal Time Percentage) increased dramatically on
only a subset of servers? This means the physical servers may
be oversold. Move the VM to another physical server.
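The comparison above can be sketched in Python. This assumes you already collect a %st reading per server. The function name, the server names, and the 5-point min_increase cutoff are all hypothetical.

```python
def diagnose_fleet(baseline, current, min_increase=5.0):
    """Compare per-server %st readings against a baseline.

    baseline, current: dicts mapping server name -> %st.
    min_increase: assumed cutoff (percentage points) for a real increase.
    Returns (verdict, affected_servers).
    """
    affected = sorted(
        name for name in current
        if current[name] - baseline.get(name, 0.0) >= min_increase
    )
    if not affected:
        return "no significant change", affected
    if len(affected) == len(current):
        # Every VM is affected: the VMs themselves are using more CPU.
        return "all servers: increase CPU resources for the VMs", affected
    # Only some VMs are affected: their hosts may be oversold.
    return "subset: hosts may be oversold, move these VMs", affected


baseline = {"web1": 1.0, "web2": 1.0, "web3": 1.0}
print(diagnose_fleet(baseline, {"web1": 1.5, "web2": 25.0, "web3": 1.0}))
```

This only works when the servers are identical and share the same role, so that they would normally show similar steal time.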
13. So, when should you be worried?
• A general rule of thumb - if steal time is greater than 10%
for 20 minutes, the VM is likely running slower than it
should.
• When this happens:
➡ Shut down the instance and move it to another physical
server
➡ If steal time remains high, increase the CPU resources
➡ If steal time remains high, contact your hosting provider.
Your host may be overselling physical servers.
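The rule of thumb can be sketched as a check over recent samples. This assumes %st is sampled at a fixed interval. The one-minute interval and the function name are hypothetical, and the 10% and 20-minute defaults come from the rule above.

```python
def sustained_high_steal(samples, threshold=10.0,
                         window_minutes=20, interval_minutes=1):
    """True if every %st sample in the most recent window exceeds threshold.

    samples: %st readings, oldest first, one per interval_minutes.
    """
    needed = max(1, window_minutes // interval_minutes)
    if len(samples) < needed:
        return False  # not enough history to cover the window yet
    return all(value > threshold for value in samples[-needed:])


# Twenty one-minute samples, all above 10%.
print(sustained_high_steal([12.0] * 20))
```

A single dip to or below the threshold resets the verdict, which keeps short spikes from triggering a migration.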
14. TL;DR
• In a virtual environment, CPU cycles are shared across
virtual machines on the server.
• If your VM displays a high %st (steal time) in top, this means
CPU cycles are being taken away from your VM to serve
other purposes.
• You may be using more than your share of CPU resources
or the physical server may be oversold. Move the VM to
another physical server.
• If steal time remains high, try giving the VM more CPU
resources.
15. More servers? Or faster code?
Adding servers can be a band-aid for slow code. Monitoring helps you
find and fix your inefficient and costly code.