2. The Next 90 Minutes
Introduction
Servers, a mental model
Getting hands on
Processes
Wrapping it up
3. Caveats
Tutorial aimed at people barely familiar
with Linux consoles
Little server knowledge is assumed
Many advanced things are glossed over
...but feel free to ask!
The slides will be available online
4. Your Presenter
Mark Smith <mark@dreamwidth.org>
Co-founded Dreamwidth Studios, but
works at Bump Technologies
(http://bu.mp/)
Spent time at Google, Mozilla, others
Sysadmin, MySQL DBA, engineer, ...
6. Servers
Machines that take input and make output
Made up of components: RAM, CPU, I/O
Each component has various capacities
Systems Administration: the
understanding, care, and feeding of all
these disparate components (among other
things)
8. RAM
Capacity measured in bytes (GB usually)
Latency measured in nanoseconds
Throughput measured in bytes/second
Full state: can’t add more, but no real loss
of performance
Thrash state: not very relevant
9. Disk (Rotational)
Capacity measured in bytes (GB or TB)
Latency measured in milliseconds
Throughput measured in bytes/second
Full state: can’t add more, but otherwise
fine
Thrash state: server and process
starvation, performance drops drastically
10. Disk (SSD)
Capacity measured in bytes (GB or TB)
Latency measured in microseconds (about
60x faster than rotational disks)
Throughput measured in bytes/second
Full state: can’t add more, but otherwise
fine
Thrash state: obviated by lack of rotation
11. CPU
Capacity measured in clock cycles per
second, also known as hertz (MHz, GHz,
etc.)
Throughput and latency of a CPU are very
advanced things most sysadmins don’t
need to worry about (e.g., optimizing for L1
cache and local RAM in NUMA systems)
Full/thrash state: system/process
starvation
12. Network
Capacity not relevant
Latency measured in milliseconds (usually)
Throughput measured in bits/second and
usually 1 Gbps (10 Gbps becoming
common)
Full state: dropped packets, behavior
depends on protocol (e.g., TCP or UDP)
Thrash state: not relevant
13. Timing Comparisons
1 second - tick, tock, tick, tock, ...
1,000 milliseconds (ms) per second
1,000,000 microseconds (µs) per second
1,000,000,000 nanoseconds (ns) per
second
14. Timing (Part 2)
One seek on a rotational disk is ~6ms
SSD seeks are about 100µs: 60x faster
than a rotational seek
RAM seeks are about 60ns: 1,666x faster
than an SSD seek (100,000x faster than a
rotational seek!)
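The ratios above can be checked with a little shell arithmetic. This is a minimal sketch: the variable names and the speedup helper are made up for illustration, and the latencies are the rough figures from this slide, converted to nanoseconds.

```shell
# Express each latency from the slide in nanoseconds so they share one unit.
rotational_ns=6000000   # ~6 ms
ssd_ns=100000           # ~100 µs
ram_ns=60               # ~60 ns

# speedup SLOW FAST: how many times faster FAST is (integer division).
speedup() {
    echo $(( $1 / $2 ))
}

echo "SSD vs rotational: $(speedup "$rotational_ns" "$ssd_ns")x"
echo "RAM vs SSD:        $(speedup "$ssd_ns" "$ram_ns")x"
echo "RAM vs rotational: $(speedup "$rotational_ns" "$ram_ns")x"
```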
16. SSH to the VM
Open your local terminal (PuTTY in
Windows, iTerm/Terminal/etc in Mac OS
X, whatever you like in Linux)
ssh -p 2222 demo@182.255.123.52
Password is “demo”
Please be nice :)
17. It’s dark in here.
Heartbeat the machine
uptime    How’s it doing?
free -m   How’s the RAM?
df -h     How’re the disks?
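Since these checks get repeated for every exhibit, it may help to wrap them in one helper. The sketch below is only a convenience; run_checks is a hypothetical name, not a standard tool.

```shell
# run_checks CMD...: run each command in turn under a labelled header.
# (run_checks is a made-up helper name for this sketch.)
run_checks() {
    for cmd in "$@"; do
        printf '== %s ==\n' "$cmd"
        sh -c "$cmd"
    done
}
```

On the VM this could be called as: run_checks uptime 'free -m' 'df -h'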
18. Load Average
It’s a seat-of-the-pants number
Rule of thumb: low is good, high might be
bad
You have to learn how your machines
work for this number to mean much
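One common starting point (an assumption here, not something these slides prescribe) is to compare the load average with the number of CPUs. The helper name load_per_cpu is hypothetical.

```shell
# load_per_cpu LOAD NCPU: print the load average divided by the CPU count.
# Treat the result as a hint only; what counts as "normal" depends on the machine.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}
```

On a Linux box the inputs could come from /proc/loadavg and nproc, for example: load_per_cpu "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"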
19. Top of the World
Easy way to see what’s running and what
is consuming the most resources
top
Press “P” to sort by Processor usage
Press “M” to sort by Memory usage
20. Exhibit #1
Now I will do something on the machine
Run through your heartbeat steps again:
uptime, free -m, df -h, top
Remember to sort top by P and M
What has changed? What is going on?
21. Results #1
You probably noticed 1-cpu.pl
It’s pushing the CPU to 100%
Is it broken? Is this bad?
Know your software and systems (very
important to know what normal is)
22. Exhibit #2
Now I will do something else
Run through your heartbeat steps again:
uptime, free -m, df -h, top
Remember to sort top by P and M
What has changed? What is going on?
23. Results #2
Lots of memory is being consumed
It’s some 2-memory.pl command
Does the machine feel sluggish? Does each
command take a second to start and
stop?
What is going on here?
24. vmstat
The vmstat tool tells us useful things
about the state of the kernel and resource
usage
Try: vmstat -SM 1
Watch while I run the test again
Note the si/so and bi/bo columns
Now notice the CPU columns on the right
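To make the si/so columns stand out, the vmstat output can be filtered. This sketch assumes the usual procps column order, where si and so are the 7th and 8th fields; swapping_samples is a hypothetical helper name.

```shell
# swapping_samples: read vmstat output on stdin and print only the samples
# where pages were swapped in (si) or out (so).
# Assumes si is field 7 and so is field 8, as in the usual procps layout.
swapping_samples() {
    awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) { print "swapping: si=" $7 " so=" $8 }'
}
```

Usage might look like: vmstat -SM 1 | swapping_samples. Note that the first sample vmstat prints is an average since boot, so it can show swap activity that is not happening right now.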
25. Swap
RAM is a finite resource
Not all RAM is used equally
Kernel tracks usage of pages
Kernel can write RAM to disk and free it up
This is called swapping: you store RAM on
disk. Remember the timing slide!
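How much RAM is currently stored on disk can be read from /proc/meminfo. A minimal sketch, assuming the SwapTotal and SwapFree fields that Linux reports there; swap_used_kb is a hypothetical helper name.

```shell
# swap_used_kb: read /proc/meminfo-style text on stdin and print the
# amount of swap in use, in kB (SwapTotal minus SwapFree).
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END { print total - free }'
}
```

On the VM: swap_used_kb < /proc/meminfo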
26. Swap (Part 2)
Swap is useful mostly on consumer
machines
In most server environments, swap is
death
Disks are hundreds to thousands of times
(or more!) slower than RAM
Generally, any active swapping is bad
27. Exhibit #3
Try uptime, free -m, df -h, top again
Also, try: iostat -kx 1
Watch the %util column as this test runs
Also the bi/bo columns in vmstat
What is going on here?
28. Results #3
Disk usage is high
RAM is not full
CPU is not pegged
Machine responds well
Disk utilization at 100%
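The %util reading can be picked out of the iostat output mechanically. This is a sketch under assumptions: it locates the %util column from the "Device" header line rather than hard-coding a position, and busy_disks is a hypothetical helper name with an arbitrary default threshold of 90.

```shell
# busy_disks [LIMIT]: read "iostat -kx" output on stdin and print each
# device whose %util is at or above LIMIT (default 90, an arbitrary choice).
busy_disks() {
    awk -v limit="${1:-90}" '
        # Header row: remember which column holds %util.
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        # Device rows: long enough to have that column and over the limit.
        col && NF >= col && $col + 0 >= limit { print $1, $col }
    '
}
```

Usage might look like: iostat -kx 1 | busy_disks 90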
29. What does it mean?
Based on the various data you’ve
gathered, is the machine healthy and
happy with this program running on it?
Why or why not?
Discussion.
30. Solutions?
This program is using more CPU, RAM, or
disk I/O than the machine has available
Program can be optimized to use less
Machine can be upgraded to have more
Simple problem, straightforward solutions
(Straightforward does not always mean
easy)
32. Programs
Software that runs on a machine
Has traits such as single- or
multi-threaded, compiled or interpreted, etc.
Requires certain resources and inputs
Makes certain outputs
33. More Constraints
Programs have more constraints to
consider
Open files and sockets (file descriptors)
Permissions (depend on user/group)
CPU limits (depend on threads)
34. Exhibit #4
There’s a program running now, but
something is wrong with it
Use the usual tools (uptime, free -m,
df -h, top)
System looks OK...
35. File Limits
Programs have certain limits
Get the PID of the 4-files.pl program
ps aufx | grep 4-files
cat /proc/PID/limits
36. lsof
See what files a program has open
lsof -np PID
Whoa, lots! At the limit? Count them:
lsof -np PID | wc -l
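To compare the count against the limit directly, the soft limit can be pulled out of the limits file. A sketch with hypothetical helper names (soft_nofile, fd_headroom), assuming the usual /proc/PID/limits layout where the soft limit is the fourth field of the "Max open files" row.

```shell
# soft_nofile: read /proc/PID/limits-style text on stdin and print the
# soft "Max open files" limit (4th field of that row).
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# fd_headroom OPEN LIMIT: print how many more descriptors could be opened.
fd_headroom() {
    echo $(( $2 - $1 ))
}
```

Usage might look like: fd_headroom "$(ls /proc/PID/fd | wc -l)" "$(soft_nofile < /proc/PID/limits)". Counting entries in /proc/PID/fd is used here because lsof output also includes a header line and non-descriptor entries (cwd, txt, mem), so its line count runs a little high.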
37. But... a problem?
But is this a problem? Well, it is if the
program is trying to open more files
How do we tell?
Software calls open, which is a system
call
38. System Calls
The kernel provides certain services
Almost all I/O goes through the kernel
Current time, fork, chdir (cd), exec, etc.
Requires a small context switch
Can lead to “sys” CPU usage
39. strace
System calls made by a process can be
traced
Let’s look at 4-files again:
sudo strace -p PID
Look at the “open” line, is it OK?
40. Results #4
Clearly this program is broken
Several fixes... open fewer files, raise your
limits, etc
(We won’t cover the specifics of raising
limits, you can search Google if you need
it)
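If the trace is saved to a file, failed opens can be counted instead of eyeballed. This sketch assumes the failures show up as EMFILE ("Too many open files"), which is what a process at its file limit normally gets back; count_emfile is a hypothetical helper name.

```shell
# count_emfile: read strace output on stdin and print how many traced
# calls failed with EMFILE (the "too many open files" error).
count_emfile() {
    awk '/EMFILE/ { n++ } END { print n + 0 }'
}
```

For example, capture a few seconds with sudo strace -p PID -o trace.txt, stop it with Ctrl-C, then run: count_emfile < trace.txt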
41. It’s all turtles.
Linux uses “files” and “filesystems” a lot
Sockets are just “files”, they use the same
file descriptor number space
Result: “Max open files” includes sockets
They show up in lsof, too!
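The shared number space is visible in /proc as well: each entry in /proc/PID/fd is a descriptor number, and sockets appear there as links to socket:[inode]. A small sketch, with list_sockets as a hypothetical helper name, assuming the usual "ls -l" layout where the link target is the last field.

```shell
# list_sockets: read "ls -l /proc/PID/fd" output on stdin and print the
# descriptor numbers that point at sockets rather than regular files.
list_sockets() {
    awk '$NF ~ /^socket:\[/ { print $(NF - 2) }'
}
```

Usage might look like: ls -l /proc/PID/fd | list_sockets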
42. Exhibit #5
Let me give us a new program
Get the PID, remember how?
ps aufx | grep 5-network
Look at the files: lsof -np PID
Note the “TCP” file!
43. Test the Server
telnet 182.255.123.52 7000
(This server is slow, it might take a bit)
A very simple timeserver
Now: sudo strace -p PID
53. Results #5
Tracing shows you data, too
Can be very valuable for finding moving
parts that aren’t moving well
Combined with the other tools you can
really see what is going on in your system
55. Invisible Glue
Kernel issues are fairly rare, but usually
frustrating if they show up
Usually the result of some sort of limit hit
Tons of caches, buckets, and limits
Be suspicious of “powers of two” numbers
56. Common Checks
Try: sudo dmesg
Kernel message log shows many problems
Look for suspicious messages
57. “Suspicious”
Out of memory: Kill process 19393 (2-memory.pl) score 90 or sacrifice child
nf_conntrack: Table full, dropping packet
ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
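Messages like these can be pulled out of a long kernel log with a simple filter. The patterns below are just the three examples from this slide, not a complete list, and suspicious is a hypothetical helper name.

```shell
# suspicious: read kernel log text on stdin and keep only lines matching
# the example messages from this slide (out of memory, conntrack table
# full, ATA exception). Extend the pattern list to suit your machines.
suspicious() {
    grep -E 'Out of memory|nf_conntrack: Table full|exception Emask'
}
```

Usage might look like: sudo dmesg | suspicious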
58. More Places to Look
The /var/log directory has much data
When something is wrong, look for
recently updated files: ls -lart
Loud logs are often unhappy logs
Hardware failure is often noted in one of
the log files
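As an alternative to scanning the ls -lart listing by eye, find can list only the files touched in the last few minutes. A sketch, assuming a find that supports -mmin (GNU and BSD find do); recent_files is a hypothetical helper name and the 10-minute default is arbitrary.

```shell
# recent_files [DIR] [MINUTES]: list regular files under DIR (default
# /var/log) modified less than MINUTES minutes ago (default 10).
recent_files() {
    find "${1:-/var/log}" -type f -mmin "-${2:-10}"
}
```

Usage might look like: sudo sh -c 'find /var/log -type f -mmin -10', or simply recent_files /var/log 10 if your user can read the directory.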
60. Process
Check the components: CPU, RAM, disks
Find what limits are being hit and by what
If the system is fine, it’s probably software
Trace the program, check the logs
Analyze well before you fix
61. Familiarity!
Systems administration done only as an
afterthought will be painful and hard
Be familiar with your servers and your
software
Keep a shell open, watch top throughout
the day, watch the disks, etc
62. Next Steps
Certain tools make life easier
Nagios for monitoring (e.g., alert you when
CPU exceeds 90%)
Cacti/Ganglia/OpenTSDB for trending
Fabric for multiple machine operations
Puppet/Chef for configuration
management
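The Nagios example above boils down to a threshold check that reports through its exit status. The sketch below follows the usual Nagios plugin convention (exit 0 for OK, 2 for CRITICAL); check_cpu is a hypothetical name, and where the CPU percentage comes from is left open.

```shell
# check_cpu USED [CRIT]: print a status line and return a Nagios-style
# exit code: 0 (OK) if USED is at or below CRIT, 2 (CRITICAL) otherwise.
# CRIT defaults to 90, matching the example on the slide.
check_cpu() {
    awk -v used="$1" -v crit="${2:-90}" 'BEGIN {
        if (used + 0 > crit + 0) { print "CRITICAL - CPU at " used "%"; exit 2 }
        print "OK - CPU at " used "%"
        exit 0
    }'
}
```

For example, check_cpu 95 prints a CRITICAL line and returns 2, which is what a monitoring system would turn into an alert.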