Servers and Processes: Behavior and Analysis

1. Servers and Processes Behavior and Analysis

2. The Next 90 Minutes Introduction Servers, a mental model Getting hands on Processes Wrapping it up

3. Caveats Tutorial aimed at people barely familiar with Linux consoles Little server knowledge is assumed Many advanced things are glossed over ...but feel free to ask! The slides will be available online

4. Your Presenter Mark Smith <mark@dreamwidth.org> Co-founded Dreamwidth Studios, but works at Bump Technologies (http://bu.mp/) Spent time at Google, Mozilla, others Sysadmin, MySQL DBA, engineer, ...

5. Servers

6. Servers Machines that take input and make output Made up of components: RAM, CPU, I/O Each component has various capacities Systems Administration: the understanding, care, and feeding of all these disparate components (among other things)

7. Components Capacity Latency Throughput Full state Thrash state

8. RAM Capacity measured in bytes (GB usually) Latency measured in nanoseconds Throughput measured in bytes/second Full state: can’t add more, but no real loss of performance Thrash state: not very relevant

9. Disk (Rotational) Capacity measured in bytes (GB or TB) Latency measured in milliseconds Throughput measured in bytes/second Full state: can’t add more, but otherwise fine Thrash state: server and process starvation, performance drops drastically

10. Disk (SSD) Capacity measured in bytes (GB or TB) Latency measured in microseconds (roughly 60x faster than rotational disks) Throughput measured in bytes/second Full state: can’t add more, but otherwise fine Thrash state: obviated by lack of rotation

11. CPU Capacity measured in clock cycles per second, also known as hertz (MHz, GHz, etc.) Throughput and latency of a CPU are very advanced things most sysadmins don’t need to worry about (e.g., optimizing for L1 cache and local RAM in NUMA systems) Full/thrash state: system/process starvation

12. Network Capacity not relevant Latency measured in milliseconds (usually) Throughput measured in bits/second, typically 1 Gbps (10 Gbps becoming common) Full state: dropped packets, behavior depends on protocol (e.g., TCP or UDP) Thrash state: not relevant

13. Timing Comparisons 1 second - tick, tock, tick, tock, ... 1,000 milliseconds (ms) per second 1,000,000 microseconds (µs) per second 1,000,000,000 nanoseconds (ns) per second

14. Timing (Part 2) One seek on a rotational disk is ~6ms SSD seeks are about 100µs: 60x faster than a rotational seek RAM seeks are about 60ns: 1,666x faster than an SSD seek (100,000x faster than a rotational seek!)
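The ratios on this slide fall out directly once all three latencies are in the same unit. A quick check using shell integer arithmetic (the variable names are only for illustration):

```shell
# Latencies from the slide, all converted to nanoseconds.
rotational_ns=6000000   # ~6 ms per rotational seek
ssd_ns=100000           # ~100 µs per SSD seek
ram_ns=60               # ~60 ns per RAM access

# Shell arithmetic is integer-only, so results are truncated.
echo "SSD vs rotational: $((rotational_ns / ssd_ns))x"
echo "RAM vs SSD:        $((ssd_ns / ram_ns))x"
echo "RAM vs rotational: $((rotational_ns / ram_ns))x"
```

This prints 60x, 1666x, and 100000x, matching the figures above.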

15. Hands On Time!

16. SSH to the VM Open your local terminal (PuTTY in Windows, iTerm/Terminal/etc in Mac OS X, whatever you like in Linux) ssh -p 2222 demo@182.255.123.52 Password is “demo” Please be nice :)

17. It’s dark in here. Heartbeat the machine: uptime (How’s it doing?), free -m (How’s the RAM?), df -h (How’re the disks?)

18. Load Average It’s a seat-of-the-pants number Rule of thumb: low is good, high might be bad You have to learn how your machines work for this number to mean much
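One common way to give the load average some context is to divide it by the number of CPU cores; whether a given ratio is actually a problem still depends on the machine and its workload. A minimal sketch, where the function name `load_per_core` is made up for illustration:

```shell
# load_per_core LOAD CORES: print the load average divided by the core count.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Example with fixed numbers: a 1-minute load of 3.0 on a 4-core machine.
load_per_core 3.0 4    # prints 0.75
```

On a Linux machine the live values could be fed in with `load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`, assuming `nproc` from coreutils is installed.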

19. Top of the World Easy way to see what’s running and what is consuming the most resources: top Press “P” to sort by Processor usage Press “M” to sort by Memory usage

20. Exhibit #1 Now I will do something on the machine Run through your heartbeat steps again: uptime, free -m, df -h, top Remember to sort top by P and M What has changed? What is going on?

21. Results #1 You probably noticed 1-cpu.pl It’s pushing the CPU to 100% Is it broken? Is this bad? Know your software and systems (very important to know what normal is)

22. Exhibit #2 Now I will do something else Run through your heartbeat steps again: uptime, free -m, df -h, top Remember to sort top by P and M What has changed? What is going on?

23. Results #2 Lots of memory is being consumed It’s some 2-memory.pl command Does the machine feel sluggish? Each command takes a second to start and stop? What is going on here?

24. vmstat The vmstat tool tells us useful things about the state of the kernel and resource usage Try: vmstat -SM 1 Watch while I run the test again Note the si/so (swap in/out) and bi/bo (blocks in/out) columns Now notice the CPU columns on the right

25. Swap RAM is a finite resource Not all RAM is used equally Kernel tracks usage of pages Kernel can write RAM to disk and free it up This is called swapping: you store RAM on disk. Remember the timing slide!

26. Swap (Part 2) Swap is useful mostly on consumer machines In most server environments, swap is death Disks are hundreds to thousands of times (or more!) slower than RAM Generally, any active swapping is bad
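Since any active swapping is a warning sign, it can help to filter vmstat output down to just the samples where si or so is non-zero. The sketch below assumes vmstat's default column layout (si in column 7, so in column 8); the function name and the sample rows are invented for illustration.

```shell
# swapping_samples: read vmstat output on stdin and print "si so" for each
# sample where something was swapped in or out. Header lines are skipped
# because their first field is not a number.
swapping_samples() {
    awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) { print $7, $8 }'
}

# Made-up sample: only the second data row shows swap activity.
printf '%s\n' \
  'procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----' \
  ' r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st' \
  ' 1  0      0    412     20    310    0    0     5     3   50   80  2  1 97  0  0' \
  ' 0  2    180      6      1     12   35   60  3500  6100  400  900  3  9 20 68  0' \
  | swapping_samples
```

Against a live system the same filter could be used as `vmstat -SM 1 | swapping_samples`; a quiet terminal then means no swapping was observed.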

27. Exhibit #3 Try uptime, free -m, df -h, top again Also, try: iostat -kx 1 Watch the %util column as this test runs Also watch the bi/bo columns in vmstat What is going on here?

28. Results #3 Disk activity is high RAM is not full CPU is not pegged Machine responds well Disk utilization at 100%
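To pick saturated disks out of a stream of iostat output, one option is to filter on the %util column. The sketch below assumes %util is the last column of the device table, which is how the sysstat versions I am aware of print it; the function name, the default threshold of 90, and the abbreviated sample columns are all illustrative.

```shell
# busy_disks [THRESHOLD]: read `iostat -x` output on stdin and print the
# device name and %util for devices at or above THRESHOLD (default 90).
busy_disks() {
    awk -v t="${1:-90}" '
        /^Device/ { in_dev = 1; next }   # device table starts after this header
        NF == 0   { in_dev = 0; next }   # a blank line ends the table
        in_dev && $NF + 0 >= t { print $1, $NF }
    '
}

# Made-up sample with fewer columns than real iostat output.
printf '%s\n' \
  'Device r/s w/s rkB/s wkB/s %util' \
  'sda 2.00 310.00 8.00 98000.00 100.00' \
  'sdb 0.00 1.00 0.00 4.00 0.40' \
  | busy_disks 90
```

With the sample input this prints `sda 100.00`. On a live machine it could be run as `iostat -kx 1 | busy_disks`.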

29. What does it mean? Based on the various data you’ve gathered, is the machine healthy and happy with this program running on it? Why or why not? Discussion.

30. Solutions? This program is using more RAM, CPU, or disk I/O than the machine has available Program can be optimized to use less Machine can be upgraded to have more Simple problem, straightforward solutions (Straightforward does not always mean easy)

31. Programs

32. Programs Software that runs on a machine Has traits such as single- or multi-threaded, compiled or interpreted, etc. Requires certain resources and inputs Makes certain outputs

33. More Constraints Programs have more constraints to consider Open files and sockets (file descriptors) Permissions (depend on user/group) CPU limits (depend on threads)

34. Exhibit #4 There’s a program running now, but something is wrong with it Use the usual tools (uptime, free -m, df -h, top) System looks OK...

35. File Limits Programs have certain limits Get the PID of the 4-files.pl program: ps aufx | grep 4-files Then view its limits: cat /proc/PID/limits

36. lsof See what files a program has open: lsof -np PID Whoa, lots! At the limit? Count them: lsof -np PID | wc -l
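To answer "At the limit?" the count needs something to be compared against. One way is to pull the soft "Max open files" value out of /proc/PID/limits, as in this sketch (the function name is made up, and the sample row uses illustrative values):

```shell
# soft_nofile: read the contents of /proc/PID/limits on stdin and print the
# soft "Max open files" limit, which is the 4th whitespace-separated field
# on that row (the 5th is the hard limit).
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# Sample row in the layout /proc/PID/limits uses.
printf 'Max open files            1024                 4096                 files\n' | soft_nofile
```

This prints `1024` for the sample. On a live system, `soft_nofile < /proc/PID/limits` gives the limit and `ls /proc/PID/fd | wc -l` gives a direct descriptor count. Note that lsof output also includes a header line and entries such as cwd, txt, and mem, so `lsof | wc -l` runs slightly higher than the true descriptor count.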

37. But... a problem? But is this a problem? Well, it is if the program is trying to open more files How do we tell? Software calls open, which is a system call

38. System Calls The kernel provides certain services Almost all I/O goes through the kernel Current time, fork, cd, exec, etc etc Requires a small context switch Can lead to “sys” CPU usage

39. strace System calls made by a process can be traced Let’s look at 4-files again: sudo strace -p PID Look at the “open” line, is it OK?

40. Results #4 Clearly this program is broken Several fixes... open fewer files, raise your limits, etc. (We won’t cover the specifics of raising limits; you can search Google if you need them)
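When a process has hit its descriptor limit, open fails with EMFILE ("Too many open files"). A busy trace can be narrowed down to just the failing open calls with a small filter like the sketch below; the function name and the sample paths are invented, not taken from the demo VM.

```shell
# failed_opens: read strace output on stdin and keep only open/openat calls
# that returned -1 (that is, calls that failed).
failed_opens() {
    grep -E '^(open|openat)\(.*\) += -1 '
}

# Illustrative trace lines.
printf '%s\n' \
  'open("/tmp/example-1", O_RDONLY) = 3' \
  'open("/tmp/example-2", O_RDONLY) = -1 EMFILE (Too many open files)' \
  'close(3) = 0' \
  | failed_opens
```

Only the EMFILE line survives the filter. Because strace writes its trace to standard error, a live invocation would look like `sudo strace -p PID 2>&1 | failed_opens`.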

41. It’s all turtles. Linux uses “files” and “filesystems” a lot Sockets are just “files”, they use the same file descriptor number space Result: “Max open files” includes sockets They show up in lsof, too!

42. Exhibit #5 Let me give us a new program Get the PID, remember how? ps aufx | grep 5-network Look at the files: lsof -np PID Note the “TCP” file!

43. Test the Server telnet 182.255.123.52 7000 (This server is slow, it might take a bit) A very simple timeserver Now: strace -p PID

44. The Trace
accept(3, {sa_family=AF_INET, sin_port=htons(39474), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "The time is: Wed Feb 6 13:34:22"..., 38) = 38
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7fff73f28880) = 0
write(4, "Thank you for visiting!\n", 24) = 24
close(4) = 0

53. Results #5 Tracing shows you data, too Can be very valuable for finding moving parts that aren’t moving well Combined with the other tools you can really see what is going on in your system

54. Kernel

55. Invisible Glue Kernel issues are fairly rare, but usually frustrating if they show up Usually the result of some sort of limit hit Tons of caches, buckets, and limits Be suspicious of “powers of two” numbers

56. Common Checks Try: sudo dmesg Kernel message log shows many problems Look for suspicious messages

57. “Suspicious” Out of memory: Kill process 19393 (2-memory.pl) score 90 or sacrifice child nf_conntrack: Table full, dropping packet ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
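A rough first pass over the kernel log can be automated by grepping for phrases like the ones on this slide. The pattern list below is only a starting point built from those three examples, the function name is made up, and the `eth0: link up` line is an invented benign message included for contrast.

```shell
# suspicious_kernel_lines: read kernel log text on stdin and keep lines that
# match a few case-insensitive patterns drawn from the slide's examples.
suspicious_kernel_lines() {
    grep -Ei 'out of memory|table full|dropping packet|exception|frozen'
}

printf '%s\n' \
  'Out of memory: Kill process 19393 (2-memory.pl) score 90 or sacrifice child' \
  'nf_conntrack: Table full, dropping packet' \
  'eth0: link up' \
  | suspicious_kernel_lines
```

The first two lines are printed and the third is dropped. On a live machine this could be run as `sudo dmesg | suspicious_kernel_lines`.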

58. More Places to Look The /var/log directory has much data Generally in a problem state, look for recently updated files: ls -lart Loud logs are often unhappy logs Hardware failure is often noted in one of the log files
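`ls -lart` sorts by modification time with the newest files at the bottom. As a complement, `find` with `-mmin` can list only the files touched within a recent window, including those in subdirectories. A minimal sketch (the function name is made up for illustration):

```shell
# recent_files DIR MINUTES: list regular files under DIR that were modified
# within the last MINUTES minutes.
recent_files() {
    find "$1" -type f -mmin "-$2"
}
```

For example, `recent_files /var/log 10` would show log files written to in the last ten minutes; reading some of them may require sudo.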

59. Summary

60. Process Check the components: CPU, RAM, disks Find what limits are being hit and by what If the system is fine, it’s probably software Trace the program, check the logs Analyze well before you fix

61. Familiarity! Systems administration done only as an afterthought will be painful and hard Be familiar with your servers and your software Keep a shell open, watch top throughout the day, watch the disks, etc

62. Next Steps Certain tools make life easier Nagios for monitoring (e.g., alert you when CPU exceeds 90%) Cacti/Ganglia/OpenTSDB for trending Fabric for multiple machine operations Puppet/Chef for configuration management
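Monitoring checks of the "alert when CPU exceeds 90%" kind usually boil down to a threshold comparison plus an exit status. The sketch below follows the Nagios plugin convention of exiting 0 for OK, 1 for WARNING, and 2 for CRITICAL; the function name and the 80% warning threshold are assumptions for illustration, and this is not an actual Nagios plugin.

```shell
# check_cpu VALUE WARN CRIT: compare a CPU-busy percentage to two thresholds,
# print a one-line status, and return 0 (OK), 1 (WARNING), or 2 (CRITICAL).
check_cpu() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v > c)      { print "CRITICAL - CPU " v "%"; exit 2 }
        else if (v > w) { print "WARNING - CPU " v "%";  exit 1 }
        print "OK - CPU " v "%"; exit 0
    }'
}

# 95% busy with a warning threshold of 80 and a critical threshold of 90.
check_cpu 95 80 90 || echo "exit status: $?"
```

The example prints `CRITICAL - CPU 95%` followed by `exit status: 2`.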

63. Thanks! Questions?
