Servers and Processes: Behavior and Analysis

Published in: Technology

  1. 1. Servers and Processes: Behavior and Analysis
  2. 2. The Next 90 Minutes
       • Introduction
       • Servers, a mental model
       • Getting hands on
       • Processes
       • Wrapping it up
  3. 3. Caveats
       • Tutorial aimed at people barely familiar with Linux consoles
       • Little server knowledge is assumed
       • Many advanced things are glossed over... but feel free to ask!
       • The slides will be available online
  4. 4. Your Presenter
       • Mark Smith <mark@dreamwidth.org>
       • Co-founded Dreamwidth Studios, but works at Bump Technologies (http://bu.mp/)
       • Spent time at Google, Mozilla, others
       • Sysadmin, MySQL DBA, engineer, ...
  5. 5. Servers
  6. 6. Servers
       • Machines that take input and make output
       • Made up of components: RAM, CPU, I/O
       • Each component has various capacities
       • Systems Administration: the understanding, care, and feeding of all these disparate components (among other things)
  7. 7. Components
       • Capacity
       • Latency
       • Throughput
       • Full state
       • Thrash state
  8. 8. RAM
       • Capacity measured in bytes (GB usually)
       • Latency measured in nanoseconds
       • Throughput measured in bytes/second
       • Full state: can’t add more, but no real loss of performance
       • Thrash state: not very relevant
  9. 9. Disk (Rotational)
       • Capacity measured in bytes (GB or TB)
       • Latency measured in milliseconds
       • Throughput measured in bytes/second
       • Full state: can’t add more, but otherwise fine
       • Thrash state: server and process starvation, performance drops drastically
  10. 10. Disk (SSD)
       • Capacity measured in bytes (GB or TB)
       • Latency measured in microseconds (roughly 60x faster than rotational disks; see the timing slides)
       • Throughput measured in bytes/second
       • Full state: can’t add more, but otherwise fine
       • Thrash state: obviated by lack of rotation
  11. 11. CPU
       • Capacity is roughly approximated by clock speed in hertz (MHz, GHz, etc): cycles per second, which is related to but not the same as operations per second
       • Throughput and latency of a CPU are very advanced things most sysadmins don’t need to worry about (e.g., optimizing for L1 cache and local RAM in NUMA systems)
       • Full/thrash state: system/process starvation
  12. 12. Network
       • Capacity not relevant
       • Latency measured in milliseconds (usually)
       • Throughput measured in bits/second, usually 1 Gbps (10 Gbps becoming common)
       • Full state: dropped packets, behavior depends on protocol (e.g., TCP or UDP)
       • Thrash state: not relevant
  13. 13. Timing Comparisons
       • 1 second: tick, tock, tick, tock, ...
       • 1,000 milliseconds (ms) per second
       • 1,000,000 microseconds (µs) per second
       • 1,000,000,000 nanoseconds (ns) per second
  14. 14. Timing (Part 2)
       • One seek on a rotational disk is ~6ms
       • SSD seeks are about 100µs: 60x faster than a rotational seek
       • RAM seeks are about 60ns: roughly 1,667x faster than an SSD seek (100,000x faster than a rotational seek!)
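
The ratios on the timing slide can be checked with a few lines of arithmetic. This is a minimal sketch: the three latency figures are the approximate values quoted on the slide, and the constant and function names are made up for illustration.

```python
# Representative latencies taken from the timing slide, in nanoseconds.
ROTATIONAL_SEEK_NS = 6_000_000  # ~6 ms
SSD_SEEK_NS = 100_000           # ~100 µs
RAM_ACCESS_NS = 60              # ~60 ns


def speedup(slower_ns: float, faster_ns: float) -> float:
    """Return how many times faster the second latency is than the first."""
    return slower_ns / faster_ns


if __name__ == "__main__":
    print(f"SSD vs rotational: {speedup(ROTATIONAL_SEEK_NS, SSD_SEEK_NS):,.0f}x")
    print(f"RAM vs SSD:        {speedup(SSD_SEEK_NS, RAM_ACCESS_NS):,.0f}x")
    print(f"RAM vs rotational: {speedup(ROTATIONAL_SEEK_NS, RAM_ACCESS_NS):,.0f}x")
```

Converting everything to one unit first (nanoseconds here) is what keeps the comparison honest; mixing ms and µs is the usual source of off-by-1000 mistakes.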
  15. 15. Hands On Time!
  16. 16. SSH to the VM
       • Open your local terminal (PuTTY in Windows, iTerm/Terminal/etc in Mac OS X, whatever you like in Linux)
       • ssh -p 2222 demo@182.255.123.52
       • Password is “demo”
       • Please be nice :)
  17. 17. It’s dark in here.
       • Heartbeat the machine
       • uptime: How’s it doing?
       • free -m: How’s the RAM?
       • df -h: How’re the disks?
  18. 18. Load Average
       • It’s a seat-of-the-pants number
       • Rule of thumb: low is good, high might be bad
       • You have to learn how your machines work for this number to mean much
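
One way to make the load average a little less seat-of-the-pants is to divide it by the number of CPUs. The sketch below assumes a Linux machine with `/proc/loadavg`; the per-CPU normalization is a common rule of thumb rather than something the slides prescribe, and the function names are hypothetical.

```python
import os


def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Parse the first three fields of /proc/loadavg (1, 5, 15 minute averages)."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def load_per_cpu(load: float, cpu_count: int) -> float:
    """Normalize a load average by the number of CPUs."""
    return load / max(cpu_count, 1)


if __name__ == "__main__":
    with open("/proc/loadavg") as f:  # Linux only
        one, five, fifteen = parse_loadavg(f.read())
    cpus = os.cpu_count() or 1
    print(f"1-minute load {one:.2f} on {cpus} CPU(s): {load_per_cpu(one, cpus):.2f} per CPU")
```

Even normalized, the number only means something once you know what it usually looks like on that particular machine.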
  19. 19. Top of the World
       • Easy way to see what’s running and what is consuming the most resources
       • top
       • Press “P” to sort by Processor usage
       • Press “M” to sort by Memory usage
  20. 20. Exhibit #1
       • Now I will do something on the machine
       • Run through your heartbeat steps again: uptime, free -m, df -h, top
       • Remember to sort top by P and M
       • What has changed? What is going on?
  21. 21. Results #1
       • You probably noticed 1-cpu.pl
       • It’s pushing the CPU to 100%
       • Is it broken? Is this bad?
       • Know your software and systems (very important to know what normal is)
  22. 22. Exhibit #2
       • Now I will do something else
       • Run through your heartbeat steps again: uptime, free -m, df -h, top
       • Remember to sort top by P and M
       • What has changed? What is going on?
  23. 23. Results #2
       • Lots of memory is being consumed
       • It’s some 2-memory.pl command
       • Does the machine feel sluggish? Each command takes a second to start and stop?
       • What is going on here?
  24. 24. vmstat
       • The vmstat tool tells us useful things about the state of the kernel and resource usage
       • Try: vmstat -SM 1
       • Watch while I run the test again
       • Note the si/so and bi/bo columns
       • Now notice the CPU columns on the right
  25. 25. Swap
       • RAM is a finite resource
       • Not all RAM is used equally
       • Kernel tracks usage of pages
       • Kernel can write RAM to disk and free it up
       • This is called swapping: you store RAM on disk. Remember the timing slide!
  26. 26. Swap (Part 2)
       • Swap is useful mostly on consumer machines
       • In most server environments, swap is death
       • Disks are hundreds to thousands of times (or more!) slower than RAM
       • Generally, any active swapping is bad
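
The "any active swapping is bad" rule can be turned into a small check on vmstat output. This is a sketch only: it reads the column names from vmstat's own header line instead of assuming positions, the helper names are hypothetical, and the sample rows are made-up numbers for illustration, not output captured from the demo machine.

```python
def parse_vmstat_row(header: str, row: str) -> dict[str, int]:
    """Map vmstat column names (from its header line) to the values in one row."""
    return {name: int(value) for name, value in zip(header.split(), row.split())}


def is_swapping(sample: dict[str, int]) -> bool:
    """True if the sample shows any swap-in (si) or swap-out (so) activity."""
    return sample.get("si", 0) > 0 or sample.get("so", 0) > 0


if __name__ == "__main__":
    # Illustrative values only.
    header = " r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st"
    quiet = " 1  0      0    512     20    300    0    0     5    10  100  200  3  1 95  1  0"
    busy = " 2  1    900     12      1     40   35  120  4000  5200  300  800  5 10 20 65  0"
    for label, row in (("quiet", quiet), ("busy", busy)):
        print(label, "swapping" if is_swapping(parse_vmstat_row(header, row)) else "ok")
```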
  27. 27. Exhibit #3
       • Try uptime, free -m, df -h, top again
       • Also, try: iostat -kx 1
       • Watch the %util column as this test runs
       • Also watch the bi/bo columns in vmstat
       • What is going on here?
  28. 28. Results #3
       • Disk usage is high
       • RAM is not full
       • CPU is not pegged
       • Machine responds well
       • Disk utilization at 100%
  29. 29. What does it mean?
       • Based on the various data you’ve gathered, is the machine healthy and happy with this program running on it?
       • Why or why not?
       • Discussion.
  30. 30. Solutions?
       • This program is using more of a resource (RAM, CPU, or disk I/O) than the machine has available
       • Program can be optimized to use less
       • Machine can be upgraded to have more
       • Simple problem, straightforward solutions
       • (Straightforward does not always mean easy)
  31. 31. Programs
  32. 32. Programs
       • Software that runs on a machine
       • Has traits such as single- or multi-threaded, compiled or interpreted, etc
       • Requires certain resources and inputs
       • Makes certain outputs
  33. 33. More Constraints
       • Programs have more constraints to consider
       • Open files and sockets (file descriptors)
       • Permissions (depend on user/group)
       • CPU limits (depends on threads)
  34. 34. Exhibit #4
       • There’s a program running now, but something is wrong with it
       • Use the usual tools (uptime, free -m, df -h, top)
       • System looks OK...
  35. 35. File Limits
       • Programs have certain limits
       • Get the PID of the 4-files.pl program
       • ps aufx | grep 4-files
       • cat /proc/PID/limits
  36. 36. lsof
       • See what files a program has open
       • lsof -np PID
       • Woah, lots! At the limit? Count them: lsof -np PID | wc -l
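
Comparing the open-file count against the limit can also be scripted. The sketch below is Linux-only and uses hypothetical helper names: it reads the "Max open files" row from `/proc/PID/limits` and counts the entries in `/proc/PID/fd`. Note that `lsof | wc -l` tends to overcount slightly, since lsof prints a header line and also lists entries such as the working directory and mapped libraries that are not file descriptors.

```python
import os


def parse_max_open_files(limits_text: str) -> tuple[str, str]:
    """Return (soft, hard) from the 'Max open files' row of /proc/PID/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Values are kept as strings because a limit may read "unlimited".
            soft, hard = line[len("Max open files"):].split()[:2]
            return soft, hard
    raise ValueError("no 'Max open files' row found")


def count_open_fds(pid: int) -> int:
    """Count entries in /proc/PID/fd (Linux only; needs permission to read it)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    pid = os.getpid()  # inspect this script itself; substitute the PID you care about
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_max_open_files(f.read())
    print(f"open descriptors: {count_open_fds(pid)}, soft limit: {soft}, hard limit: {hard}")
```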
  37. 37. But... a problem?
       • But is this a problem? Well, it is if the program is trying to open more files
       • How do we tell?
       • Software calls open, which is a system call
  38. 38. System Calls
       • The kernel provides certain services
       • Almost all I/O goes through the kernel
       • Current time, fork, cd, exec, etc.
       • Requires a small context switch
       • Can lead to “sys” CPU usage
  39. 39. strace
       • System calls made by a process can be traced
       • Let’s look at 4-files again: sudo strace -p PID
       • Look at the “open” line, is it OK?
  40. 40. Results #4
       • Clearly this program is broken
       • Several fixes... open fewer files, raise your limits, etc
       • (We won’t cover the specifics of raising limits, you can search Google if you need it)
  41. 41. It’s all turtles.
       • Linux uses “files” and “filesystems” a lot
       • Sockets are just “files”, they use the same file descriptor number space
       • Result: “Max open files” includes sockets
       • They show up in lsof, too!
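
The shared descriptor space is easy to see from a script. This minimal sketch (the function name is made up for illustration) opens one regular file and one TCP socket and returns the descriptor number the operating system assigned to each; both are plain integers drawn from the same per-process pool.

```python
import os
import socket


def open_file_and_socket() -> tuple[int, int]:
    """Open one regular file and one TCP socket; return their descriptor numbers."""
    with open(os.devnull) as f, socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return f.fileno(), s.fileno()


if __name__ == "__main__":
    file_fd, sock_fd = open_file_and_socket()
    print(f"file descriptor: {file_fd}, socket descriptor: {sock_fd}")
```

Because both count against the same limit, a server that leaks sockets runs out of descriptors in exactly the same way as a program that leaks files.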
  42. 42. Exhibit #5
       • Let me give us a new program
       • Get the PID, remember how?
       • ps aufx | grep 5-network
       • Look at the files: lsof -np PID
       • Note the “TCP” file!
  43. 43. Test the Server
       • telnet 182.255.123.52 7000
       • (This server is slow, it might take a bit)
       • A very simple timeserver
       • Now: strace -p PID
  44. 44. The Trace
       accept(3, {sa_family=AF_INET, sin_port=htons(39474), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
       ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
       lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
       ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
       lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
       fcntl(4, F_SETFD, FD_CLOEXEC) = 0
       stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
       write(4, "The time is: Wed Feb 6 13:34:22"..., 38) = 38
       rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
       rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
       rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
       nanosleep({1, 0}, 0x7fff73f28880) = 0
       write(4, "Thank you for visiting!\n", 24) = 24
       close(4) = 0
  53. 53. Results #5
       • Tracing shows you data, too
       • Can be very valuable for finding moving parts that aren’t moving well
       • Combined with the other tools you can really see what is going on in your system
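
To connect the Exhibit #5 trace to code, here is a small Python stand-in for a timeserver of that shape. It is not the 5-network program from the demo (that one is a Perl script whose source is not shown); it is only a sketch that follows the same visible sequence: accept a connection, write the time, pause one second, write a goodbye, close. All names are hypothetical.

```python
import socket
import threading
import time


def serve_once(listener: socket.socket, delay: float = 1.0) -> None:
    """Accept one connection, send the time, pause, say goodbye, close."""
    conn, _addr = listener.accept()  # like accept(3, ...) = 4 in the trace
    with conn:
        conn.sendall(f"The time is: {time.ctime()}\n".encode())  # first write
        time.sleep(delay)  # the pause that shows up as nanosleep in the trace
        conn.sendall(b"Thank you for visiting!\n")  # second write
    # Leaving the with-block closes the connection, like close(4).


def fetch(port: int) -> str:
    """Connect to the local server and read until it closes the connection."""
    chunks = []
    with socket.create_connection(("127.0.0.1", port)) as client:
        while data := client.recv(1024):
            chunks.append(data)
    return b"".join(chunks).decode()


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(listener,))
        worker.start()
        print(fetch(port), end="")
        worker.join()
```

Running this under strace should show a broadly similar accept/write/sleep/write/close pattern, though the exact system calls will differ by language runtime and platform.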
  54. 54. Kernel
  55. 55. Invisible Glue
       • Kernel issues are fairly rare, but usually frustrating if they show up
       • Usually the result of some sort of limit hit
       • Tons of caches, buckets, and limits
       • Be suspicious of “powers of two” numbers
  56. 56. Common Checks
       • Try: sudo dmesg
       • Kernel message log shows many problems
       • Look for suspicious messages
  57. 57. “Suspicious”
       • Out of memory: Kill process 19393 (2-memory.pl) score 90 or sacrifice child
       • nf_conntrack: Table full, dropping packet
       • ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
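
Scanning the kernel log for messages like these can be automated. The sketch below only knows the three patterns from the slide; the labels, the pattern table, and the script name in the usage note are assumptions for illustration, and a real checklist would grow with experience.

```python
import re
import sys

# Patterns drawn from the three example messages on the slide; extend as needed.
SUSPICIOUS_PATTERNS = {
    "out of memory": re.compile(r"out of memory", re.IGNORECASE),
    "conntrack table full": re.compile(r"nf_conntrack: table full", re.IGNORECASE),
    "ATA exception": re.compile(r"ata\d+(\.\d+)?: exception", re.IGNORECASE),
}


def find_suspicious(log_text: str) -> list[tuple[str, str]]:
    """Return (label, line) pairs for every line matching a known pattern."""
    hits = []
    for line in log_text.splitlines():
        for label, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.strip()))
    return hits


if __name__ == "__main__":
    # Reads the log from standard input.
    for label, line in find_suspicious(sys.stdin.read()):
        print(f"[{label}] {line}")
```

Assuming the code is saved as `scan_dmesg.py` (a hypothetical name), it could be fed with `sudo dmesg | python3 scan_dmesg.py`.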
  58. 58. More Places to Look
       • The /var/log directory has much data
       • Generally in a problem state, look for recently updated files: ls -lart
       • Loud logs are often unhappy logs
       • Hardware failure is often noted in one of the log files
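
The "recently updated files" check can be scripted as well. This is a rough equivalent of eyeballing `ls -lart`, with one difference: `ls -lart` lists the newest files last, while this hypothetical helper returns them first. It looks only at regular files directly inside the directory, not subdirectories.

```python
from pathlib import Path


def recently_modified(directory: str, limit: int = 10) -> list[Path]:
    """Return up to `limit` regular files in `directory`, newest first."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)[:limit]


if __name__ == "__main__":
    for path in recently_modified("/var/log"):
        print(path)
```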
  59. 59. Summary
  60. 60. Process
       • Check the components: CPU, RAM, disks
       • Find what limits are being hit and by what
       • If the system is fine, it’s probably software
       • Trace the program, check the logs
       • Analyze well before you fix
  61. 61. Familiarity!
       • Systems administration done only as an afterthought will be painful and hard
       • Be familiar with your servers and your software
       • Keep a shell open, watch top throughout the day, watch the disks, etc
  62. 62. Next Steps
       • Certain tools make life easier
       • Nagios for monitoring (e.g., alert you when CPU exceeds 90%)
       • Cacti/Ganglia/OpenTSDB for trending
       • Fabric for multiple machine operations
       • Puppet/Chef for configuration management
  63. 63. Thanks! Questions?
