Bugs from Outer Space | while42 SF #6


Presentation by Jerome Petazzoni for while42 San Francisco #6 at Kwarter


Bugs from Outer Space | while42 SF #6

  1. Bugs From Outer Space — While42 SF chapter, #6
  2. Why this talk? Codito, ergo erro: I code, therefore I make mistakes.
  3. Outline: I'll show some really nasty bugs and tell stories of inglorious battles (some of which I've actually fought!). Featuring: Node.js, EC2, LXC, pseudo-terminals; and also: hardware bugs, dangerous bugs...
  4. Our files, Node.js is truncating them! It all starts with an angry customer: “Sometimes, downloading this 700 KB JSON file will fail, because it’s truncated!” But… Do you even Content-Length? (The client library should scream, but it doesn’t.)
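The check the client library should have been doing is trivial; here is a minimal sketch in Python (illustrative only; the actual stack was Node.js, and the function name is made up):

```python
def check_complete(headers, body):
    """Return True if the body length matches the declared Content-Length.

    A well-behaved HTTP client should perform this check and raise on a
    mismatch; the client library in the story silently accepted truncated
    bodies instead.
    """
    declared = headers.get("Content-Length")
    if declared is None:
        return True  # no declared length, nothing to check against
    return len(body) == int(declared)

# A truncated 700 KB download: 700 * 1024 bytes promised, fewer delivered.
headers = {"Content-Length": str(700 * 1024)}
assert check_complete(headers, b"x" * 700 * 1024)
assert not check_complete(headers, b"x" * 512 * 1024)
```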
  5. Gotta Sniff Some Packets. Log into the load balancer (running Hipache)...
     # ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80
     interface: any
     filter: (ip or ip6) and ( tcp port 80 )
     match: /api/v1/download-all-the-things
     ####
     T 2013/08/22 04:11:27.848663 -> [AP]
     GET /api/v1/download-all-the-things.json HTTP/1.0.
     Host: angrystartup.com
     X-Forwarded-Port: 443.
     X-Forwarded-For: ::ffff:
     X-Forwarded-Proto: https.
     ...
  6. Ngrep Doesn’t Cut It. FETCH THE WIRESHARKS!
     # tcpdump -peni any -s0 -wdump tcp port 80
     (Wait a bit) ^C
     Transfer the dump file. DEMO TIME!
  7. What did we find out? Truncated files happen because a chunk (probably exactly one) gets dropped. Impossible to reproduce locally; only the customer sees the problem. THE PLOT THICKENS. GET YOUR SWIMSUITS, WE’RE DIVING INTO CODE!
  8. This is Node.js. I have no idea what I’m doing. Add console.log() statements in Hipache. Add console.log() statements in node-http-proxy. Add console.log() statements in node/lib/http.js. The latter didn’t work. “Fix”: replace require(‘http’) with require(‘_http’) and add our own _http.js to our node_modules. Do the same for net.js (in “our” _http.js).
  9. It’s all in the pauses. Backend sends lots of data to Hipache. Hipache sends data to the client, but the client is slow. Hipache “pauses” the backend stream (i.e. stops reading from the network socket). When the client has read enough data, Hipache “resumes” the stream. Etc. SO FAR, SO GOOD.
  10. It’s all in the awkward …………… pauses. There are two layers in Node: tcp and http. When the tcp layer reads the last chunk, the socket is closed by the backend. The tcp layer notices, and emits an “end” event. The “end” event causes the http layer to finish what it was doing, without sending a “resume”. As a result, some chunks remain in the buffers.
  11. How do we fix this? Pester the Node.js folks. Catch that “end” event and, when it happens, send a “resume” to the stream to drain it. (Implementation detail: you only have the http socket, and you need to listen for an event on the tcp socket, so you need to do slightly dirty things with the http socket. But eh, it works!)
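The pause/resume dance and the drain-on-end fix can be sketched as a toy model (plain Python, not real Node.js stream code; the class and method names are invented for illustration):

```python
class ProxyStream:
    """Toy model of the Hipache bug: a proxy pauses an upstream buffer
    when the client is slow, and must remember to drain that buffer when
    the upstream signals "end"."""

    def __init__(self):
        self.buffer = []       # chunks read from the backend
        self.delivered = []    # chunks forwarded to the client
        self.paused = False

    def receive(self, chunk):
        self.buffer.append(chunk)
        if not self.paused:
            self.flush()

    def pause(self):
        self.paused = True     # slow client: stop forwarding

    def resume(self):
        self.paused = False
        self.flush()

    def flush(self):
        self.delivered.extend(self.buffer)
        self.buffer.clear()

    def end(self, drain_on_end):
        # The bug: the tcp layer's "end" event finished the response
        # without a final resume(), stranding chunks in the buffer.
        if drain_on_end:       # the fix: drain when "end" fires
            self.resume()

buggy = ProxyStream()
buggy.receive("chunk1")
buggy.pause()                  # client is slow
buggy.receive("chunk2")        # last chunk arrives while paused
buggy.end(drain_on_end=False)  # backend closes; no resume
# buggy.delivered == ["chunk1"]: the file is truncated

fixed = ProxyStream()
fixed.receive("chunk1")
fixed.pause()
fixed.receive("chunk2")
fixed.end(drain_on_end=True)
# fixed.delivered == ["chunk1", "chunk2"]: complete
```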
  12. What did we learn? When you can’t reproduce a bug at will, record it in action (tcpdump) and dissect it (wireshark). Spraying code with print statements helps. (But it’s better to use the logging framework!) You don’t have to know Node.js to fix Node.js!
  13. Hardware has bugs, too. Pentium FDIV bug (1994): division errors at the 4th decimal place. Pentium F00F bug (1997): a single invalid instruction hangs the machine. ATA transfer speeds vary when you touch the ribbon cables (SATA was introduced in 2003).
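The FDIV flaw was easy to demonstrate: divide one of the known trigger values and multiply back. On a correct FPU the residue is essentially zero; a flawed Pentium famously returned 256 for this expression, because the quotient was wrong around the 4th decimal place:

```python
# The widely circulated FDIV check from 1994: on a correct FPU the
# residue below is (up to rounding) zero; a flawed Pentium returned 256.
x, y = 4195835.0, 3145727.0
residue = x - (x / y) * y
print(residue)  # essentially 0.0 on any non-flawed FPU
```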
  14. A story of Go, PTYs, and LXC: it never works the first time.
      # docker run -t -i ubuntu echo hello world
      2013/08/06 23:20:53 Error: Error starting container 06d642aae1a: fork/exec /usr/bin/lxc-start: operation not permitted
      # docker run -t -i ubuntu echo hello world
      hello world
      # docker run -t -i ubuntu echo hello world
      hello world
      # docker run -t -i ubuntu echo hello world
      hello world
      # docker run -t -i ubuntu echo hello world
      hello world
  15. Strace to the rescue! Steps:
      1. Boot the machine.
      2. Find the pid of the process to analyze. (ps | grep, pidof docker...)
      3. “strace -o log -f -p $PID”
      4. “docker run -t -i ubuntu echo hello world”
      5. Ctrl-C the strace process.
      6. Repeat steps 3-4-5, using a different log file.
  16. Let’s compare the log files. Thousands and thousands of lines. Look for the error message (e.g. “operation not permitted”). Other approach: start from the end, and try to find the point where things started to diverge. That’s why we have dual 30” monitors.
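The "find where the runs diverge" step can also be automated instead of eyeballed. A small sketch (the function names are made up, and blanking out pids and fd numbers before comparing is an assumption about what legitimately varies between runs):

```python
import re

def normalize(line):
    """Blank out pids, fds, and other numbers so two runs are comparable."""
    return re.sub(r"\d+", "N", line)

def first_divergence(log_a, log_b):
    """Index of the first line where two normalized logs differ,
    or None if one is a prefix of the other."""
    for i, (a, b) in enumerate(zip(log_a, log_b)):
        if normalize(a) != normalize(b):
            return i
    return None

# Abbreviated lines in the spirit of the strace output from the talk:
first_run = [
    "setsid() = 1331",
    "dup2(10, 0) = 0",
    "ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)",
    "_exit(253) = ?",
]
second_run = [
    "setsid() = 1414",
    "dup2(14, 0) = 0",
    "ioctl(0, TIOCSCTTY) = 0",
    'execve("/usr/bin/lxc-start", ...)',
]
print(first_divergence(first_run, second_run))  # 2: the ioctl result differs
```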
  17. Investigation results.
      First time:
      [pid 1331] setsid() = 1331
      [pid 1331] dup2(10, 0) = 0
      [pid 1331] dup2(10, 1) = 1
      [pid 1331] dup2(10, 2) = 2
      [pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)
      [pid 1331] write(12, "10000000", 8) = 8
      [pid 1331] _exit(253) = ?
      Second time (and every following attempt):
      [pid 1414] setsid() = 1414
      [pid 1414] dup2(14, 0) = 0
      [pid 1414] dup2(14, 1) = 1
      [pid 1414] dup2(14, 2) = 2
      [pid 1414] ioctl(0, TIOCSCTTY) = 0
      [pid 1414] execve("/usr/bin/lxc-start", ["lxc-start", "-n", ...]) <...>
  18. What does that mean? For some reason, some part of the code wants file descriptor 0 (that’s stdin) to be a terminal. The first time we run, it fails, but in the process, we acquire a controlling terminal. (UNIX 101: when you don’t have a controlling terminal and open a file which is a terminal, it becomes your controlling terminal, unless you open the file with the O_NOCTTY flag.) Subsequent attempts therefore succeed.
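That UNIX 101 rule can be demonstrated directly. A Linux-only Python sketch (the helper names are invented): fork a child, call setsid to shed any controlling terminal, then open a fresh pty slave with and without O_NOCTTY and see whether a controlling terminal was acquired.

```python
import os
import pty

def has_ctty():
    """Return True if this process currently has a controlling terminal."""
    try:
        fd = os.open("/dev/tty", os.O_RDWR)
        os.close(fd)
        return True
    except OSError:
        return False

def run_in_new_session(extra_flags):
    """Fork; in the child, setsid() (new session, no controlling tty),
    open a brand-new pty slave with the given flags, and report back
    whether the child ended up with a controlling terminal."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child
        os.close(r)
        os.setsid()                       # become session leader, no ctty
        master, slave = pty.openpty()     # fresh pseudo-terminal pair
        fd = os.open(os.ttyname(slave), os.O_RDWR | extra_flags)
        os.write(w, b"1" if has_ctty() else b"0")
        os._exit(0)
    os.close(w)                           # parent
    result = os.read(r, 1)
    os.close(r)
    os.waitpid(pid, 0)
    return result == b"1"

print(run_in_new_session(os.O_NOCTTY))  # opening with O_NOCTTY: no ctty
print(run_in_new_session(0))            # opening without it: ctty acquired
```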
  19. … Really? To confirm that this is indeed the bug:
      ● start the process with “setsid” (which detaches it from the controlling terminal) and see that the bug is back;
      ● check the output of “ps” (it shows controlling terminals) and see that indeed, before the first execution, we didn’t have a controlling terminal, and we have one after!
  20. How to fix the bug? ¯\_(ツ)_/¯ I don’t know — yet! (The bug was diagnosed last week, and honestly, it’s not a showstopper.)
  21. What did we learn? strace is awesome for analyzing the behavior of running processes. ltrace can be used too, if you want to analyze library calls rather than system calls. If you’re really desperate, gdb is your friend. (A very peculiar friend, but a friend nonetheless.)
  22. “Errare humanum est, perseverare autem diabolicum.” “To err is human, but to really foul things up, you need a computer.”
  23. Really nasty (and sad) bug: the Therac-25. A radiotherapy machine (it shoots beams to treat cancer). Two modes: low energy and high energy. In high energy mode, a special filter is inserted. On earlier versions, a hardware interlock prevented the high energy beam from firing if the filter was not in place. On the Therac-25, it’s done in software.
  24. Konami Code of Death. On the keyboard, press (in less than 8 seconds): X ↑ E [ENTER] B ...and the high energy beam fires, unfiltered! 6 accidents; 3 people died. (This was 1985-1987.) Explanation: a race condition in the software. It never happened during tests, since the race only triggers when the sequence is typed that fast; testers never typed fast enough.
  25. Aggravating details. Many engineering and institutional issues (no software review, no evaluation of possible failure modes, undocumented error codes, no sensor feedback…). After entering the sequence and sending one beam, the machine would display an error. But errors happened “all the time” (usually without adverse effects), so the operator would just proceed (the equivalent of pressing “retry”).
  26. Let’s get back to weird Linux kernel bugs.
  27. Random crashes on EC2. A pool of ~50 identical instances, with the same role. Sometimes, one of them would crash. Total crash: no SSH, no ping, no logs, no nothing. The EC2 console won’t show anything. REPRODUCE THE BUG? IMPOSSIBURU!
  28. Try a million things... Different kernel versions. Different filesystem tunings. Different security settings (GRSEC). Different memory settings (overcommit, OOM). Different instance sizes. Different EBS volumes. Different differences. NOTHING CHANGED.
  29. And one fine day... A random test machine seemed to exhibit the bug very frequently (it would crash within a few days, sometimes just a few hours). CLONE IT! ONE MILLION TIMES!
  30. But, still... We changed everything (again), but we couldn’t find anything (again). So we did something completely crazy: we contacted AWS support (imagine that). They asked us to repeat the tests with an “official” image (AMI). This required porting our runtime from Ubuntu 10.04 to 12.04.
  31. And… (I’m running out of segues.) We re-ran the tests with the official image, the machine crashed, we left it in its crashed state, and support analyzed the image. Almost instantly, they told us: “oh yeah, it’s a known issue, see that link.” U SERIOUS?
  32. The explanation. The bug happens: ● on workloads using spinlocks intensively; ● only on Xen VMs with many CPUs. It is linked to the special implementation of spinlocks in Xen VMs. When waking up CPUs waiting on a spinlock, the code would only wake up the first one, even if there were multiple CPUs waiting.
  33. The patch (priceless):
      diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
      index d69cc6c..67bc7ba 100644
      --- a/arch/x86/xen/spinlock.c
      +++ b/arch/x86/xen/spinlock.c
      @@ -328,7 +328,6 @@ static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
               if (per_cpu(lock_spinners, cpu) == xl) {
                       ADD_STATS(released_slow_kicked, 1);
                       xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
      -                break;
               }
           }
       }
      --
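What that stray `break` did can be sketched with a toy model (plain Python, not real Xen code; the names are invented for illustration). The unlock path scans the CPUs spinning on a lock and kicks each one with an IPI; the bug kicked only the first one it found, leaving the others asleep forever:

```python
def unlock_slow(spinners, lock, buggy):
    """Toy model: return the set of CPUs woken when `lock` is released.

    `spinners` maps cpu id -> the lock that cpu is spinning on.
    With buggy=True, only the first matching CPU gets the IPI,
    mirroring the one-word bug removed by the patch.
    """
    woken = set()
    for cpu, waiting_on in sorted(spinners.items()):
        if waiting_on == lock:
            woken.add(cpu)            # models xen_send_IPI_one(cpu, ...)
            if buggy:
                break                 # the one-word bug
    return woken

spinners = {0: "lockA", 2: "lockA", 5: "lockA", 7: "lockB"}
print(unlock_slow(spinners, "lockA", buggy=True))   # {0}: CPUs 2 and 5 never wake
print(unlock_slow(spinners, "lockA", buggy=False))  # {0, 2, 5}
```

With enough CPUs contending on the same lock, a stranded waiter looks exactly like the symptom described: a machine that simply stops responding.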
  34. What did we learn? We didn’t try all the combinations. (Trying on HVM machines would have helped!) AWS support can be helpful sometimes. (This one was a surprise.) Trying to debug a kernel issue without console output is like trying to learn to read in the dark.
  35. Overall Conclusions. When facing a mystic bug from outer space:
      ● reproduce it at all costs!
      ● collect data with tcpdump, ngrep, wireshark, strace, ltrace, gdb; and log files, obviously!
      ● don’t be afraid of uncharted places!
      ● document it, at least with a 2 AM ragetweet!
  36. Thank you! Questions? Gotta follow them all: @kwarter @while_42 @GITSF @dot_cloud @docker. Your speaker today was: Jérôme Petazzoni, dotCloud (@jpetazzo).