
Debugging Complex Systems - Erlang Factory SF 2015

Debugging complex systems can be difficult. Luckily, the Erlang ecosystem is full of tools to help you out. With the right mindset and the right tools, debugging complex Erlang systems can be easy. In this talk, I'll share the debugging methodology I've developed over the years.


  1. Debugging Complex Systems — Louis-Philippe Gauthier, Director, Product Engineering
  2. (image)
  3. Let’s fix the process. Systematic debugging!
  4. Understand the system
     1. Be familiar with your stack: OS, VM, application, protocols, external services, etc.
     2. Know your tools:
        • Take time to experiment with new tools.
        • Match the tool to the bug.
     3. What are/were the requirements?
     4. Is it really a bug?
  5. Reproduce the bug
     1. What conditions trigger the bug:
        • function input?
        • invalid state?
        • an environment variable?
        • OS settings?
     2. Try reproducing locally.
     3. Try reproducing in production.
     4. Reproducibility greatly simplifies both debugging and validating the fix!
  6. Collect data
     1. Search on Google; it might be a known bug!
     2. Don’t jump to conclusions; use observations to guide your intuition (you’re always wrong!).
     3. Use debugging tools to get more insight.
     4. Filter out the noise (especially important for performance bugs!).
     5. Too much is like not enough.
  7. Use the process of elimination
     1. Divide and conquer!
     2. Start with macro observations:
        • Are all servers affected?
        • Are all datacenters affected?
        • Is there an external service involved?
     3. From there, narrow down your search using data!
     4. Watch out: one bug might be hiding another.
  8. Change one thing at a time
     1. Don’t take shortcuts; if you change too many variables, you won’t be able to correlate the results.
     2. Be smart: pick the one change that cuts the search space the most.
     3. To test different theories in parallel, create a branch for each change and deploy them on different nodes.
  9. Keep an audit trail
     1. Don’t trust your memory; you might get some facts wrong.
     2. It can help if you’re debugging a similar problem in the future.
     3. Useful for post-mortems!
     4. Lets you collaborate with co-workers (e.g. via Google Docs).
     5. Can be used to coach teammates.
  10. Verify your assumptions
     1. Check the basics:
        • What code is deployed?
        • Same VM version?
        • Same application config?
        • Same kernel version?
        • Same system config?
     2. Is the tool lying? Validate your tools!
     3. Backtrack and go over your audit trail; you might have missed something!
  11. Take a step back
     1. Step on your ego and ask a co-worker for help!
     2. Ask an expert.
     3. Sleep on it; your thoughts will be clearer in the morning.
  12. Validate your fix
     1. Is it a side effect? A Heisenbug?
     2. Did you really fix the root cause, or just work around it?
     3. Validate in production:
        1. start with one node per datacenter
        2. slowly roll out the fix and monitor
     4. Add regression tests if you can.
     5. …
     6. Go back to bed Zzzz.
  13. Systematic debugging rules
     1. Understand the system
     2. Reproduce the bug
     3. Collect data
     4. Use the process of elimination
     5. Change only one thing at a time
     6. Keep an audit trail
     7. Verify your assumptions
     8. Take a step back
     9. Validate your fix
  14. Tools of the trade
  15. Erlang interactive shell

      lpgauth # erl -remsh rtb-gateway -name shell -setcookie monster
      Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:32:32] [hipe] [kernel-poll:false]
      Eshell V6.3 (abort with ^G)
      1> Version = erlang:system_info(otp_release).
      "17"
      2> b().
      Version = "17"
      ok
      3> f(Version).
      ok
      4> b().
      ok
  16. Erlang interactive shell

      lpgauth # /etc/init.d/rtb-delivery shell
      Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:32:32] [hipe] [kernel-poll:false]
      Eshell V6.3 (abort with ^G)
      1> ets:i().
       id             name            type  size  mem   owner
       ----------------------------------------------------------------------------
       …
       swirl_flows    swirl_flows     set   15    1611  swirl_tracker
       swirl_mappers  swirl_mappers   set   15    481   swirl_tracker
       swirl_reducers swirl_reducers  set   0     316   swirl_tracker
       swirl_streams  swirl_streams   set   15    908   swirl_tracker
      2> ets:i(swirl_streams).
      3> ets:tab2list(swirl_streams).
  17. Erlang interactive shell

      lpgauth # /etc/init.d/rtb-gateway shell
      Erlang/OTP 17 [erts-6.3] [source] [64-bit] [smp:32:32] [hipe] [kernel-poll:false]
      Eshell V6.3 (abort with ^G)
      1> F = fun Loop() ->
      1>     rp(process_info(whereis(code_server), message_queue_len)),
      1>     timer:sleep(1000),
      1>     Loop()
      1> end.
      #Fun<erl_eval.44.90072148>
      2> F().
      {message_queue_len,0}
      {message_queue_len,2}
      {message_queue_len,0}
  18. Loggers
     1. io:format/2
     2. error_logger + SASL
     3. Lager
     Don’t forget system logs: /var/log/messages, /var/log/external_service.
     Good for: logic and typing bugs
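A minimal sketch of the three options above (the module name and the failing request are made up; the lager call assumes lager is a compiled-in dependency, since lager:error/2 requires its parse transform):

```erlang
-module(log_demo).
-compile({parse_transform, lager_transform}).  %% needed for lager:error/2
-export([report/1]).

report(Reason) ->
    %% io:format/2: quick and dirty, prints to the group leader's output.
    io:format("request failed: ~p~n", [Reason]),
    %% error_logger ships with OTP; SASL adds crash/progress report handlers.
    error_logger:error_msg("request failed: ~p~n", [Reason]),
    %% lager (third-party): severity levels, file backends, tracing.
    lager:error("request failed: ~p", [Reason]).
```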
  19. Metric collectors
     1. vmstats + statsderl + statsd
     2. folsom / statman / exometer
     3. collectd
     4. carbon + graphite
     5. many others!
     Good for: resource and performance bugs
  20. Debuggers
     1. erlang:trace/3
     2. dbg
     3. redbug
     4. SystemTap / DTrace / LTTng
     Good for: logic and typing bugs
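For comparison, a minimal dbg session and its redbug equivalent (module and function names are placeholders):

```erlang
%% dbg: low-level but always available with OTP.
dbg:tracer(),                    %% print trace messages to the shell
dbg:p(all, [c]),                 %% enable call tracing in all processes
dbg:tpl(my_module, my_fun, x),   %% 'x' shorthand: show arguments and return values
%% ... exercise the code ...
dbg:stop_clear().                %% stop tracing and clear all trace patterns

%% redbug: safer on production nodes (rate-limited, stops by itself).
redbug:start("my_module:my_fun -> return").
```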
  21. Program dumps
     1. erl_crash.dump
        • observer
        • recon / script
        • cat / grep / strings
     2. core dumps
     Good for: resource bugs
  22. Profilers
     1. fprof / eprof
     2. eflame
     3. SystemTap / DTrace / LTTng / perf
     4. many others!
     Good for: performance bugs
  23. System utilities
     Good for: resource and performance bugs
     Some of my favourites:
     1. top / htop
     2. ngrep / netstat / tcpdump
     3. strace
     4. iotop / lsof
  24. System utilities (screenshots)
  25. Static analyzers
     1. Dialyzer
     Good for: typing bugs
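Running Dialyzer typically looks like this (paths are illustrative; building the PLT is a slow one-time step):

```shell
# Build a persistent lookup table (PLT) for the OTP apps you depend on (once).
dialyzer --build_plt --apps erts kernel stdlib
# Analyze your compiled beams (or use --src to analyze sources directly).
dialyzer -r ebin/
```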
  26. Tools for the bug
     1. Logic: debuggers, loggers, shell
     2. Typing: debuggers, shell, static analyzers
     3. Resource: metric collectors, program dumps, shell, system utilities
     4. Performance: metric collectors, shell, system utilities
  27. If a request crashes and no one is around to monitor it, does it trigger an alert?
  28. Example #1: where do we start?
  29. Example #1
     1. Are the numbers in the database good? No => not a web-application bug.
     2. Are the numbers in the logs good? No => not an ETL bug.
     3. Are other services on the same box affected by missing log events? No => probably not a filesystem bug.
  30. Example #1

      1> Tid = rtb_gateway_counter:table_id(bid_metrics_counters).
      2> ets:tab2list(Tid).
      [{{6,4,undefined},0,0,6,6},{{6,3,2},6,6,12,12}]

      Is the data being aggregated? Yes. The bug is most likely in the function that serializes the ETS table to JSON.
  31. Example #1: let’s trace!

      1> redbug:start("rtb_gateway_counter:map_counters(bid_metrics_counters, '_', '_', '_')").
      09:07:20 <dead> {rtb_gateway_counter,map_counters,
                       [bid_metrics_counters,
                        [{{6,4,undefined},0,0,6,6},{{6,3,2},7,7,14,14}],
                        <0.215.0>, …]}
      09:08:20 <dead> {rtb_gateway_counter,map_counters,
                       [bid_metrics_counters,
                        [{{6,4,undefined},0,0,12,12},{{6,3,2},12,12,24,24}],
                        <0.215.0>, …]}

      The function should be calling itself recursively…
  32. (image)
  33. Example #2

      2014-07-31 19:50:39.915 [error] emulator Error in process <0.23676.390> on node
      'rtb-gateway' with exit value: {function_clause,[{cowboy_protocol,parse_uri_path,
      [<<0 bytes>>,{state,#Port<0.13093005>,ranch_tcp,[cowboy_router,cowboy_handler],false,
      [{listener,rtbgw_lb},{dispatch,[{'_',[],[{[<<13 bytes>>,exchange],
      [],rtb_gateway_notification_handler,[ewr,<<4 bytes>>]},{[<<5 bytes>>,exchange...

      Where do we start?
  34. Example #2
     1. Google the error in case it’s a known bug. Not really…
     2. Add extra logging in ranch to print out the request state when the bug occurs.
     3. Use ngrep to validate that some HTTP requests are malformed (e.g. improper Content-Length).
     The bug is non-trivial to reproduce, so let’s start by collecting data…
  35. Example #2
     1. Capture TCP streams of failing requests using tcpdump.
     2. Build a tool to replay the TCP dump (httpreplay).
     3. Replay the traffic capture…
     4. Can reproduce… but not deterministically.
     5. Stepped on my ego and passed the flag to a teammate.
  36. Example #2
     1. Teammate tried tracing the problem… no dice.
     2. Teammate took a step back…
  37. Take a step back
     Teammate had a eureka moment while driving: the socket in the cowboy req record is mutable!
     p.s. Thanks, Jeremie! :)
  38. (image)
  39. Example #3
     1. Receive a “service trouble” email from Dynect Concierge (DNS)…
     2. Receive a “DOWN/PROBLEM” email from Nagios…
     3. SSH to the server to find beam dumping an erl_crash.dump…
     Where do we start?
  40. Example #3: while we’re waiting for the VM to finish writing the erl_crash.dump, let’s check graphite.
  41. Example #3

      lpgauth # ./erl_crashdump_analyzer.sh /erl_crash.dump
      analyzing /erl_crash.dump, generated on: Mon Feb 16 10:57:22 2015
      Slogan: eheap_alloc: Cannot allocate 71302344 bytes of memory (of type "heap", thread 4).
      …
      Different message queue lengths (5 largest different):
      ===
          1 1357844
          7 2
         22 1
      10180 0
      …
      cat /erl_crash.dump | grep -10 1357844
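Before the node dies, the same offender can be found on the live system with recon (already in the deps on slide 46); the count 3 is arbitrary:

```erlang
%% recon:proc_count/2 returns the N processes with the highest value of
%% the given process_info key, here the longest message queues.
recon:proc_count(message_queue_len, 3).
```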
  42. Example #3

      loop(State) ->
          receive
              Msg ->
                  {ok, State2} = handle_msg(Msg, State),
                  loop(State2)
          end.

      handle_msg({call, Ref, From, Msg}, State) ->
          gen_tcp:send(Socket, Packet)
          …
      handle_msg({tcp, Socket, Data}, State) ->
          decode_data(Data2, State)
          …

      What happens if gen_tcp:send/2 blocks?
  43. Example #3
     1. Validate that gen_tcp can actually block.
     2. Mitigate by using the gen_tcp option send_timeout.
     3. Fix the problem by adding back-pressure (the joys of unbounded queues!).
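The mitigation in step 2 is a connect-time socket option; host, port, and timeout values here are illustrative, and handle_slow_peer/1 is a hypothetical handler:

```erlang
%% Without send_timeout, gen_tcp:send/2 can block indefinitely once the
%% peer stops reading and the kernel send buffer fills up. The timeout
%% turns that into an error the caller can handle.
{ok, Socket} = gen_tcp:connect("example.com", 4242,
                               [binary,
                                {send_timeout, 5000},          %% fail sends after 5s
                                {send_timeout_close, true}]),  %% close socket on timeout
case gen_tcp:send(Socket, <<"payload">>) of
    ok               -> ok;
    {error, timeout} -> handle_slow_peer(Socket);
    {error, Reason}  -> {error, Reason}
end.
```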
  44. Example #3

      check(Tid, MaxBacklog) ->
          case increment(Tid, MaxBacklog) of
              [MaxBacklog, MaxBacklog] ->
                  false;
              [_, Value] when Value =< MaxBacklog ->
                  true;
              {error, tid_missing} ->
                  false
          end.

      decrement(Tid) ->
          safe_update_counter(Tid, {2, -1, 0, 0}).

      increment(Tid, MaxBacklog) ->
          safe_update_counter(Tid, [{2, 0}, {2, 1, MaxBacklog, MaxBacklog}]).
  45. Example #3

      1> F = fun Loop() ->
      1>     [{backlog, Backlog}] = ets:tab2list(anchor_backlog),
      1>     {message_queue_len, Messages} =
      1>         process_info(whereis(anchor_server), message_queue_len),
      1>     io:format("~p: ~p ~p~n", [time(), Backlog, Messages]),
      1>     timer:sleep(1000),
      1>     Loop()
      1> end.
      #Fun<erl_eval.44.90072148>
      2> F().
      {17,4,31}: 5 1
      {17,4,32}: 1 0
      {17,4,33}: 6 0
  46. Tips
     Have your debugging tools ready on production nodes!

      %% monitoring
      {lager, "2.0.0", {git, "http://github.com/basho/lager.git", {tag, "2.0.0"}}},
      {lager_logstash, "0.1.3", {git, "https://github.com/rpt/lager_logstash.git", {tag, "0.1.3"}}},
      {riak_sysmon, "1.1.3", {git, "http://github.com/lpgauth/riak_sysmon", {branch, "adgear"}}},
      {statsderl, "0.3.4", {git, "http://github.com/lpgauth/statsderl.git", {branch, "adgear"}}},
      {vmstats, "0.3.4", {git, "http://github.com/lpgauth/vmstats.git", {branch, "adgear"}}},

      %% debuggers
      {eflame, ".*", {git, "http://github.com/lpgauth/eflame.git", {branch, "adgear"}}},
      {eper, ".*", {git, "http://github.com/massemanet/eper.git", {branch, "master"}}},
      {recon, ".*", {git, "http://github.com/ferd/recon.git", {branch, "master"}}},
      {timing, ".*", {git, "http://github.com/lpgauth/timing.git", {branch, "master"}}}
  47. Tips
     Take the time to build your own tools!
     1. If you find yourself repeating the same commands often, write a script!
     2. Add debugging functions to your modules:
        • an ETS table tid accessor
        • a gen_server state accessor
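Such accessors can be as small as this (module, table, and server names are made up; sys:get_state/1 is available since OTP R16B01):

```erlang
-module(my_server_debug).
-export([table_id/0, server_state/0]).

%% Tid of a (hypothetical) ETS table owned by the server, so it can be
%% inspected from a remote shell with ets:i/1 or ets:tab2list/1.
table_id() ->
    ets:info(my_server_table, id).

%% Snapshot of a running gen_server's state, no instrumentation needed.
server_state() ->
    sys:get_state(my_server).
```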
  48. Thank you! github: lpgauth / twitter: lpgauth (comic: dilbert.com)
