This talk will provide motivation for the extensive instrumentation of complex computer systems and make the argument that such instrumentation is essential. It will offer practical starting points in Erlang projects and maintain a perspective on the human organization around the computer system. Brian will focus on getting started with instrumentation in a systematic way and follow up with the challenge of interpreting and acting on metrics emitted from a production system in a way which does not overwhelm operators’ ability to effectively control or prioritize faults in the system. He’ll use historical examples and case studies from his work to keep the talk anchored in the practical.
Talk objectives:
Brian hopes to convince the audience of two things:
* that monitoring and instrumentation are essential components of any long-lived system and
* that it's not so hard to get started, after all.
He’ll keep a clear-eyed view of what works and what is difficult in practice so that the audience can make a reasoned decision after the talk.
Target audience:
This talk would appeal to engineers responsible for long-running production deployments, operations folks, and Erlangers in general.
Monitoring Complex Systems - Chicago Erlang, 2014 - Brian Troutwine
Imagine being responsible for monitoring 100 servers. Now imagine 1000. Each server has 100 different things to keep track of. What do you pay attention to and what do you ignore? What is important? In this talk Brian will show how Erlang can be used to capture more information without compromising clarity — i.e. to keep track of the forest without losing sight of the trees!
10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll - Brian Troutwine
This is the talk I gave at Erlang Factory SF Bay Area 2014. In it I discussed the instrumentation-by-default approach taken by the AdRoll real-time bidding team, the technical details of the libraries we use, and lessons learned in adapting an organization to deal with the onslaught of data from instrumentation.
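"Instrumentation by default" can be approximated in any language by making measurement the path of least resistance. A minimal Python sketch of the idea — the names `METRICS`, `timed`, and `evaluate_bid` are illustrative inventions, not AdRoll's actual API:

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metrics store: call counters and latency samples.
METRICS = {"counters": defaultdict(int), "timers": defaultdict(list)}

def timed(name):
    """Decorator: every call increments a counter and records its latency,
    so code is instrumented by default rather than as an afterthought."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                METRICS["counters"][name] += 1
                METRICS["timers"][name].append(time.perf_counter() - start)
        return wrapper
    return decorate

@timed("bid.evaluate")
def evaluate_bid(price):
    return price * 0.95  # stand-in for real bidding logic

evaluate_bid(100)
evaluate_bid(200)
print(METRICS["counters"]["bid.evaluate"])  # 2
```

In a real system the `METRICS` dict would be replaced by a library such as exometer, which also handles aggregation and shipping samples off-box.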
From list sorting to network routing, and from hash tables to capacity planning, a programmer's daily work is filled with probability. We use probabilistic algorithms, data structures, and systems constantly, often without even thinking about it. Experienced engineers reach for probabilistic algorithms frequently and intentionally, especially when building systems of serious scale. How do probabilistic algorithms actually work in practice? And how do we know they'll be safe and reliable in our critical production systems? We'll address those questions, explore a few algorithms, and see why "with high probability" is often better than "exactly".
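A canonical example of "with high probability" beating "exactly" is the Bloom filter: set membership in constant space, at the cost of occasional false positives (but never false negatives). A minimal sketch; real deployments would size `m` and `k` from the desired error rate:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership tests may return false positives,
    but an item that was added is always found."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k        # m bits, k hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k independent positions by salting one strong hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print("alice" in bf)  # True: added items are always found
```

The one-sided error is the whole trick: a "no" is certain, a "yes" is merely very likely, and the false-positive rate is tunable by spending more bits.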
Strata 2014 Talk: Tracking a Soccer Game with Big Data - Srinath Perera
Mobile devices, sensors and GPS are driving the demand to handle big data in both batch and real time. This presentation discusses how we used complex event processing (CEP) and MapReduce based technologies to track and process data from a soccer match as part of the annual DEBS event processing challenge. In 2013, the challenge included a data set generated by a real soccer match in which sensors were placed in the soccer ball and players’ shoes. This session will review how we used CEP to implement the DEBS challenge and achieved throughput in excess of 100,000 events/sec. It also will examine how we extended the solution to conduct batch processing using business activity monitoring (BAM) within the same framework, enabling users to obtain both instant analytics and more detailed batch-processing-based results.
Deadlock is a situation where a set of processes are blocked because each process is holding a resource and waiting for another resource acquired by some other process.
Mutual Exclusion: one or more resources are non-sharable (only one process can use a resource at a time)
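The classic two-lock deadlock — each thread holding one resource while waiting for the other's — and the lock-ordering discipline that prevents it can be sketched in Python:

```python
import threading

# Two non-sharable resources. Deadlock arises when thread A holds lock_a
# and waits for lock_b while thread B holds lock_b and waits for lock_a.
# The fix below breaks the "circular wait" condition: every thread
# acquires locks in one agreed-upon global order.
lock_a, lock_b = threading.Lock(), threading.Lock()
results = []

def worker(name):
    # Always acquire lock_a before lock_b, in every thread.
    with lock_a:
        with lock_b:
            results.append(name)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # both workers finish; no circular wait is possible
```

Reversing the acquisition order in one of the threads would reintroduce the possibility of deadlock, which is why lock ordering is usually enforced as a project-wide convention.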
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce - thumbtacktech
This talk focuses on building a system from scratch, showing how to perform analytical queries in near real time and still get the benefits of Cassandra’s high-performance database engine. The key subjects of my speech are:
● The splendors and miseries of NoSQL
● Apache Cassandra use-cases
● Difficulties of using MapReduce directly in Cassandra
● Amazon cloud solutions: Elastic MapReduce and S3
● “real-enough” time analysis
In particular the talk dives into ways of handling different kinds of semi-ad-hoc queries when using Cassandra and the pitfalls of designing a schema around a specific analytics use case. Some attention will be paid to dealing with time-series data in particular, which can present a real problem when using Column-Family or Key-Value store databases.
The Cassandra database is an excellent choice when you need scalability and high availability without compromising performance. Cassandra’s linear scalability, proven fault tolerance and tunable consistency, combined with its being optimized for write traffic, make it an attractive choice for performing structured logging of application and transactional events. But using a columnar store like Cassandra for analytical needs poses its own problems, problems we solved by careful construction of Column Families combined with diplomatic use of Hadoop.
This tutorial focuses on building a similar system from scratch, showing how to perform analytical queries in near real time while still getting the benefits of Cassandra’s high-performance database engine. The key subjects are:
• The splendors and miseries of NoSQL
• Apache Cassandra use cases
• Difficulties of using Map/Reduce directly in Cassandra
• Amazon cloud solutions: Elastic MapReduce and S3
• “Real-enough” time analysis
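One recurring schema trick for time-series data in Column-Family stores is bucketing: folding part of the timestamp into the partition key so no single row grows without bound. A minimal sketch of the idea — the daily bucket and key format are illustrative assumptions, not the talk's actual schema:

```python
from datetime import datetime, timezone

def row_key(sensor_id, ts, bucket="%Y-%m-%d"):
    """Compose a partition key of (sensor, day) so each day's events land
    in a bounded row instead of one ever-growing row per sensor."""
    return f"{sensor_id}:{ts.strftime(bucket)}"

ts = datetime(2014, 3, 7, 14, 30, tzinfo=timezone.utc)
print(row_key("sensor-17", ts))  # sensor-17:2014-03-07
```

Within each bucketed row, events are then clustered by timestamp, which keeps both writes and range scans over a time window cheap.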
Where the wild things are - Benchmarking and Micro-Optimisations - Matt Warren
You don’t want to prematurely optimise, but sometimes you do want to optimise; the question is where to start. Profiling and benchmarking can help you figure out what your application is doing and where performance problems could arise - allowing you to find (and fix!) them before your customers do.
If you aren’t already benchmarking your code this talk will offer some starting points. We’ll look at how to accurately benchmark in .NET and things to avoid. Along the way we’ll also discover some surprising code optimisations!
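The talk's examples are .NET, but the core discipline — repeat the measurement, take the best run, and beware timer resolution and dead-code elimination — is language-agnostic. A minimal sketch using Python's standard-library `timeit`:

```python
import timeit

def join_strings(n=1000):
    # The code under measurement; returning the result guards against
    # the work being optimised away.
    return "".join(str(i) for i in range(n))

# repeat() runs the timing loop several times; the *minimum* is the usual
# estimate, since noise (GC, scheduling) only ever makes runs slower.
samples = timeit.repeat(join_strings, number=100, repeat=5)
best = min(samples) / 100  # seconds per call
print(f"best: {best * 1e6:.1f} us/call")
```

Dedicated harnesses (BenchmarkDotNet in .NET, pyperf in Python) add warmup, statistical analysis, and protection against more of the pitfalls the talk covers.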
In this talk about performance for the 2018-01-25 Dallas Oracle Users Group meeting, Cary Millsap talks about application design, ways to control Oracle sql_trace, measurement intrusion, and function interposition.
Feature Engineering - Getting most out of data for predictive models - TDC 2017 - Gabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How do you identify the most predictive attributes of a dataset? What features can we generate to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential of the features of a dataset, increasing the flexibility, simplicity and accuracy of the models. We will cover the analysis of the distribution of features and their correlations, and the transformation of numeric attributes (scaling, normalization, log-based transformation, binning), categorical attributes (one-hot encoding, feature hashing), temporal attributes (date/time), and free-text attributes (text vectorization, topic modeling).
Python, Scikit-learn, and Spark SQL examples will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
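Two of the transformations listed above — one-hot encoding and a log-based transform — can be sketched in plain Python; real projects would typically use Scikit-learn's equivalents instead:

```python
import math

def one_hot(value, vocabulary):
    """Encode a categorical value as a 0/1 vector over a fixed vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

def log_transform(x):
    """Compress a heavy-tailed non-negative numeric feature;
    log1p keeps x = 0 valid (log1p(0) == 0)."""
    return math.log1p(x)

colors = ["red", "green", "blue"]
print(one_hot("green", colors))    # [0, 1, 0]
print(log_transform(0))            # 0.0
```

The point of both is the same: reshape raw attributes into a form where the model's assumptions (linearity, distance metrics, roughly symmetric distributions) actually hold.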
(Moonconf 2016) Fetching Moths from the Works: Correctness Methods in Software - Brian Troutwine
We live in a nice world. There’s a wealth of historical thought on achieving correctness in software–shipping code that does only what is intended, not less and not more–and there are a whole bunch of methods available to us as practitioners. Some of these are hard to apply, some are easy. For instance, case testing is widely used and considered standard practice. Property testing is understood to exist but not widely used. The application of advanced logics? Way out there.
If you look around you’ll find a lot of software fails a lot of the time. Why is that?
In this talk I’ll give an overview of the methods for producing correct systems and will discuss each in its historical context. With each method, we’ll keep an eye out for present applications and the difficulty of doing so. We’ll discuss why there’s so much buggy software in the world. I expect there will be talk of spaceships a bit. By the end of this talk you ought to be able to make reasoned decisions about applying correctness methods in your own work and have a good shot at building better software.
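Property testing, mentioned above as understood-but-underused, needs no framework to demonstrate: generate random inputs and check invariants that must hold for all of them. A hand-rolled sketch (tools like QuickCheck or Hypothesis add generation strategies and failure shrinking on top):

```python
import random

def naive_sort(xs):
    # The code under test; a stand-in for any sort implementation.
    return sorted(xs)

def check_sort_properties(trials=200):
    """Random inputs, universal invariants: ordered output that is a
    permutation of the input. Fails loudly on the first counterexample."""
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        out = naive_sort(xs)
        assert all(a <= b for a, b in zip(out, out[1:]))  # ordered
        assert sorted(out) == sorted(xs)                  # same multiset
    return True

print(check_sort_properties())  # True
```

Compare with case testing, which checks a handful of hand-picked inputs: properties quantify over the input space, which is why they catch edge cases the author never thought to write down.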
Getting Uphill on a Candle: Crushed Spines, Detached Retinas and One Small Step - Brian Troutwine
Looking back through history, we often view NASA’s early mission in terms of “getting to the Moon”, discussing how this or that program served the purpose of answering Kennedy’s challenge. This is wrong-headed. In this talk I will discuss aeronautics research beginning with the Wright Brothers and ending with the first Shuttle launch in 1981. We’ll see how NASA is an organization whose primary mission is basic research and development in aeronautics for the benefit of the public at large and space exploration. We’ll see how the Lunar Program was a focusing of research to a practical, political aim which built off decades of basic research and necessarily side-lined other programs. It’s my aim to convince you that Moonshot projects cannot be considered independently of their organizations and their history.
The Charming Genius of the Apollo Guidance Computer - Brian Troutwine
The Apollo Project was the first flight system to deploy with a digital, general-purpose computer made of integrated circuits at its core: the Apollo Guidance Computer (AGC). It was a complete research project: no IC computer had run continuously for more than a few hours, sophisticated programming techniques were unknown, and the interactive human/computer interface had to be invented and made to appeal to astronauts opposed to machine interference in flight operations.
In this talk I'll give the historical context for the AGC, discuss its initial design and the evolution of this design as the Apollo Project progressed. We'll do a deep-dive on the machine architecture and note how tight integration with a special-purpose vehicle admitted incredibly sophisticated behaviour from a primitive machine. We'll further discuss the human/computer interface for the AGC, how the astronaut's flight roles dictated the computer's role and vice versa. Motivating examples from select Apollo flights will be used.
Throughout, we'll keep an eye on lessons to be gleaned from the experience of engineering the AGC and how we can adapt these lessons to modern computer systems in mission-critical deployments.
Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over - Brian Troutwine
Building computer systems that are reliable is hard. The functional programming community has invested a lot of time and energy into up-front-correctness guarantees: types and the like. Unfortunately, absolutely correct software is time-consuming to write and expensive as a result. Fault-tolerant systems achieve system-total reliability by accepting that sub-components will fail and planning for that failure as a first-class concern of the system. As companies embrace the wave of "as-a-service" architectures, failures of sub-systems become a more pressing concern. Using examples from heavy industry, aeronautics and telecom systems, this talk will explore how you can design for fault-tolerance and how functional programming techniques get us most of the way there.
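Planning for sub-component failure can be sketched outside Erlang too: a supervisor that restarts a failing task up to a limit, on the assumption that many faults are transient. This is a toy analogue of OTP's one_for_one strategy, not the talk's actual code:

```python
def supervise(task, max_restarts=3):
    """Run task; on a crash, restart it, up to max_restarts times.
    Failure of the child is expected and planned for, not exceptional."""
    attempts = 0
    while True:
        try:
            return task()
        except Exception:
            attempts += 1
            if attempts > max_restarts:
                raise  # escalate: let the layer above decide

calls = {"n": 0}

def flaky():
    # Fails twice with a transient fault, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient fault")
    return "ok"

print(supervise(flaky))  # "ok", after two restarts
```

The restart limit matters as much as the restart itself: a fault that survives several clean restarts is not transient, and the right move is to escalate rather than loop forever.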
Let it crash! The Erlang Approach to Building Reliable Services - Brian Troutwine
In this talk, using the Erlang hacker's semi-official motto "Let it Crash!" as a lens, I'll speak to how radical simplicity of implementation, straightforward runtime characteristics and discoverability of the running system lead to computer systems which have great success in networked, always-on deployments.
I will argue that while Erlang natively implements many features which aid the construction of such systems--functional programming language semantics, lack of global mutable state, first-class networking, for instance--these characteristics can be replicated in any computer system, as a part of initial design of new systems or the gradual evolution of an existing project. I'll discuss common design patterns and anti-patterns, using my own work at AdRoll and project experience reports from a variety of fields--both successes and failures--to advance my argument.
This will be a very practical talk and will be accessible to engineers of all backgrounds.
Automation With Humans in Mind: Making Complex Systems Predictable, Reliable ... - Brian Troutwine
I believe that our current approach to designing software systems is driving society in a bad direction. In particular, I believe we are creating a society predicated on automation which is oriented to be serviced by humans or, requiring no service, is simply in control of humans. Ignoring the dystopian overtones of this, I argue that this is a technically flawed approach, that such automation is less reliable, less flexible and less robust through time than a system designed with humans as the controlling party in mind. I will argue--with a mix of personal experience, reference to academic literature and historical examples--that complex systems designed with human control in mind are more lasting through time, more technically excellent and just generally more useful. I will further argue that a re-orientation toward human supremacy in computer systems is especially important as we begin to tightly couple western civilization's technology to the internet, being the Internet of Things. I'll talk a bit about the political and social implications, as well, after I've made a purely technical argument.
Instrumentation as a Living Documentation: Teaching Humans About Complex Systems - Brian Troutwine
Instrumentation of complex systems is necessary and addresses the issues of static documentation of said systems. Instrumentation has flaws of its own, but they are resolvable with an intentional kind of documentation.
Given at Write the Docs, Portland OR 2014.
Presentation slides given at Erlounge Bay Area, January 2014. The deck is a brief introduction to the Erlang library exometer and gives an overview of my work at AdRoll to increase monitoring and insight of the running real-time bidding system.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
13-15. The Problem Domain
• Low latency (< 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million transactions per second, peak)
• Global, 24/7 operation
19. Humans are bad at predicting the performance of complex systems (…). Our ability to create large and complex systems fools us into believing that we're also entitled to understand them.
Carlos Bueno, "Mature Optimization Handbook"
42. Important Terms
• metric: a measurement
• entry: a receiver and aggregator of metrics
• reporter: that which samples entries periodically and ships them to another system
• subscription: the definition of the regular interval on which reporters sample entries
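In exometer these terms map onto a handful of calls. A sketch follows; the metric name, reporter options, datapoint and interval are illustrative assumptions, not values from the talk:

```erlang
%% Create an entry and feed it metrics (measurements).
exometer:new([bid, latency], histogram),
exometer:update([bid, latency], 42),

%% Attach a reporter (here the bundled statsd reporter; the host and
%% port are placeholder values).
exometer_report:add_reporter(exometer_report_statsd,
                             [{hostname, "localhost"}, {port, 8125}]),

%% A subscription: sample the 95th-percentile datapoint of the entry
%% every 10 seconds and ship it via the reporter.
exometer_report:subscribe(exometer_report_statsd,
                          [bid, latency], 95, 10000).
```

Note how each concept is decoupled: the entry aggregates updates on its own, and reporters only see whatever datapoints their subscriptions sample.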
43. exometer
• Responsive upstream (Ulf Wiger never sleeps?)
• Metric collection, aggregation and reporting decoupled.
• Static and dynamic configuration.
• Very low, predictable runtime overhead.
58. Creating Reporters
• Very easy to add your own reporters and entries.
• Reporters and entries can be proprietary. They just have to be loaded at runtime.
• Authors are responsive to issues.
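To give a sense of how small a custom reporter can be, here is a sketch of one that simply prints each sampled datapoint. The module name and output format are invented for illustration; the callback shapes follow exometer's `exometer_report` behaviour:

```erlang
-module(console_reporter).
-behaviour(exometer_report).

%% The behaviour also requires exometer_subscribe/5, exometer_unsubscribe/4,
%% exometer_info/2, exometer_call/3, exometer_cast/2, exometer_terminate/2,
%% exometer_setopts/4 and exometer_newentry/2, which can be simple no-ops
%% for a reporter this small; they are elided here.
-export([exometer_init/1, exometer_report/5]).

%% Called once when the reporter is added via exometer_report:add_reporter/2.
exometer_init(Opts) ->
    {ok, Opts}.

%% Called on each subscription interval with the sampled value.
exometer_report(Metric, DataPoint, _Extra, Value, State) ->
    io:format("~p ~p = ~p~n", [Metric, DataPoint, Value]),
    {ok, State}.
```

Because reporters are just modules loaded at runtime, a proprietary one can ship the same way, pointed at whatever internal system you need.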
73. • Scheduler threads were locked to CPUs
• A background process runs every 20 minutes and consumes a lot of CPU time
• No cpu-shield was set up on our production systems
• The OS bumped a scheduler thread off its CPU, backing up its run-queue