Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Monitoring
Complex Systems
Keeping Your Head on Straight
in a Hard World
I do things to/with
computers.
I build real-time
systems.
I build fault-
tolerant systems.
I build critical
systems.
AdRoll
Less this.
More this.
Engineering + Mathematics = ads
Engineering + Mathematics = ads
(you’re welcome)
R E A L - T I M E
B I D D I N G
The Problem Domain
• Low latency ( < 100 ms per transaction)
The Problem Domain
• Low latency ( < 100 ms per transaction)
• Firm real-time system
The Problem Domain
• Low latency ( < 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million trans...
The Problem Domain
• Low latency ( < 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million trans...
I build
Complex Systems
Complex Systems
• Non-linear feedback
• Tightly coupled to external systems
• Difficult to model, understand
Bad things happen when
Complex Systems fail.
Humans are bad at predicting the
performance of complex systems(…).
Our ability to create large and
complex systems fools ...
Complex Systems often
create worse problems
than those they solve.
The key challenge to
sustaining a complex
system is maintaining our
understanding of it.
What can be done?
Compile-time guarantees
are not sufficient.
don’t scrimp on them, though
Compile-time guarantees
are not sufficient.
Ahead of time verification is
not sufficient.
don’t scrimp on these, either
Ahead of time verification is
not sufficient.
We need insight into the
running system.
• VM killers
What are we looking for?
• VM killers
• Application performance regressions
What are we looking for?
• VM killers
• Application performance regressions
• Abnormal application behavior
What are we looking for?
• VM killers
• Application performance regressions
• Abnormal application behavior
• Surprises
What are we looking for?
INSTRUMENTATION
BEAM is ready to play.
erlang:memory/1
erlang:memory/1
• ets
• binary
• atom
• total
• processes
• system
erlang:statistics/1
erlang:statistics/1
• run_queue
• garbage_collection
• io
erlang:system_info/1
erlang:system_info/1
• port_count
• process_count
• *_limit
What about
our own work?
Exometer
Important Terms
metric a measurement
entry a receiver and aggregator of metrics
reporter that which samples entries period...
exometer
• Responsive upstream (Ulf Wiger never sleeps?)
• Metric collection, aggregation and reporting
decoupled.
• Stati...
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},
{[erlang,...
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},
{[erlang, statistics],
{fu...
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},
{[erlang, statistics],
{fu...
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},
{[erlang, statistics],
{fu...
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},
{[erlang, statistics],
{fu...
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang,statistics,run_que...
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang,statistics,run_que...
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang,statistics,run_que...
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},
{exometer_report_statsd, [er...
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},
{exometer_report_statsd, [er...
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},
{exometer_report_statsd, [er...
1> exometer:new([a, histogram], histogram).
ok
2> exometer:get_value([a, histogram]).
{ok,[{n,0},
{mean,0},
{min,0},
{max,...
These are all loosely
coupled at runtime.
Configuration is static, but
you can adapt it on the fly.
Creating Reporters
•Very easy to add your own reporters and
entries.
•Reporters and entries can be proprietary.
Just have ...
Why not…
…folsom?
…statman?
…vmstat?
Okay, great. We have
instrumentation.
Now what?
MONITORING
This is the
hard part.
Visualization
Alerting
Analysis
Visualization tells you
how things look but
not why.
Periodic Bid Timeouts
Consistent System Load
Correlated Network Traffic Spikes
Correlated Run Queue Spikes
What happened?
• Scheduler threads were locked to CPUs
• Background process comes on every 20
minutes, consumes a lot of CPU time
• No cp...
Alerting tells you that
something happened,
but not why.
A
normal
day.
Wat?
That’s
some
cliff.
Timeouts look good.
Errors
prior
are
okay.
What happened?
“Uh, hey guys, you know
Facebook is down, right?”
Analysis gives you why
but only if you know
how to ask for what.
The
memory
use of a
bidder.
ಠ_ಠ
It’s all
binaries.
Not in processes.
Not in ETS.
Come on now.
What happened?
A jiffy bug.
A jiffy bug.
A byte here, a byte there eventually it turns into real memory.
Okay, great. We
have monitoring and
instrumentation.
Now all our problems
are solved, right?
Not
quite.
Instruments make up
for our lack of
insight.
Monitoring
makes up
for our
frailty.
No solution is perfect.
Instruments may
be misleading.
Instruments may be
overwhelming.
Instruments
may be
inaccurate.
Instruments may
be ignored.
What can
be done?
A little
paranoia never hurt
anyone.
Use glass displays.
Train.
Keep sight of
the main goal.
Recognize that
even simple
systems go
wrong
sometimes.
Have resources
you’re willing
to sacrifice.
Thanks, folks!
<3
@bltroutwine
Upcoming SlideShare
Loading in …5
×

Monitoring Complex Systems: Keeping Your Head on Straight in a Hard World

1,767 views

Published on

This talk will provide motivation for the extensive instrumentation of complex computer systems and make the argument that such systems. This talk will provide practical starting points in Erlang projects and maintain a perspective on the human organization around the computer system. Brian will focus on getting started with instrumentation in a systematic way and follow up with the challenge of interpreting and acting on metrics emitted from a production system in a way which does not overwhelm operators’ ability to effectively control or prioritize faults in the system. He’ll use historical examples and case studies from my work to keep the talk anchored in the practical.

Talk objectives:

Brian hopes to convince the audience of two things:

* that monitoring and instrumentation is an essential component of any long-lived system and

* that it's not so hard to get started, after all.

He’ll keep a clear-eyed view of what works and is difficult in practice so that the audience can make a reasoned decision after the talk.

Target audience:

This talk would appeal to engineers with long-running production employments, operations folks and Erlangers in general.

Published in: Software

Monitoring Complex Systems: Keeping Your Head on Straight in a Hard World

  1. 1. Monitoring Complex Systems Keeping Your Head on Straight in a Hard World
  2. 2. I do things to/with computers.
  3. 3. I build real-time systems.
  4. 4. I build fault- tolerant systems.
  5. 5. I build critical systems.
  6. 6. AdRoll
  7. 7. Less this.
  8. 8. More this.
  9. 9. Engineering + Mathematics = ads
  10. 10. Engineering + Mathematics = ads (you’re welcome)
  11. 11. R E A L - T I M E B I D D I N G
  12. 12. The Problem Domain • Low latency ( < 100 ms per transaction)
  13. 13. The Problem Domain • Low latency ( < 100 ms per transaction) • Firm real-time system
  14. 14. The Problem Domain • Low latency ( < 100 ms per transaction) • Firm real-time system • Highly concurrent (~2 million transactions per second, peak)
  15. 15. The Problem Domain • Low latency ( < 100 ms per transaction) • Firm real-time system • Highly concurrent (~2 million transactions per second, peak) • Global, 24/7 operation
  16. 16. I build Complex Systems
  17. 17. Complex Systems • Non-linear feedback • Tightly coupled to external systems • Difficult to model, understand
  18. 18. Bad things happen when Complex Systems fail.
  19. 19. Humans are bad at predicting the performance of complex systems(…). Our ability to create large and complex systems fools us into believing that we’re also entitled to understand them. C A R L O S B U E N O “ M AT U R E O P T I M I Z AT I O N H A N D B O O K ”
  20. 20. Complex Systems often create worse problems than those they solve.
  21. 21. The key challenge to sustaining a complex system is maintaining our understanding of it.
  22. 22. What can be done?
  23. 23. Compile-time guarantees are not sufficient.
  24. 24. don’t scrimp on them, though Compile-time guarantees are not sufficient.
  25. 25. Ahead of time verification is not sufficient.
  26. 26. don’t scrimp on these, either Ahead of time verification is not sufficient.
  27. 27. We need insight into the running system.
  28. 28. • VM killers What are we looking for?
  29. 29. • VM killers • Application performance regressions What are we looking for?
  30. 30. • VM killers • Application performance regressions • Abnormal application behavior What are we looking for?
  31. 31. • VM killers • Application performance regressions • Abnormal application behavior • Surprises What are we looking for?
  32. 32. INSTRUMENTATION
  33. 33. BEAM is ready to play.
  34. 34. erlang:memory/1
  35. 35. erlang:memory/1 • ets • binary • atom • total • processes • system
  36. 36. erlang:statistics/1
  37. 37. erlang:statistics/1 • run_queue • garbage_collection • io
  38. 38. erlang:system_info/1
  39. 39. erlang:system_info/1 • port_count • process_count • *_limit
  40. 40. What about our own work?
  41. 41. Exometer
  42. 42. Important Terms metric a measurement entry a receiver and aggregator of metrics reporter that which samples entries periodically and ships them to another system subscription the definition of the regular interval on which reporters sample entries
  43. 43. exometer • Responsive upstream (Ulf Wiger never sleeps?) • Metric collection, aggregation and reporting decoupled. • Static and dynamic configuration. • Very low, predictable runtime overhead.
  44. 44. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  45. 45. {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]}, Defining Entries
  46. 46. {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]}, Defining Entries
  47. 47. {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]}, Defining Entries
  48. 48. {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]}, Defining Entries
  49. 49. { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, Defining Reporters
  50. 50. { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, Defining Reporters
  51. 51. { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, Defining Reporters
  52. 52. { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} ]} ]} Defining Subscriptions
  53. 53. { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} ]} ]} Defining Subscriptions
  54. 54. { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} ]} ]} Defining Subscriptions
  55. 55. 1> exometer:new([a, histogram], histogram). ok 2> exometer:get_value([a, histogram]). {ok,[{n,0}, {mean,0}, {min,0}, {max,0}, {median,0}, {50,0}, {75,0}, {90,0}, {95,0}, {99,0}, {999,0}]} 3> exometer_report:add_reporter( exometer_report_tty, []). ok 4> exometer_report:subscribe( exometer_report_tty, [a, histogram], mean, 1000, []). ok exometer_report_tty: a_histogram_mean 1393627070:0 exometer_report_tty: a_histogram_mean 1393627071:0 exometer_report_tty: a_histogram_mean 1393627072:0 Doing it dynamically.
  56. 56. These are all loosely coupled at runtime.
  57. 57. Configuration is static, but you can adapt it on the fly.
  58. 58. Creating Reporters •Very easy to add your own reporters and entries. •Reporters and entries can be proprietary. Just have to be loaded at runtime. •Authors are responsive to issues.
  59. 59. Why not… …folsom? …statman? …vmstat?
  60. 60. Okay, great. We have instrumentation.
  61. 61. Now what?
  62. 62. MONITORING
  63. 63. This is the hard part.
  64. 64. Visualization
  65. 65. Alerting
  66. 66. Analysis
  67. 67. Visualization tells you how things look but not why.
  68. 68. Periodic Bid Timeouts
  69. 69. Consistent System Load
  70. 70. Correlated Network Traffic Spikes
  71. 71. Correlated Run Queue Spikes
  72. 72. What happened?
  73. 73. • Scheduler threads were locked to CPUs • Background process comes on every 20 minutes, consumes a lot of CPU time • No cpu-shield was set up on our production systems • OS bumped a scheduler thread off its CPU, backing up its run-queue
  74. 74. Alerting tells you that something happened, but not why.
  75. 75. A normal day.
  76. 76. Wat?
  77. 77. That’s some cliff.
  78. 78. Timeouts look good.
  79. 79. Errors prior are okay.
  80. 80. What happened?
  81. 81. “Uh, hey guys, you know Facebook is down, right?”
  82. 82. Analysis gives you why but only if you know how to ask for what.
  83. 83. The memory use of a bidder.
  84. 84. ಠ_ಠ
  85. 85. It’s all binaries.
  86. 86. Not in processes.
  87. 87. Not in ETS.
  88. 88. Come on now.
  89. 89. What happened?
  90. 90. A jiffy bug.
  91. 91. A jiffy bug. A byte here, a byte there eventually it turns into real memory.
  92. 92. Okay, great. We have monitoring and instrumentation.
  93. 93. Now all our problems are solved, right?
  94. 94. Not quite.
  95. 95. Instruments make up for our lack of insight.
  96. 96. Monitoring makes up for our frailty.
  97. 97. No solution is perfect.
  98. 98. Instruments may be misleading.
  99. 99. Instruments may be overwhelming.
  100. 100. Instruments may be inaccurate.
  101. 101. Instruments may be ignored.
  102. 102. What can be done?
  103. 103. A little paranoia never hurt anyone.
  104. 104. Use glass displays.
  105. 105. Train.
  106. 106. Keep sight of the main goal.
  107. 107. Recognize that even simple systems go wrong sometimes.
  108. 108. Have resources you’re willing to sacrifice.
  109. 109. Thanks, folks! <3 @bltroutwine

×