Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
T E N B I L L I O N A D AY, O N E - H U N D R E D M I L L I S E C O N D S P E R

MONITORING REAL-TIME
B I D D I N G AT A D...
I DO THINGS WITH/TO COMPUTERS.
I

CARE ABOUT RELIABLE, COMPLEX

AND CRITICAL SYSTEMS.
ADROLL
LESS THIS
MORE THIS
WE’RE AN
ADTECH
C O M PA N Y .
R E TA R G E T I N G ,

IT’S

A L L A B O U T D ATA .
• Our customers want to show ads to people who might

care to see them.
• We partner with “exchanges” to enter ad-slot auc...
REAL-TIME
BIDDING
the nature of the problem domain:
• Low latency ( < 100ms per transaction )
• Firm real-time system
• Highly concurrent ( ...
“HUMANS

ARE BAD AT PREDICTING THE

PERFORMANCE OF COMPLEX SYSTEMS(…).

OUR

ABILITY TO CREATE LARGE AND COMPLEX

SYSTEMS ...
AHEAD

OF TIME

V E R I F I C AT I O N I S N O T
SUFFICIENT.
!

(DON’T SCRIMP ON IT, THOUGH.)
IGNORANCE AND
COMPLEX INTERACTIONS
WITH EXTERNAL SYSTEMS
ARE WHY WE CAN’T
H AV E N I C E T H I N G S .
AT

SCALE, EVEN RARE

EVENTS HAPPEN
FREQUENTLY.
AT

SCALE, BAD

THINGS HAPPEN
FA S T E R T H A N
HUMANS CAN
RESPOND.
THE RESULTS ARE
RARELY PRETTY.
THE CAUSES ARE RARELY
QUICK TO DISCOVER.
W H AT
CAN BE
DONE?
WE

H AV E T O O B S E R V E

OUR SYSTEMS WHILE
THEY RUN.
WE CAN BE
SCIENTIFIC
ABOUT THIS
NOT ALL SYSTEMS ARE
MONITORABLE, JUST AS NOT
A L L A R E T E S TA B L E .
!

I T ’ S A M AT T E R O F D E S I G N A N D
OF...
A DESIGN

T H AT I S M O N I T O R A B L E

TA K E S S T O C K O F I T S I N T E R N A L
TOLERANCES, EXTERNAL
I N T E R FA...
AN ENGINEER

IS RESPONSIBLE FOR

DESIGNING AND BUILDING
SYSTEMS.
!

A N O P E R AT O R

IS RESPONSIBLE FOR

RUNNING THEM.
A H E A D - O F - T I M E V E R I F I C AT I O N
—TESTING, TYPE-CHECKING—
GIVES THE ENGINEER
INTUITION ABOUT THE SYSTEM.
IF

YOU DON’T KNOW
HOW THE SYSTEM

S H O U L D B E H AV E Y O U
C A N ’ T S AY H O W I T
SHOULDN’T OR ISN’T.
MONITORING GIVES THE
O P E R AT O R I N F O R M AT I O N
A B O U T T H E B E H AV I O R O F
THE RUNNING SYSTEM.
T H E O P E R AT O R S A N D E N G I N E E R S
ARE OFTEN THE SAME PEOPLE.
THERE’S

A POTENTIAL FOR A POSITIVE

FEEDBACK LOOP OF QUALITY HERE.
“WHILE THE SKILL OF REENTRY WAS EASILY
HANDLED BY AUTOMATED SYSTEMS, THE
PILOT’S PRIMARY FUNCTION EVOLVED TO BE
A REDUNDAN...
exometer
exometer — github.com/Feuerlabs/exometer
• Created by Feuerlabs.
• Responsive upstream (Ulf Wiger never sleeps?)
• Metric ...
I M P O R TA N T T E R M S
• METRIC: a measurement
• ENTRY: a receiver and aggregator of metrics
• REPORTER: an entity whi...
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang...
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang...
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang...
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang...
Defining Entries
{predefined, [
{[erlang,memory],
{function, erlang, memory,
['$dp'], value,[ets,binary]},
[]
},

{[erlang...
Defining Reporters
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang...
Defining Reporters
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang...
Defining Reporters
{ reporters,
[
{ exometer_report_statsd,
[
{hostname, "localhost"},
{port, 8125},
{type_map,
[
{[erlang...
Defining Subscriptions
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},

!
!
...
Defining Subscriptions
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},

!
!
...
Defining Subscriptions
{ report,
[
{ subscribers,
[
{exometer_report_statsd, [erlang, statistics],
run_queue, 1000},

!
!
...
D o i n g i t d y n a m i c a l l y.
1> exometer:new([a, histogram], histogram).
ok
!

2> exometer:get_value([a, histogram...
W H AT A R E W E L O O K I N G F O R ?
• VM killers
• System performance regressions
• Abnormal system behavior
• Surprises
“ A B N O R M A L S Y S T E M B E H AV I O R ” ?
CASE STUDIES
CASE STUDY:
ADROLL RTB TIMEOUT SPIKES
Case Study: AdRoll RTB Timeout Spikes

PERIODIC BID TIMEOUTS
Case Study: AdRoll RTB Timeout Spikes

CONSISTENT SYSTEM LOAD
Case Study: AdRoll RTB Timeout Spikes

C O R R E L AT E D N E T W O R K T R A F F I C S P I K E S
Case Study: AdRoll RTB Timeout Spikes

C O R R E L AT E D R U N Q U E U E S P I K E S
Case Study: AdRoll RTB Timeout Spikes

W H AT

HAPPENED?
Case Study: AdRoll RTB Timeout Spikes
• Scheduler threads were locked to CPUs
• Background process comes on every 20 minut...
CASE STUDY:
EXCHANGE THROTTLING
Case Study: Exchange Throttling

H E A L T H Y PAT T E R N O F B I D R E Q U E S T S
Case Study: Exchange Throttling

THE TROUGH OF THROTTLING
Case Study: Exchange Throttling

BAD

GOOD
Case Study: Exchange Throttling

PROBLEM CONFIRMED WITH EXCHANGE
Case Study: Exchange Throttling

• All other metrics (run-queue, CPU, network IO) were fine.
• Confirmed that no changes h...
Case Study: Exchange Throttling

W H AT

HAPPENED?
Case Study: Exchange Throttling

WE

HIT AN IMPLICIT EXCHANGE LIMIT.
(ARGUABLY,

A BUG.)
LESSONS LEARNED
IT

I S P O S S I B L E T O H AV E T O O L I T T L E

I N F O R M AT I O N .
“(THE

FIREFIGHTERS) TRIED TO BEAT

DOWN THE FLAMES (OF

REACTOR 4). THEY

CHERNOBYL

KICKED AT THE

BURNING GRAPHITE WITH...
IT

IS POSSIBLE TO COLLECT TOO

M U C H I N F O R M AT I O N , O R P R E S E N T
IT BADLY.
“SAFETY

SYSTEMS, SUCH AS WARNING LIGHTS,

ARE NECESSARY, BUT THEY HAVE THE POTENTIAL
FOR DECEPTION.
COMPLEX

(…) ONE OF T...
I N D I R E C T K N O W L E D G E M AY N O T T E L L T H E
W H O L E S T O R Y , O R M AY M A K E Y O U D O U B T
W H AT ’...
“THE

FIRST DISASTER IN SPACE HAD OCCURRED, AND

NO ONE KNEW WHAT HAD HAPPENED.

ON

THE

GROUND, THE FLIGHT CONTROLLERS W...
Things that don’t quite
work like you’d hope.
Problems

• Instrumenting code increases code size.
• While very low, runtime impact is not zero.
• Instrumentation of dep...
Solutions

•Dtrace / systemtap hookups
•Culture of instrumentation by default.
•Tracing BIFs
W O M B AT ?
Instrumentation by default strategies

• Application configuration callback module
• Behaviors callbacks
• Value-at-time f...
W H AT
ABOUT
CULTURE
LOOK FOR
SOLUTIONS,
NOT
S C A P E G O AT S .
DON’T

BE FOOLED BY

OVERCONFIDENCE IN YOUR
WORK.
DOCUMENT
AND DISCUSS
YOUR
FA I L U R E S .
EVERYONE
MUST WORK
T O WA R D T H E
SAME END.
KNOW SOME BASIC
M AT H E M AT I C S .
PREFER LOOSE
COUPLING WHEN
POSSIBLE.
!

BE EXPLICIT ABOUT
TIGHT COUPLING.
MEASURE

MUCH,

P R E S E N T O N LY T H AT
WHICH YOU’RE
C E R TA I N T O N E E D .
MEASURE,
DON’T GUESS.
BE
A C C U R AT E .
QUESTIONS?
<3
@bltroutwine
BRIAN@TROUTWINE.US
Bonus Bibliography!
XIII: The Apollo Flight that Failed

Henry S. F. Cooper, Jr.
David A. Mindell

Digital Apollo: Human a...
Upcoming SlideShare
Loading in …5
×

10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll

8,140 views

Published on

This is the talk I gave at Erlang Factory SF Bay Area 2014. In it I discussed the instrumentation by default approach taken in the AdRoll real-time bidding team, discuss the technical details of the libraries we use and lessons learned to adapt your organization to deal with the onslaught of data from instrumentation.

Published in: Technology
  • Be the first to comment

10 Billion a Day, 100 Milliseconds Per: Monitoring Real-Time Bidding at AdRoll

  1. 1. T E N B I L L I O N A D AY, O N E - H U N D R E D M I L L I S E C O N D S P E R MONITORING REAL-TIME B I D D I N G AT A D R O L L
  2. 2. I DO THINGS WITH/TO COMPUTERS.
  3. 3. I CARE ABOUT RELIABLE, COMPLEX AND CRITICAL SYSTEMS.
  4. 4. ADROLL
  5. 5. LESS THIS
  6. 6. MORE THIS
  7. 7. WE’RE AN ADTECH C O M PA N Y .
  8. 8. R E TA R G E T I N G , IT’S A L L A B O U T D ATA .
  9. 9. • Our customers want to show ads to people who might care to see them. • We partner with “exchanges” to enter ad-slot auctions. • These auctions are executed and finalized while consumer’s webpages load.
  10. 10. REAL-TIME BIDDING
  11. 11. the nature of the problem domain: • Low latency ( < 100ms per transaction ) • Firm real-time system • Highly concurrent ( > 30 billion transactions per day ) • Global, 24/7 operation
  12. 12. “HUMANS ARE BAD AT PREDICTING THE PERFORMANCE OF COMPLEX SYSTEMS(…). OUR ABILITY TO CREATE LARGE AND COMPLEX SYSTEMS FOOLS US INTO BELIEVING THAT WE’RE ALSO ENTITLED TO UNDERSTAND THEM.” –C ARLOS BUENO “ M AT U R E O P T I M I Z AT I O N H A N D B O O K ”
  13. 13. AHEAD OF TIME V E R I F I C AT I O N I S N O T SUFFICIENT. ! (DON’T SCRIMP ON IT, THOUGH.)
  14. 14. IGNORANCE AND COMPLEX INTERACTIONS WITH EXTERNAL SYSTEMS ARE WHY WE CAN’T H AV E N I C E T H I N G S .
  15. 15. AT SCALE, EVEN RARE EVENTS HAPPEN FREQUENTLY.
  16. 16. AT SCALE, BAD THINGS HAPPEN FA S T E R T H A N HUMANS CAN RESPOND.
  17. 17. THE RESULTS ARE RARELY PRETTY.
  18. 18. THE CAUSES ARE RARELY QUICK TO DISCOVER.
  19. 19. W H AT CAN BE DONE?
  20. 20. WE H AV E T O O B S E R V E OUR SYSTEMS WHILE THEY RUN.
  21. 21. WE CAN BE SCIENTIFIC ABOUT THIS
  22. 22. NOT ALL SYSTEMS ARE MONITORABLE, JUST AS NOT A L L A R E T E S TA B L E . ! I T ’ S A M AT T E R O F D E S I G N A N D OF CULTURE.
  23. 23. A DESIGN T H AT I S M O N I T O R A B L E TA K E S S T O C K O F I T S I N T E R N A L TOLERANCES, EXTERNAL I N T E R FA C E S A N D E X P O S E S T H E S E FOR AN O P E R AT O R .
  24. 24. AN ENGINEER IS RESPONSIBLE FOR DESIGNING AND BUILDING SYSTEMS. ! A N O P E R AT O R IS RESPONSIBLE FOR RUNNING THEM.
  25. 25. A H E A D - O F - T I M E V E R I F I C AT I O N —TESTING, TYPE-CHECKING— GIVES THE ENGINEER INTUITION ABOUT THE SYSTEM.
  26. 26. IF YOU DON’T KNOW HOW THE SYSTEM S H O U L D B E H AV E Y O U C A N ’ T S AY H O W I T SHOULDN’T OR ISN’T.
  27. 27. MONITORING GIVES THE O P E R AT O R I N F O R M AT I O N A B O U T T H E B E H AV I O R O F THE RUNNING SYSTEM.
  28. 28. T H E O P E R AT O R S A N D E N G I N E E R S ARE OFTEN THE SAME PEOPLE.
  29. 29. THERE’S A POTENTIAL FOR A POSITIVE FEEDBACK LOOP OF QUALITY HERE.
  30. 30. “WHILE THE SKILL OF REENTRY WAS EASILY HANDLED BY AUTOMATED SYSTEMS, THE PILOT’S PRIMARY FUNCTION EVOLVED TO BE A REDUNDANT SYSTEM (…) COORDINATING A VARIETY OF CONTROLS AS MUCH AS DIRECTLY CONTROLLING THE VEHICLE.” – DAV I D A . M I N D E L L D I G I TA L A P O L LO : H U M A N A N D M AC H I N E I N S PAC E F L I G H T
  31. 31. exometer
  32. 32. exometer — github.com/Feuerlabs/exometer • Created by Feuerlabs. • Responsive upstream (Ulf Wiger never sleeps?) • Metric collection, aggregation and reporting decoupled. • Static and dynamic configuration. • Very low, predictable runtime overhead.
  33. 33. I M P O R TA N T T E R M S • METRIC: a measurement • ENTRY: a receiver and aggregator of metrics • REPORTER: an entity which samples entries on a regular interval and optionally ships these samples onto a third-system • SUBSCRIPTION: the definition of the regular interval on which reporters sample entries
  34. 34. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  35. 35. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  36. 36. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  37. 37. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  38. 38. Defining Entries {predefined, [ {[erlang,memory], {function, erlang, memory, ['$dp'], value,[ets,binary]}, [] }, {[erlang, gc], {function, erlang, statistics, [garbage_collection], match, {total_coll, rec_wrd, '_'}}, [] }, ! {[erlang, statistics], {function, erlang, statistics, ['$dp'], value, [run_queue]}, [] }, {[boodah, freq_cap, not_found], spiral}, {[boodah, freq_cap, ok], spiral}, {[boodah, freq_cap, timeout], spiral} ]},
  39. 39. Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  40. 40. Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  41. 41. Defining Reporters { reporters, [ { exometer_report_statsd, [ {hostname, "localhost"}, {port, 8125}, {type_map, [ {[erlang,statistics,run_queue], histogram}, {[erlang, gc, tot_coll], histogram}, {[erlang, gc, rec_wrd], histogram}, ! {[erlang,memory,ets], gauge}, {[erlang,memory,binary],gauge}, ! {[boodah,freq_cap,not_found],gauge}, {[boodah,freq_cap,ok],gauge}, {[boodah,freq_cap,timeout],gauge} ]}, !
  42. 42. Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  43. 43. Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  44. 44. Defining Subscriptions { report, [ { subscribers, [ {exometer_report_statsd, [erlang, statistics], run_queue, 1000}, ! ! ! {exometer_report_statsd, [boodah, freq_cap, not_found], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, ok], one, 1000}, {exometer_report_statsd, [boodah, freq_cap, timeout], one, 1000} {exometer_report_statsd, [erlang, gc], tot_coll, 1000}, {exometer_report_statsd, [erlang, gc], rec_wrd, 1000}, ]} ! {exometer_report_statsd, [erlang, memory], ets, 10000}, {exometer_report_statsd, [erlang, memory], binary, 10000}, ! ! ]}
  45. 45. D o i n g i t d y n a m i c a l l y. 1> exometer:new([a, histogram], histogram). ok ! 2> exometer:get_value([a, histogram]). {ok,[{n,0}, {mean,0}, {min,0}, {max,0}, {median,0}, {50,0}, {75,0}, {90,0}, {95,0}, {99,0}, {999,0}]} ! 3> exometer_report:add_reporter( exometer_report_tty, []). ok ! 4> exometer_report:subscribe( exometer_report_tty, [a, histogram], mean, 1000, []). ok ! exometer_report_tty: a_histogram_mean 1393627070:0 exometer_report_tty: a_histogram_mean 1393627071:0 exometer_report_tty: a_histogram_mean 1393627072:0
  46. 46. W H AT A R E W E L O O K I N G F O R ?
  47. 47. • VM killers • System performance regressions • Abnormal system behavior • Surprises
  48. 48. “ A B N O R M A L S Y S T E M B E H AV I O R ” ?
  49. 49. CASE STUDIES
  50. 50. CASE STUDY: ADROLL RTB TIMEOUT SPIKES
  51. 51. Case Study: AdRoll RTB Timeout Spikes PERIODIC BID TIMEOUTS
  52. 52. Case Study: AdRoll RTB Timeout Spikes CONSISTENT SYSTEM LOAD
  53. 53. Case Study: AdRoll RTB Timeout Spikes C O R R E L AT E D N E T W O R K T R A F F I C S P I K E S
  54. 54. Case Study: AdRoll RTB Timeout Spikes C O R R E L AT E D R U N Q U E U E S P I K E S
  55. 55. Case Study: AdRoll RTB Timeout Spikes W H AT HAPPENED?
  56. 56. Case Study: AdRoll RTB Timeout Spikes • Scheduler threads were locked to CPUs • Background process comes on every 20 minutes, consumes a lot of CPU time • No cpu-shield was set up on our production systems • OS bumped a scheduler thread off its CPU, backing up its run-queue
  57. 57. CASE STUDY: EXCHANGE THROTTLING
  58. 58. Case Study: Exchange Throttling H E A L T H Y PAT T E R N O F B I D R E Q U E S T S
  59. 59. Case Study: Exchange Throttling THE TROUGH OF THROTTLING
  60. 60. Case Study: Exchange Throttling BAD GOOD
  61. 61. Case Study: Exchange Throttling PROBLEM CONFIRMED WITH EXCHANGE
  62. 62. Case Study: Exchange Throttling • All other metrics (run-queue, CPU, network IO) were fine. • Confirmed that no changes had been made to the running systems via deployment. • Amazon data showed no network issues to our machines.
  63. 63. Case Study: Exchange Throttling W H AT HAPPENED?
  64. 64. Case Study: Exchange Throttling WE HIT AN IMPLICIT EXCHANGE LIMIT. (ARGUABLY, A BUG.)
  65. 65. LESSONS LEARNED
  66. 66. IT I S P O S S I B L E T O H AV E T O O L I T T L E I N F O R M AT I O N .
  67. 67. “(THE FIREFIGHTERS) TRIED TO BEAT DOWN THE FLAMES (OF REACTOR 4). THEY CHERNOBYL KICKED AT THE BURNING GRAPHITE WITH THEIR FEET. … THE DOCTORS KEPT TELLING THEM THEY’D BEEN POISONED BY GAS.” - - SVETLANA ALEXIEVICH - VO I C E S F RO M C H E R N O BY L : T H E O R A L H I S TO RY O F A N U C L E A R D I SA S T E R
  68. 68. IT IS POSSIBLE TO COLLECT TOO M U C H I N F O R M AT I O N , O R P R E S E N T IT BADLY.
  69. 69. “SAFETY SYSTEMS, SUCH AS WARNING LIGHTS, ARE NECESSARY, BUT THEY HAVE THE POTENTIAL FOR DECEPTION. COMPLEX (…) ONE OF THE LESSONS OF SYSTEMS AND (THREE MILE ISLAND) IS THAT ANY PART OF THE SYSTEM MIGHT BE INTERACTING WITH OTHER PARTS IN UNANTICIPATED WAYS.” - C HARLES PERROW - N ORMAL ACCIDENTS: LIVIN G WITH HIGH-RISK TEC HN OLOGIES
  70. 70. I N D I R E C T K N O W L E D G E M AY N O T T E L L T H E W H O L E S T O R Y , O R M AY M A K E Y O U D O U B T W H AT ’ S P L A I N L Y B E F O R E Y O U R E Y E S .
  71. 71. “THE FIRST DISASTER IN SPACE HAD OCCURRED, AND NO ONE KNEW WHAT HAD HAPPENED. ON THE GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN SURE THAT ANYTHING HAD. ONE REASON FOR THEIR IGNORANCE WAS THE IMPERFECT NATURE OF TELEMETRY FROM THE SPACECRAFT WHICH COULD NOT TELL THEM DIRECTLY THAT AN OXYGEN TANK HAD BLOWN UP.” - - H E N R Y S . F. C O O P E R , J R . - X I I I : T H E A P O L LO F L I G H T T H AT FA I L E D
  72. 72. Things that don’t quite work like you’d hope.
  73. 73. Problems • Instrumenting code increases code size. • While very low, runtime impact is not zero. • Instrumentation of dependencies is up to the library author.
  74. 74. Solutions •Dtrace / systemtap hookups •Culture of instrumentation by default. •Tracing BIFs
  75. 75. W O M B AT ?
  76. 76. Instrumentation by default strategies • Application configuration callback module • Behaviors callbacks • Value-at-time functions
  77. 77. W H AT ABOUT CULTURE
  78. 78. LOOK FOR SOLUTIONS, NOT S C A P E G O AT S .
  79. 79. DON’T BE FOOLED BY OVERCONFIDENCE IN YOUR WORK.
  80. 80. DOCUMENT AND DISCUSS YOUR FA I L U R E S .
  81. 81. EVERYONE MUST WORK T O WA R D T H E SAME END.
  82. 82. KNOW SOME BASIC M AT H E M AT I C S .
  83. 83. PREFER LOOSE COUPLING WHEN POSSIBLE. ! BE EXPLICIT ABOUT TIGHT COUPLING.
  84. 84. MEASURE MUCH, P R E S E N T O N LY T H AT WHICH YOU’RE C E R TA I N T O N E E D .
  85. 85. MEASURE, DON’T GUESS.
  86. 86. BE A C C U R AT E .
  87. 87. QUESTIONS?
  88. 88. <3 @bltroutwine BRIAN@TROUTWINE.US
  89. 89. Bonus Bibliography! XIII: The Apollo Flight that Failed Henry S. F. Cooper, Jr. David A. Mindell Digital Apollo: Human and Machine in Spaceflight Charles Perrow Normal Accidents: Living with High-Risk Technologies Voices from Chernobyl: The Oral History of a Nuclear Disaster Svetlana Alexievich Real-Time Systems: Design Principles for Distributed Embedded Applications Hermann Kopetz Command and Control: Nuclear Weapons, the Damascus Accident and the Illusion of Safety Eric Schlosser

×