Instrumentation as a Living Documentation: Teaching Humans About Complex Systems

Instrumenting complex systems is necessary, and it addresses the problems of static documentation of those systems. Instrumentation has flaws of its own, which an intentional kind of documentation can resolve.

Given at Write the Docs, Portland OR 2014.

  1. 1. Instrumentation as a Living Documentation: Teaching Humans About Complex Systems
  2. 2. I do things to/with computers.
  3. 3. I build real-time systems.
  4. 4. I build distributed systems.
  5. 5. I build critical systems.
  6. 6. AdRoll
  7. 7. LESS THIS
  8. 8. MORE THIS
  9. 9. WE’RE AN AD TECH COMPANY.
  10. 10. REAL-TIME BIDDING
  11. 11. The nature of the problem domain:
      • Low latency (< 100 ms per transaction)
      • Firm real-time system
      • Highly concurrent (> 55 billion transactions per day)
      • Global, 24/7 operation
  12. 12. I build Complex Systems
  13. 13. Complex Systems
      • Non-linear feedback
      • Tightly coupled to external systems
      • Difficult to model, understand
      • Usually a solution to some “wicked problem”
  14. 14. “[WICKED PROBLEMS ARE] SOCIAL PROBLEMS WHICH ARE ILL FORMULATED, WHERE THE INFORMATION IS CONFUSING, WHERE THERE ARE MANY CLIENTS AND DECISION-MAKERS WITH CONFLICTING VALUES, AND WHERE THE RAMIFICATIONS IN THE WHOLE SYSTEM ARE THOROUGHLY CONFUSING. […] THE ADJECTIVE ‘WICKED’ IS SUPPOSED TO DESCRIBE THE MISCHIEVOUS AND EVEN EVIL QUALITY OF THESE PROBLEMS, WHERE PROPOSED ‘SOLUTIONS’ OFTEN TURN OUT TO BE WORSE THAN THE SYMPTOMS.” C. WEST CHURCHMAN, “GUEST EDITORIAL: WICKED PROBLEMS”, MANAGEMENT SCIENCE VOL. 4, 1967
  15. 15. Bad things happen when Complex Systems fail.
  16. 16. Complex Systems often create worse problems than those they solve.
  17. 17. HUMANS ARE BAD AT PREDICTING THE PERFORMANCE OF COMPLEX SYSTEMS (…). OUR ABILITY TO CREATE LARGE AND COMPLEX SYSTEMS FOOLS US INTO BELIEVING THAT WE’RE ALSO ENTITLED TO UNDERSTAND THEM. CARLOS BUENO, “MATURE OPTIMIZATION HANDBOOK”
  18. 18. The key challenge to sustaining a complex system is maintaining our understanding of it.
  19. 19. We write documentation.
  20. 20. Complex systems are fiendishly difficult to communicate about.
  21. 21. Miscommunications are accidents in the making.
  22. 22. Documentation reduces accidents.
  23. 23. IF YOU DON’T KNOW HOW THE SYSTEM SHOULD BEHAVE YOU CAN’T SAY HOW IT SHOULDN’T OR ISN’T.
  24. 24. Trouble is, documentation goes out of date.
  25. 25. Complex Systems evolve and written words “rot” as the system moves on.
  26. 26. Engineers fail to update documentation as the system changes.
  27. 27. ONE OPERATOR (…) WAS CONFUSED BY THE LOGBOOK. HE CALLED SOMEONE ELSE TO INQUIRE. “WHAT SHALL I DO?” HE ASKED. “IN THE PROGRAM THERE ARE INSTRUCTIONS OF WHAT TO DO, AND THEN A LOT OF THINGS CROSSED OUT.” THE OTHER PERSON THOUGHT FOR A MINUTE, THEN REPLIED, “FOLLOW THE CROSSED OUT INSTRUCTIONS.” DAVID E. HOFFMAN, “THE DEAD HAND: THE UNTOLD STORY OF THE COLD WAR ARMS RACE AND ITS DANGEROUS LEGACY”
  28. 28. Engineers can be unaware of the system as it is actually used.
  29. 29. CLEARLY THE TEXTBOOKS (…) DIDN’T TELL YOU WHAT REALLY HAPPENED IN THE FIELD. (…) (T)HERE WAS A WAY YOU WERE SUPPOSED TO DO THINGS – AND THE WAY THINGS GOT DONE. RFHCO SUITS WERE HOT AND CUMBERSOME (…) AND IF A MAINTENANCE TASK COULD BE ACCOMPLISHED QUICKLY WITHOUT AN OFFICER NOTICING, SOMETIMES THE SUITS WEREN’T WORN. ERIC SCHLOSSER, “COMMAND AND CONTROL: NUCLEAR WEAPONS, THE DAMASCUS ACCIDENT, AND THE ILLUSION OF SAFETY”
  30. 30. (Normal) Accidents happen.
  31. 31. THE FIRST DISASTER IN SPACE HAD OCCURRED, AND NO ONE KNEW WHAT HAD HAPPENED. ON THE GROUND, THE FLIGHT CONTROLLERS WERE NOT EVEN SURE THAT ANYTHING HAD. HENRY S. F. COOPER, JR., “XIII: THE APOLLO FLIGHT THAT FAILED”
  32. 32. Documentation doesn’t necessarily reflect the reality of the system.
  33. 33. What can we do?
  34. 34. INSTRUMENTATION
  35. 35. Instrumentation reflects the reality of the system as it exists.
  36. 36. Instrumentation allows users and engineers to explore the system as it exists.
  37. 37. Exploration, done honestly, guides us to a new, better understanding of the system.
  38. 38. THIS “COLLECTIVE ENTITY” WAS ORGANIZED AROUND THE PILOT TO MAKE IT “SAFER AND MORE EFFICIENT IF THERE WAS A FOCAL POINT. AND I WAS THE FOCAL POINT. JIM FED THINGS INTO MY EARS. THE MOON FED THINGS INTO MY EYES AND I COULD FEEL THE MACHINE OPERATING.” COMMANDER DAVID SCOTT, AS QUOTED IN DAVID A. MINDELL'S “DIGITAL APOLLO: HUMAN AND MACHINE IN SPACEFLIGHT”
  39. 39. Instrumentation democratizes the organization around a complex system.
  40. 40. Case Studies
  41. 41. Case Study: Exchange Throttling
  42. 42. Case Study: Exchange Throttling. Healthy pattern of bid requests
  43. 43. Case Study: Exchange Throttling. The trough of throttling
  44. 44. Case Study: Exchange Throttling. BAD / GOOD
  45. 45. Case Study: Exchange Throttling. Problem confirmed with Exchange
  46. 46. Case Study: Exchange Throttling
      • All other metrics (run-queue, CPU, network IO) were fine.
      • Confirmed that no changes had been made to the running systems via deployment.
      • Amazon data showed no network issues to our machines.
  47. 47. Case Study: Exchange Throttling. What happened?
  48. 48. Case Study: Exchange Throttling. We hit an implicit exchange limit. (Arguably, a bug.)
  49. 49. Case Study: Timeout Jumps
  50. 50. Case Study: Timeout Jumps. Healthy pattern of background timeouts
  51. 51. Case Study: Timeout Jumps. Unhealthy timeouts
  52. 52. Case Study: Timeout Jumps. Healthy bid requests
  53. 53. Case Study: Timeout Jumps. Unhealthy bid requests: cliff of throttling
  54. 54. Case Study: Timeout Jumps
      • The timeout jump occurred only in US East; US West was fine.
      • All other metrics (as above) checked out.
      • System deployment strongly correlated with the timeout jump.
      • Rollback to the previous release reduced timeouts to acceptable levels.
  55. 55. Case Study: Timeout Jumps. What happened?
  56. 56. Case Study: Timeout Jumps. Who can say? ¯\_(ツ)_/¯
  57. 57. Lessons Learned
  58. 58. It is possible to have too little information.
  59. 59. (THE FIREFIGHTERS) TRIED TO BEAT DOWN THE FLAMES (OF CHERNOBYL REACTOR 4). THEY KICKED AT THE BURNING GRAPHITE WITH THEIR FEET. … THE DOCTORS KEPT TELLING THEM THEY’D BEEN POISONED BY GAS. SVETLANA ALEXIEVICH, “VOICES FROM CHERNOBYL: THE ORAL HISTORY OF A NUCLEAR DISASTER”
  60. 60. It is possible to collect too much information, or present it badly.
  61. 61. SAFETY SYSTEMS, SUCH AS WARNING LIGHTS, ARE NECESSARY, BUT THEY HAVE THE POTENTIAL FOR DECEPTION. (…) ONE OF THE LESSONS OF COMPLEX SYSTEMS AND (THREE MILE ISLAND) IS THAT ANY PART OF THE SYSTEM MIGHT BE INTERACTING WITH OTHER PARTS IN UNANTICIPATED WAYS. CHARLES PERROW, “NORMAL ACCIDENTS: LIVING WITH HIGH-RISK TECHNOLOGIES”
  62. 62. Instrumentation is not a panacea.
  63. 63. Instruments may be misleading.
  64. 64. Must know some Mathematics.
  65. 65. Too much information hampers interpretation.
  66. 66. Instruments may be inaccurate.
  67. 67. Instruments may be ignored.
  68. 68. Instrumentation may be used for undesirable purposes.
  69. 69. What can we do?
  70. 70. Write documentation!
  71. 71. Misleading Instruments: Context reduces misinterpretations.
  72. 72. Must Know Math: Procedure manuals and visualizations reduce the need for a math background.
  73. 73. Too Much Information: The more contextual layers you add, the more you reduce “big boards of blinky lights”.
  74. 74. INSTRUMENTATION IS LIKE A SUIT. IT NEEDS TO FIT YOUR OWN MIND. VALENTINO VOLONGHI
  75. 75. Inaccuracy: Cross-checks and documented error margins mitigate instrument inaccuracy.
  76. 76. IF YOU DON'T TRUST A COMPUTER BECAUSE SOMETIMES IT DOESN'T TELL YOU THE TRUTH, TELLING IT TO TELL YOU TO TRUST IT IS ASKING IT TO LIE TO YOU SOMETIMES. MIKE SASSAK, CURBSIDE
  77. 77. May Be Ignored: Checklists with references to instrumentation at decision points.
  78. 78. Undesirable Purposes: Collaborative workplaces, cooperatives, unions, laws, etc.
  79. 79. I PROPOSE THAT MEN AND WOMEN BE RETURNED TO WORK AS CONTROLLERS OF MACHINES, AND THAT THE CONTROL OF PEOPLE BY MACHINES BE CURTAILED. I PROPOSE, FURTHER, THAT THE EFFECTS OF CHANGES IN TECHNOLOGY AND ORGANIZATION ON LIFE PATTERNS BE TAKEN INTO CAREFUL CONSIDERATION, AND THAT THE CHANGES BE WITHHELD OR INTRODUCED ON THE BASIS OF THIS CONSIDERATION. KURT VONNEGUT, “PLAYER PIANO”
  80. 80. TL;DR: Instrumentation addresses the problems of documentation; documentation addresses the problems of instrumentation.
  81. 81. Complex Systems need them both.
  82. 82. How do I get started?
  83. 83. Exometer
  84. 84. Dropwizard’s Metrics
  85. 85. Scales
  86. 86. Datadog, New Relic, Librato
  87. 87. Questions?
  88. 88. Thanks! <3 @bltroutwine
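
The deck's closing argument (slides 70 to 81) is that instruments need documentation attached: context, documented error margins, and a checklist step at each decision point. A minimal Python sketch of that idea follows. It is not taken from the talk and does not use the API of Exometer, Dropwizard's Metrics, or Scales; the class name, its fields, and every number in it are hypothetical, chosen only to echo the exchange throttling case study.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class DocumentedGauge:
    """A gauge that carries its own documentation (hypothetical design)."""
    name: str
    unit: str
    description: str      # context: what the number means to a human
    healthy_min: float    # documented lower bound of normal behaviour
    error_margin: float   # documented measurement error, as a fraction
    runbook: str          # checklist step to follow when unhealthy
    samples: list = field(default_factory=list)

    def record(self, value: float) -> None:
        self.samples.append(value)

    def status(self, window: int = 3) -> str:
        """Compare the recent average against the documented healthy bound."""
        if len(self.samples) < window:
            return "unknown: not enough samples"
        recent = mean(self.samples[-window:])
        # Allow for the documented inaccuracy before declaring a problem.
        threshold = self.healthy_min * (1 - self.error_margin)
        if recent < threshold:
            return f"unhealthy: {self.runbook}"
        return "healthy"


# Usage with made-up figures: a healthy pattern followed by a trough.
bid_requests = DocumentedGauge(
    name="bid_requests_per_second",
    unit="requests/s",
    description="Bid requests received from the exchange; "
                "a sustained trough may indicate throttling.",
    healthy_min=1000.0,
    error_margin=0.05,
    runbook="check run-queue, CPU, network IO and recent deploys, "
            "then contact the exchange",
)
for value in [1200, 1180, 1210, 400, 380, 390]:
    bid_requests.record(value)
print(bid_requests.status())
```

Keeping the description, error margin, and runbook step on the instrument itself means the context travels with the reading, so whoever sees the number also sees what it means and what to do next. A real deployment would more likely attach this metadata to metrics emitted through one of the libraries listed above.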
