OSMC 2019 | Hot Potato by James Forman

Hot Potato is a message broker that sits in between monitoring systems and messaging providers to ensure consistent relaying of messages to on-call staff. It was designed and developed in New Zealand to survive the harshest worst-case scenarios we could come up with in a country prone to natural disasters.

The goal of the project is to give on-call people control and freedom while giving your notifications every chance of arriving, through any provider or connection you might have.

  1. 1. You can talk to me about: ● Making on-call better for humans ● High Availability and Load Balancing ● Mirroring free software ● Zero trust networking ● Home Automation and garage racks ● Pretty much anything James Forman
  2. 2. I am a: Linux Sysadmin, Network Engineer and People Manager I work for:
  3. 3. https://en.wikipedia.org/wiki/File:Wellington_montage_2.jpg https://en.wikipedia.org/wiki/File:New_Zealand_relief_map.jpg
  4. 4. Overlay by: Wikipedia User Hazhk
  5. 5. Catalyst’s Wellington Team - December 2017
  6. 6. OSMC.de 2019
  7. 7. First things first
  8. 8. Content Warning Photos of earthquake damage
  9. 9. Hot Potato is not a monitoring system
  10. 10. It’s a message broker
  11. 11. Life before Hot Potato
  12. 12. One pager team The “pager peeps”
  13. 13. Every person has a pager One number, multiple pagers
  14. 14. A range of different monitoring builds Customer managed, remote, out of country, in country with pager access, email to pager gateways
  15. 15. 18 Monitoring Servers Nagios 3, Icinga 1.x and Icinga2
  16. 16. Support Hotline Call number, leave voicemail, wake person up
  17. 17. IRC based handovers <jforman> Pager on <redacted> Going to sleep
  18. 18. Why did we build it?
  19. 19. We already wanted a replacement system
  20. 20. Aging technology The pager system was becoming unreliable
  21. 21. Image credit: Matthew Inman / theoatmeal.com
  22. 22. The top 3 questions
  23. 23. 1. Why not use a service?
  24. 24. SLAs None of the options could meet our requirements
  25. 25. 2. Why not use SMS?
  26. 26. 3. Why not go staffed 24/7?
  27. 27. Our other motivations
  28. 28. Open Source
  29. 29. Customer data Stop sending notifications in cleartext
  30. 30. https://techcrunch.com/2019/10/30/nhs-pagers-medical-health-data/
  31. 31. We thought we had time to find a replacement
  32. 32. https://www.spark.co.nz/content/dam/kb/public/docs/media-release-paging-network-closure.pdf
  33. 33. We thought we had time to find a replacement
  34. 34. At first it was good news
  35. 35. Aging technology
  36. 36. It was too good to be true
  37. 37. The pager network became unreliable
  38. 38. “In response to Radio New Zealand queries Spark said it had talked to many of its customers before the announcement was made and that included the Fire Service.”
  39. 39. Time ran out (in the middle of the night)
  40. 40. NO CARRIER :(
  41. 41. “1st cab off the rank was those pager numbers that had not signed up to the new pager network were disconnected.”
  42. 42. “We have then worked with the customers who have migrated across to replace their old access points (ways they send a pager message) to either Email or an API option.”
  43. 43. “This is because the old access points are being turned off.”
  44. 44. Photo by: BRENDON O'HAGAN/FAIRFAX NZ
  45. 45. Solving the immediate problem so people could sleep
  46. 46. eMail -> SMS We sent all notifications via SMS as an emergency measure
  47. 47. eMail == :( Nameless project == :)
  48. 48. The first version of Hot Potato
  49. 49. A really bad “API” The worst thing I’ve put into production
  50. 50. A dodgy script Rolled out to all the monitoring servers
  51. 51. Insert and Send Add to database, send pager message
  52. 52. select * from notifications A table of notifications
  53. 53. A handover button sends a message saying you have the pager
  54. 54. v0.1 - Much more reliable than email
  55. 55. It worked (mostly)
  56. 56. It gave us the time and opportunity to do better
  57. 57. We had some goals
  58. 58. Don’t get in the way: make it easy to be on-call
  59. 59. Enable alert reduction: let people sleep
  60. 60. Survive natural hazards: the reality of building systems in NZ
  61. 61. Volcanoes
  62. 62. https://www.nationalgeographic.org/news/plate-tectonics-ring-fire/
  63. 63. Earthquakes
  64. 64. Recent fatal earthquakes 22 February 2011 - Christchurch - 185 people 13 June 2011 - Christchurch - 1 person 14 November 2016 - Kaikoura - 2 people
  65. 65. Diagrams by: Wikipedia User Mikenorton
  66. 66. Photo by: New Zealand Defence Force
  67. 67. Photo by: New Zealand Defence Force
  68. 68. Photo by: RNZ / Rebekah Parsons-King
  69. 69. Photo of: MP Stuart Smith
  70. 70. Photo by: RNZ / Simon Morton
  71. 71. Photo by: RNZ / Conan Young
  72. 72. Photo by: RNZ / Aaron Smale
  73. 73. Photo by: Phillip Pearson
  74. 74. Tsunamis
  75. 75. https://wremo.nz/hazards/tsunami-zones/
  76. 76. https://wremo.nz/hazards/tsunami-zones/
  77. 77. https://wremo.nz/hazards/tsunami-zones/
  78. 78. https://wremo.nz/hazards/tsunami-zones/
  79. 79. Survive any loss of International Connectivity: we had 1 main undersea cable (2 landings)
  80. 80. Image credit: Tourism New Zealand
  81. 81. https://www.submarinecablemap.com/
  82. 82. https://www.submarinecablemap.com/
  83. 83. https://www.submarinecablemap.com/
  84. 84. Then we had some requirements
  85. 85. Survive disasters Earthquakes, tsunamis, volcanoes, team lunches...
  86. 86. Support existing monitoring Nagios3, Icinga 1.x, Icinga2
  87. 87. Get rid of email No more using email to deliver messages
  88. 88. Confirm message delivery Move from paging and SMS to Push Notifications
  89. 89. Improve handover Is your pager on yet? I want to go to sleep
  90. 90. #deathTo Pagers “I’d rather have a bee burrow into my skull than carry a pager again” - Me
  91. 91. What did we build?
  92. 92. A web app with an API built with Python and Flask
  93. 93. With a funky database and some queuing CockroachDB and RabbitMQ
  94. 94. Our production environment has 5 nodes NZ: Porirua, Wellington and Hamilton AU: Sydney US: California
  95. 95. How does it work?
  96. 96. Sending notifications
  97. 97. Heartbeats
  98. 98. How does it look?
  99. 99. What else can it do?
  100. 100. Failure notifications: the pager network is down again!
  101. 101. Heartbeats: ensuring connectivity
  102. 102. Teams: put everyone on-call!
  103. 103. Team escalations: because sometimes bad things happen
  104. 104. Reports A breakdown of the week that was
  105. 105. Promote alert reduction With the help of some neopixels
  106. 106. What notification providers does it support?
  107. 107. Twilio For delivery of SMS messages
  108. 108. Modica For delivery of SMS messages and pager messages
  109. 109. Pushover For delivery of push notifications to Android and iOS
  110. 110. What’s planned?
  111. 111. Mobile app: for Android and iOS, no more pagers
  112. 112. Support hotline: direct calls to the on-call person or take messages
  113. 113. Planned work: stop forgetting to extend downtime on things
  114. 114. Language support German and Italian coming soon
  115. 115. What do I need to try it?
  116. 116. What do I need to deploy it to production?
  117. 117. One server If you don’t want redundancy, you don’t have to have it
  118. 118. Demo?
  119. 119. James Forman Callum Dickinson Filip Vujičić Zac Pullar-Strecker Opal Symes Rhys Davies Michael Fincham Tim Bruce Jamie McClymont Toni Gardener Manuela Spies Sapir Ben-Shahar Brynn Wilde Hemanth Sonthi Emanuel Evans Hazel Meehan Baxter Gray Sam Banks Thank you to our contributors
  120. 120. Open Source Academy
  121. 121. https://hotpotato.nz
  122. 122. Questions?
  123. 123. https://hotpotato.nz @teamHotPotato #hotpotato on freenode
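
The slides mention heartbeats "ensuring connectivity" without describing the mechanism. One common way to do this, shown here purely as an assumption and not as Hot Potato's implementation, is for each monitoring server to check in periodically and for the broker to flag any server whose last check-in is too old. The function name `overdue_servers`, the five-minute default, and the server names are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def overdue_servers(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # assumed threshold, not from the talk
) -> List[str]:
    """Return the servers whose last heartbeat is older than max_age, sorted by name."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_age)


if __name__ == "__main__":
    now = datetime(2019, 11, 6, 12, 0)
    # Hypothetical monitoring servers and their last check-in times.
    last_seen = {
        "mon-a": now - timedelta(minutes=1),
        "mon-b": now - timedelta(minutes=12),
    }
    for name in overdue_servers(last_seen, now):
        print(f"heartbeat overdue: {name}")
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test.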
