Do not panic! (dealing with major incidents)

Index:
- Examples of incidents
- How to be prepared
- How to react

Published in: Internet

  1. Do not panic! Dealing with major incidents. Sergio Arcos Sebastian, 2017-07-06
  2. Challenge failed
  3. Everything started here...
       > SELECT a.id, c.id FROM accounts a JOIN credentials c…
       550 rows
       > SELECT string_agg(c.id::text, ',') FROM accounts a JOIN credentials c…
       1,2,3...
       $ a = Account.where(:id => [1,2,3, ...])
       $ a.count
       350 rows
       $ a.destroy_all
  8. Challenge considered
  9. Define a Major Incident = << we don’t care about toilet paper, as long as there’s at least one roll left >>
     - Urgency
     - Impact
     - Severity
     - Priority
  10. Alert & monitoring system
  11. Incident notification platform (phone, SMS, push, ...)
  12. Incident repository / Status page (GitHub, New Relic)
  13. Landing page
  14. Minimum contingency plan
      << The backup plan costs more than fixing the incident >>
      Model          Affected Guests  Business Repercussion  Team Members  ...
      Doorkeeper     All              Critical               1
      AdminPanel     Internal         Low                    1
      Permitted      Partners         High                   1
      Uploads        Paying           High                   2
      Notifications  Free             Low                    1
  15. Follow best code practices
      - Version your endpoints
      - Split your endpoints (add/remove) (micro-service)
      - Apply small changes at once
      - Roll out frequency
      - Idempotency
      - Flag as deleted
      - Be paranoid
  16. Follow best infrastructure practices
      - Defense in depth (also known as the Castle Approach)
      - Use canaries (blue/green deployment) & rollback
      - Automatic fallbacks (reboot if it is down)
      - Use API gateways
      - Backups, replication, redundancy, …
      - Dead letter queues
      - Logs (when, where, who, what)
      - Infrastructure as code (even ENV variables!)
      - Disaster-recovery testing (e.g. Chaos Monkey)
  17. Challenge accepted
  18. Workflow (template)
      1. Stop!
      2. Delay worse consequences
      3. Communicate to your team
      4. Pair
      5. Write next steps
      6. Log everything
      7. Fix it
      8. Add asserts
  19. Easiest mistakes
      - Do not keep it hidden
      - Do not bypass your CI
      - Do not fix it at any cost
      - Interrupt your boss’ meeting if needed
      - Experience makes you feel more comfortable
      - Knowledge makes you fix the issue
      - Your stakeholders should be informed
      - Do not finger point
  20. Iterate your custom process
      - Do a retrospective with your team
      - Survey your stakeholders
      - Review your statistics to ensure you don’t underestimate it
      - Do a post-mortem
      - Create or update your documentation
      - Increase your number of assertions
      - Automate
  21. martes13.net, hjdl.space
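
The incident on slide 3 (a query that reported 550 rows, a record set that counted 350, then `destroy_all`) and the advice to "Flag as deleted", "Be paranoid" and "Add asserts" can be combined into one guard. The following is only a minimal sketch in plain Ruby, not the author's code: `Record`, `CountMismatchError` and `guarded_soft_delete` are hypothetical names, and an in-memory array stands in for the database. It refuses to touch anything when the number of matched records differs from what the operator expected, and it flags records instead of removing them.

```ruby
# Hypothetical in-memory stand-in for a database row (assumption, not from the deck).
Record = Struct.new(:id, :deleted_at)

# Raised when the matched set is not the size the operator expected.
class CountMismatchError < StandardError; end

# Flags the live records whose id is in `ids` as deleted, but only when the
# number of matches equals `expected_count`. On a mismatch nothing is modified.
def guarded_soft_delete(records, ids, expected_count:, now: Time.now)
  targets = records.select { |r| ids.include?(r.id) && r.deleted_at.nil? }
  if targets.size != expected_count
    raise CountMismatchError,
          "expected #{expected_count} records, matched #{targets.size}; aborting"
  end
  targets.each { |r| r.deleted_at = now } # flag as deleted, do not remove
  targets.size
end

records = (1..5).map { |i| Record.new(i, nil) }

# Counts agree: records 1 and 2 are flagged, all five rows still exist.
flagged = guarded_soft_delete(records, [1, 2], expected_count: 2)

# Counts disagree (3 matches, 5 expected): the guard raises before any change.
begin
  guarded_soft_delete(records, [3, 4, 5], expected_count: 5)
  mismatch_caught = false
rescue CountMismatchError
  mismatch_caught = true
end
```

In a real application the same idea would presumably be expressed against the ORM and wrapped in a transaction; the point of the sketch is only that the count check runs before the destructive step, and that a flagged record can still be restored.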
