Building Resilience: How Outages Shaped Etsy's Systems

Video and slides are synchronized; mp3 and slide downloads are available at http://bit.ly/1iyFyTo.

Avleen Vig presents some of the most unexpected, confusing, hilarious and face-palming events during Etsy's outages to show what can be learnt from their problems to build more resilient systems. Filmed at qconlondon.com.

Avleen Vig is a Staff Operations Engineer at Etsy, where he spends much of his time growing the infrastructure for selling knitted gloves and cross-stitch periodic tables. Before joining Etsy he worked at several large tech companies, including EarthLink and Google, as well as a number of small successful startups.

Published in: Technology, Business

Building Resilience: How Outages Shaped Etsy's Systems

  1. 1. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/resiliency-etsy
  2. 2. Presented at QCon London www.qconlondon.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  3. 3. Building resilience How outages shaped Etsy’s systems
  4. 4. Act 1
  5. 5. Quick! Be resilient! http://www.flickr.com/photos/niaid/11854196633/sizes/l/
  6. 6. Quick! Be resilient! • Actually, it’s a slow process • Iterative • Introspective • Horizontal and vertical development
  7. 7. Quick! Be resilient! http://www.flickr.com/photos/ogcodes/6091644301/sizes/l/
  8. 8. Quick! Be resilient! http://www.flickr.com/photos/studio360/1150744342/sizes/o/
  9. 9. Quick! Be resilient! http://www.flickr.com/photos/studio360/1150744368/sizes/o/
  10. 10. Quick! Be resilient! http://www.flickr.com/photos/ogcodes/6091644301/sizes/l/
  11. 11. Quick! Be resilient! Current generation Next generation
  12. 12. Quick! Be resilient! http://www.flickr.com/photos/jurvetson/8671257096/
  13. 13. Quick! Be resilient! http://cudebi.wordpress.com/2012/09/19/tah-pagh-tahbe-o-el-reconocimiento-de-william-shakespeare-en-el-universo-de-star-trek/
  14. 14. Resilience Engineering http://www.flickr.com/photos/freefoto/728651045/sizes/o/
  15. 15. Resilience Engineering • “To Engineer is Human” and “To Forgive Design” - Henry Petroski • “The Field Guide to Understanding Human Error” and “Just Culture” - Sidney Dekker
  16. 16. Act 2
  17. 17. Building resilience at Etsy • Continuous deployment • Metrics, metrics, metrics • Peer review • Postmortems
  18. 18. Building resilience at Etsy • Continuous deployment • Metrics, metrics, metrics • Peer review • Postmortems (all four: Culture)
  19. 19. Postmortems Or: How to win at failing
  20. 20. Constructive cultures • No blame • Open discussion • Focus on improvements
  21. 21. Constructive cultures • No blame • Open discussion • Focus on improvements (all three: Culture)
  22. 22. Destructive cultures “The nail that sticks up gets hammered down” - Japanese proverb
  23. 23. The result?
  24. 24. • #23: Fortune’s “Top 50 best small and medium businesses to work for” • Rapid code iterations and deploys • Lasting relationships • Generosity of spirit • …and much more
  25. 25. Act 3
  26. 26. Doing postmortems? Get Morgue http://github.com/etsy/morgue
  27. 27. Morgue
  28. 28. Morgue
  29. 29. Morgue
  30. 30. Forkistan • Mean time to detect: 0 min • Mean time to recover: 10 mins
  31. 31. Yo Dawg, I Heard You Like Errors.. • Mean time to detect: 2 mins • Mean time to recover: 15 mins
  32. 32. Smashing INT for Fun and Profit • Mean time to detect: 0 min • Mean time to recover: 4 hrs 52 mins
  33. 33. Apache Amnesia • Mean time to detect: 2 hours • Mean time to recover: 5 mins
  34. 34. Continuously Upgrading Databases • Mean time to detect: 2 mins • Mean time to recover: 1 hour (but, not really..)
  35. 35. Q & A Avleen Vig Staff Operations Engineer Etsy, Inc @avleen
  36. 36. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/resiliency-etsy
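
Slides 30-34 each give a detect and a recover time for a single postmortem, so one way to read them together is to average across the five incidents. The following is a minimal sketch, assuming the slide values are per-incident durations and that a plain arithmetic mean is the aggregate of interest. The `incidents` list and the `summarize` helper are illustrative names, not part of Morgue, and the "1 hour (but, not really..)" recovery is taken at face value.

```python
from statistics import mean

# (postmortem title, minutes to detect, minutes to recover), copied from slides 30-34.
incidents = [
    ("Forkistan", 0, 10),
    ("Yo Dawg, I Heard You Like Errors..", 2, 15),
    ("Smashing INT for Fun and Profit", 0, 4 * 60 + 52),
    ("Apache Amnesia", 2 * 60, 5),
    ("Continuously Upgrading Databases", 2, 60),  # slide notes "but, not really.."
]


def summarize(rows):
    """Average detect/recover minutes and name the worst incident for each."""
    return {
        "mttd_min": mean(detect for _, detect, _ in rows),
        "mttr_min": mean(recover for _, _, recover in rows),
        "slowest_to_detect": max(rows, key=lambda row: row[1])[0],
        "slowest_to_recover": max(rows, key=lambda row: row[2])[0],
    }


if __name__ == "__main__":
    summary = summarize(incidents)
    print(f"Mean time to detect:  {summary['mttd_min']:.1f} min")
    print(f"Mean time to recover: {summary['mttr_min']:.1f} min")
    print(f"Slowest to detect:    {summary['slowest_to_detect']}")
    print(f"Slowest to recover:   {summary['slowest_to_recover']}")
```

With these five rows the averages are 24.8 minutes to detect and 76.4 minutes to recover. Each average is dominated by one incident (Apache Amnesia for detection, Smashing INT for recovery), which is a reason to read the individual postmortems, not just the means.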
