Expedia's Journey toward Site Resiliency

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2oAuPOe.

Sahar Samiei and Willie Wheeler share Expedia’s resiliency journey, starting with resiliency as an afterthought and progressing toward resiliency as a first-class concern. They talk about the importance of partnering with the teams experiencing operational struggles, and equipping them with the data to make the right investments at the right time. Filmed at qconsf.com.

Sahar Samiei is a Senior Product Manager leading the Site Reliability Program at Expedia. She has been with Expedia for eight years, working across different teams, focusing on operations. Willie Wheeler is a Principal Application Engineer at Expedia, with 20 years of professional software development experience. At Expedia he serves as a resiliency champion within the engineering organization.

Expedia's Journey toward Site Resiliency

  1. 1. Expedia’s Journey Toward Site Resilience. Willie Wheeler (@williewheeler), Principal Applications Engineer; Sahar Samiei (@saharsamiei), Senior Technical Product Manager. Copyright (c) 2017 Expedia, Inc. All rights reserved.
  2. 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/expedia-website-resiliency
  3. 3. Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: • practitioner-driven conference designed for YOU: influencers of change and innovation in your teams • speakers and topics driving the evolution and innovation • connecting and catalyzing the influencers and innovators. Highlights: • attended by more than 12,000 delegates since 2007 • held in 9 cities worldwide. Presented at QCon San Francisco, www.qconsf.com
  4. 4. In 2016, Expedia was the 11th-largest Internet company by revenue at $8.77B. Source: https://en.wikipedia.org/wiki/List_of_largest_Internet_companies
  5. 5. The Price of Downtime: A Back-of-the-Envelope Calculation
     Uptime    Downtime per year    Potential Revenue Loss
     99%       3.65 days            $87.7M
     99.9%     8.76 hours           $8.77M
     99.99%    52.56 minutes        $877K
  6. 6. So keeping our site up protects tens of millions of dollars of revenue per year.
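The back-of-the-envelope figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming revenue is spread evenly over a 365-day year (a simplification; real traffic is seasonal). The function and constant names are illustrative and do not come from the talk.

```python
# Back-of-the-envelope downtime cost, assuming revenue accrues evenly
# across a 365-day year (an assumption, not a claim from the talk).
ANNUAL_REVENUE_USD = 8.77e9          # 2016 revenue figure quoted on the slide
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_cost(uptime_pct, annual_revenue=ANNUAL_REVENUE_USD):
    """Return (downtime in minutes per year, potential revenue loss in USD)."""
    downtime_fraction = (100 - uptime_pct) / 100
    return (downtime_fraction * MINUTES_PER_YEAR,
            downtime_fraction * annual_revenue)


for pct in (99.0, 99.9, 99.99):
    minutes, loss = downtime_cost(pct)
    print(f"{pct}% uptime: {minutes:,.2f} min/year down, ${loss:,.0f} at risk")
```

Each additional "nine" cuts both the downtime budget and the revenue at risk by a factor of ten, which is the point the slide makes.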
  7. 7. Overview: Resilience Challenges • Expedia’s Strategy • Technical Approach • Results to Date • Q & A
  8. 8. Resilience Challenges
  9. 9. Resilience is not always treated like a first-class citizen.
  10. 10. At Expedia we have a test-and-learn culture.
  11. 11. Innovation at Expedia is about constantly iterating on its products and features.
  12. 12. Resilience is an uphill battle.
  13. 13. Too many competing priorities.
  14. 14. There are major misconceptions about resilience.
  15. 15. Myth: By migrating to AWS your service will be resilient.
  16. 16. Myth: We can’t do anything when S3 has an outage.
  17. 17. Myth: You can’t build resilient services on flaky infrastructure.
  18. 18. Team Autonomy
  19. 19. Team autonomy is one of Expedia’s core beliefs. It’s great to have latitude to establish your own practices, tools and processes, but it can add complexity.
  20. 20. We are still learning.
  21. 21. Resilience is not free.
  22. 22. Expedia Strategy
  23. 23. We put travelers at the heart of our thinking. To organize ourselves to ensure maximum availability of our websites – even during odd conditions – we started an initiative called “Perceived Zero Downtime”.
  24. 24. We make incremental improvements.
  25. 25. We empower teams by providing tools & best practices. We humbly listen to different approaches. We are not biased. We are open to other teams’ suggestions.
  26. 26. We embrace team autonomy. We set up a forum to bring teams from different brands together to share ideas, tools and best practices.
  27. 27. We create a shared learning space.
  28. 28. Introduce resilience champions.
  29. 29. We invest lots of time to gather and present data.
  30. 30. Tier 1 dependency failure: product with a fallback vs. product without a fallback [comparison figure]
  31. 31. Pursue small wins by going after low-hanging fruit.
  32. 32. We are partnering with teams.
  33. 33. Technical Approach (Photo: Hans Splinter)
  34. 34. Lots of teams. Lots of tools.
  35. 35. How do we pull it all together?
  36. 36. Resilience Engineering Lifecycle [diagram: ten services, two databases, two 3rd-party services]
  37. 37. Resilience Engineering Lifecycle [same diagram] 1. Prioritize Services
  38. 38. Resilience Engineering Lifecycle [same diagram] 1. Prioritize Services 2. Investigate Vulnerabilities
  39. 39. Resilience Engineering Lifecycle [diagram: two services, a database, a 3rd-party service] 1. Prioritize Services 2. Investigate Vulnerabilities 3. Apply Resilience Patterns
  40. 40. Resilience Engineering Lifecycle [same diagram] 1. Prioritize Services 2. Investigate Vulnerabilities 3. Apply Resilience Patterns 4. Experiments in Test
  41. 41. Resilience Engineering Lifecycle [same diagram] 1. Prioritize Services 2. Investigate Vulnerabilities 3. Apply Resilience Patterns 4. Experiments in Test 5. Experiments in Production
  42. 42. Let’s take a closer look…
  43. 43. 1. Prioritize Services: Service Tiering • Scorecards & Reporting
  44. 44. 1. Prioritize Services: Service Tiering • Tier 1 – Essential • Tier 2 – Important • Tier 3 – Nice to have
  45. 45. 1. Prioritize Services: Scorecards & Reporting • Incidents • Availability • Resilience
  46. 46. 1. Prioritize Services: Resilience Scorecard • Resilience Alerts. Example alert: "Your app failed a Chaos Monkey experiment. Chaos Monkey terminated an EC2 instance, causing an outage. To resolve this, please use EC2 Auto Scaling Groups."
  47. 47. Focus areas: identified. What are the vulnerabilities?
  48. 48. 2. Investigate Vulnerabilities: Resilience Maturity Model (analytic) • Interactive Experiments (empirical)
  49. 49. 2. Investigate Vulnerabilities: Resilience Maturity Model • Survive instance loss • Survive dependency loss • Survive AZ loss • Survive region loss • And so forth…
  50. 50. 2. Investigate Vulnerabilities: Interactive Experiments • Interactive attack tools • Discover vulnerabilities
  51. 51. Now we understand some vulnerabilities. How do we address them?
  52. 52. 3. Apply Resilience Patterns: Multi-Geo • Rate Limiting • Auto-Scaling • Database Failover • Circuit Breakers • Bulkheads
  53. 53. Defend your app with Hystrix. An open source circuit breaker implementation by Netflix: • Start closed • Trip (open) when downstream is unhealthy • Close when downstream is healthy again [diagram: UI, API, Database, External API]
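The closed / open / closed-again cycle described on this slide can be illustrated with a small generic circuit breaker. This is a sketch in Python, not Hystrix (which is a Java library) and not Expedia's implementation. The class name, thresholds, and the half-open trial step are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are hypothetical."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock           # injectable so tests can control time
        self.failures = 0
        self.opened_at = None        # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"       # allow one trial call through
        return "open"

    def call(self, func, *args, fallback=None, **kwargs):
        if self.state == "open":
            # Fail fast instead of waiting on an unhealthy downstream.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("downstream marked unhealthy")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            if fallback is not None:
                return fallback()
            raise
        # Success: downstream looks healthy again, so close the circuit.
        self.failures = 0
        self.opened_at = None
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call in half-open state re-opens immediately.
        if self.opened_at is not None or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would wrap each downstream request, for example `breaker.call(fetch_prices, fallback=cached_prices)`, where `fetch_prices` and `cached_prices` are placeholders for the real call and its degraded alternative.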
  54. 54. Defend your app with Bulkheads • Isolate resources to avoid unintended interactions between apps • Avoid "tragedy of the commons"
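One common way to build a bulkhead is to cap the number of concurrent calls each dependency may hold, so a slow dependency cannot exhaust a shared pool. The sketch below uses a semaphore per dependency. It is one possible realization, assumed for illustration, and the class and parameter names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slots."""


class Bulkhead:
    """Caps concurrent calls to one dependency (illustrative sketch)."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately rather than queue, so callers of a saturated
        # dependency cannot tie up threads needed by other dependencies.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream its own `Bulkhead` instance keeps one saturated dependency from consuming capacity that the others need.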
  55. 55. On our side, it is not a production incident because no client started to use this new platform.
  56. 56. Now we need to test our defenses.
  57. 57. 4. Experiments in Test. Pipeline stages: build, deploy to test, resilience tests, perf tests, security scan, deploy to prod, release to prod
  58. 58. Record test results
  59. 59. 4. Experiments in Test
      $ python test3.py
      Baseline city count is 8.
      Response during attack returned a city count of 1.
      Response after attack returned a city count of 8.
      ********* Test success. *********
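The source of `test3.py` is not shown in the slides. The following is a hypothetical sketch of how a pipeline resilience test with similar output might be structured: measure a baseline, attack a dependency, confirm the service degrades instead of failing, then confirm it recovers. The pass criterion and the in-memory `FakeCityService` are assumptions made so the example runs on its own.

```python
def run_resilience_test(get_city_count, start_attack, stop_attack):
    """Compare responses before, during, and after an attack.

    Assumed pass criterion: the service still returns some result while
    under attack and returns to its baseline once the attack stops.
    """
    baseline = get_city_count()
    print(f"Baseline city count is {baseline}.")

    start_attack()
    try:
        during = get_city_count()
    finally:
        stop_attack()               # always restore the dependency
    print(f"Response during attack returned a city count of {during}.")

    after = get_city_count()
    print(f"Response after attack returned a city count of {after}.")

    passed = during > 0 and after == baseline
    print("Test success." if passed else "Test failure.")
    return passed


class FakeCityService:
    """Stand-in for a real service; the numbers are made up for the demo."""

    def __init__(self):
        self.dependency_up = True

    def city_count(self):
        # With the dependency down, serve a reduced fallback result.
        return 8 if self.dependency_up else 1

    def break_dependency(self):
        self.dependency_up = False

    def restore_dependency(self):
        self.dependency_up = True


if __name__ == "__main__":
    svc = FakeCityService()
    run_resilience_test(svc.city_count, svc.break_dependency,
                        svc.restore_dependency)
```

In a real pipeline the three callables would hit the deployed test environment and an attack tool rather than an in-memory fake.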
  60. 60. Pipeline tests are great. But we do the real thing in prod.
  61. 61. 5. Production-Based Experiments: Randomized Attacks • Health Checks
  62. 62. 5. Production-Based Experiments: Randomized Attacks • Netflix Simian Army • Self-service whitelist (autonomous teams)
  63. 63. 5. Production-Based Experiments: Health Checks • Pre-, during and post-attack • Pluggable checks
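Pluggable checks that run before, during, and after an attack could be wired together roughly as follows. This is a sketch under assumptions: checks are plain callables returning a boolean, and the experiment is aborted when the system is already unhealthy beforehand. Neither detail is stated in the talk, and all names are illustrative.

```python
from typing import Callable, Dict

HealthCheck = Callable[[], bool]


def run_checks(checks: Dict[str, HealthCheck]) -> Dict[str, bool]:
    """Run every registered check and record whether it passed."""
    return {name: bool(check()) for name, check in checks.items()}


def run_experiment(start_attack: Callable[[], None],
                   stop_attack: Callable[[], None],
                   checks: Dict[str, HealthCheck]) -> dict:
    """Run checks pre-, during and post-attack and return a report."""
    report = {"pre": run_checks(checks)}
    # Assumed policy: never attack a system that is already unhealthy.
    if not all(report["pre"].values()):
        report["aborted"] = True
        return report
    report["aborted"] = False

    start_attack()
    try:
        report["during"] = run_checks(checks)
    finally:
        stop_attack()               # stop the attack even if a check raises
    report["post"] = run_checks(checks)
    return report
```

Because checks are passed in as a dictionary of callables, each team can register its own checks without changing the experiment runner.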
  64. 64. Recap [diagram: Service in Development, Service in Test, Service in Production]
  65. 65. Results to Date (Photo: Mycatkins)
  66. 66. We've had a good start over the past year.
  67. 67. Results to Date • Chaos Monkey running daily in production since May 15 • Added resilience tests to four Tier 1 service pipelines • Increased organizational awareness • Established resilience community of practice
  68. 68. Results to Date. Dev team engagement has been a struggle: • Limited team capacity • Product > resilience
  69. 69. Pivot for 2018: Automation. Reduce the cost of resilience engineering through automation • Test by discovery • Visibility • Proxy-based resilience
  70. 70. What did we learn? • Resilience has a cost! • Teams have multiple competing priorities • There are major misconceptions • Reduce friction • Prepare for early effort
  71. 71. What worked well? • Add complexity incrementally • Try multiple technical approaches • Make adoption easy • Drive for experiments in production • Represent metrics • Start with low-hanging fruit
  72. 72. Questions & Answers