Copyright © 2020 OverOps. All rights reserved.
Thank you for joining us!
We will begin in just a few moments
Move Fast. Fix Faster.
How Expedia Improved Developer Productivity and Reduced MTTR by Over 90%
September 29, 2020
Welcome...
Director of Technology
Gavan McLaughlin
Thanks for joining!
Ravi Vankamamidi
Site Reliability Engineer
VP, Solution Engineering
Eric Mizell
Today We’re Covering
● Engineering at Expedia
  The role of engineering in delivering a seamless experience to Expedia customers, especially during COVID-19
● Code Quality and Reliability Challenges
  A look into the Expedia CI/CD pipeline and troubleshooting process
● OverOps & Continuous Reliability at Expedia
  Expedia’s enhanced reliability strategy and how OverOps helped reduce MTTI & MTTR
● Q&A
Engineering at Expedia
About Expedia
Expedia Group is the world’s leading travel platform, with brands such as Expedia.com, Hotwire.com and Vrbo.com, among others.
Serving travelers is at the center of Expedia’s business.
Conversation Platform:
● Meeting the traveler on a channel of her choice
● Enabling travelers to self-serve (using the Virtual Agent)
● Connecting the traveler to a human agent for complex issues
Expedia Group - Conversation Platform
The Conversation Platform enables the exchange of multi-participant, multi-channel messages: Social, Chat, Voice, Email, SMS, etc.
● Event-driven architecture:
  - Loose coupling
  - Microservices (Fulfillment, Translation, Routing, Bot runtime, Plug-ins, etc.)
  - Easy to introduce new technology
● Distributed and scalable: high complexity
Bottom line: a complex system that requires quick troubleshooting
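To make the "loose coupling" point concrete, here is a minimal sketch of an event-driven flow. It is not Expedia's implementation: the `EventBus` class, the topic names, and the routing rule are all hypothetical, and the in-memory bus only stands in for a real broker (the deck lists Kafka in the tech stack). The point it illustrates is that the translation and routing steps never call each other directly; each one only subscribes to a topic and publishes to another.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Hypothetical in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver the event to every handler registered for this topic.
        for handler in list(self._handlers[topic]):
            handler(event)


bus = EventBus()
delivered: List[dict] = []


def translate(event: dict) -> None:
    # Placeholder "translation": only tags the message with a language.
    bus.publish("message.translated", {**event, "lang": "en"})


def route(event: dict) -> None:
    # Assumed routing rule, purely for illustration.
    queue = "virtual-agent" if event["channel"] == "chat" else "human-agent"
    delivered.append({**event, "queue": queue})


bus.subscribe("message.received", translate)
bus.subscribe("message.translated", route)
bus.publish("message.received", {"channel": "chat", "text": "Hola"})
```

Adding a new service in this style means adding one more `subscribe` call, with no change to the existing handlers. That is also why a failure can be hard to trace: the path a message takes is spread across independent handlers.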
SRE Principles at Expedia
Quality at the Source
Data Driven Decision Making
Continuous Improvement
Shared Responsibility
Protect The Platform
What does an application error mean to Expedia’s customers during COVID-19?
Booked a non-refundable trip to Europe.
Pandemic hits; travelers can no longer get to their destination.
What happens to their hotel reservation? What if they can’t reach customer service?
Code Quality and Reliability Challenges
Expedia’s Architecture & Tech Stack
Tech stack:
- Microservices (Java, Kotlin, Python, Node.js)
- Kafka (Stream processing, KSQL)
- GraphQL, RESTful services, Eventing
- AWS, NoSQL-DB
- Configuration Manager (Versioning, Intelligent defaults)
- ReactJS, Redux, TypeScript
The CI/CD Pipeline & Troubleshooting Workflow
[Diagram. Horizontal axis: Release cycle progress →. Stages: Code (Development), then Build & Unit Tests, Integration Tests, Staging, Production (CI/CD Pipeline), with three Go/No Go gates between stages. Visibility →: Code Review, Static Analysis, Testing, Metrics, Log Analysis, APM & Tracing.]
Top Challenges
● Speed, quality and cost are constantly at odds.
● Communication is difficult across engineering teams.
● Learning as a large organization is challenging.
[Diagram labels: COST, QUALITY, SPEED]
Continuous Reliability at Expedia
Why Expedia Chose OverOps?
1. Prevention is better than the cure
● Issues are detected prior to reaching Production
● All context needed to reproduce is in the error snapshot (variables & code executed)
2. Minimal developer effort to onboard
● Exceptions are identified regardless of whether they were logged or not
● Integrates automatically with existing log events
3. “All you can eat” license
● Ever growing in cloud, reduced maintenance/stress
● No need to worry about exceeding license limits
OverOps Across the Expedia Pipeline
Automated error detection and complete context for every software defect
[Diagram. Dynamic Analysis spans Code (Development) and Build & Unit Tests, Integration Tests, Staging, Production (CI/CD Pipeline), with a feedback/fix loop to Developers, QA, SRE/Ops. Questions answered: Is it new or critical? Why did it break? Who is responsible?]
Getting Started with OverOps
⇨ Installation for each microservice baked into standard images
⇨ Toggleable installation by development teams (controlled in Git)
⇨ 2-week “soak” after installation to calibrate against known defects and create a baseline
⇨ OverOps data collected at each environment as code is executed
⇨ Go/no-go decision tree after E2E tests based on defined “Critical” exceptions
⇨ Joint effort to implement with support from OverOps services team
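The go/no-go step can be sketched roughly as follows. This is an illustration under stated assumptions, not OverOps' API or Expedia's actual decision tree: `ExceptionEvent`, `go_no_go`, the set of "Critical" exception types, and the baseline set are all hypothetical names. The sketch assumes that exceptions already recorded in the soak baseline do not block a release, and that only new exceptions of a type the team has defined as critical do.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class ExceptionEvent:
    """Hypothetical record of one exception seen during E2E tests."""
    exception_type: str
    location: str  # e.g. "BookingService.cancel" (made-up name)

    @property
    def fingerprint(self) -> Tuple[str, str]:
        return (self.exception_type, self.location)


def go_no_go(
    events: Iterable[ExceptionEvent],
    critical_types: Set[str],
    baseline: Set[Tuple[str, str]],
) -> Tuple[bool, List[ExceptionEvent]]:
    """Return (go, blockers): go is True only when no new critical exception was seen."""
    blockers = [
        e for e in events
        if e.exception_type in critical_types and e.fingerprint not in baseline
    ]
    return (len(blockers) == 0, blockers)


# Assumed inputs: one critical type, one known defect from the soak period.
CRITICAL = {"NullPointerException"}
BASELINE = {("NullPointerException", "LegacyAdapter.parse")}

events = [
    ExceptionEvent("NullPointerException", "LegacyAdapter.parse"),    # known, in baseline
    ExceptionEvent("TimeoutException", "RoutingService.send"),        # not critical
    ExceptionEvent("NullPointerException", "BookingService.cancel"),  # new and critical
]
go, blockers = go_no_go(events, CRITICAL, BASELINE)
```

Here `go` is `False` and `blockers` holds only the new NullPointerException, so the pipeline would stop before promoting the build. Keeping the critical list and baseline as plain data fits the "controlled in Git" approach described above, since a team could review changes to them like any other code change.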
Solving a New NullPointerException
Development Team A was hit with a production issue caused by a NullPointerException that took over 4 hours to resolve.
In the post-mortem, the team realized that using OverOps would have made it vastly easier to detect and resolve the issue.
Post Incident:
The team now addresses NullPointerExceptions before they reach production, and triages new production NullPointerExceptions immediately following release.
P1 Bug in Production
Team B was tasked with solving a P1 bug in production.
They identified the exception that was causing it in their log aggregation tool, then followed the tiny URL to the OverOps error snapshot with the related code and variables.
Result:
The team was able to reproduce the error and began working on a hotfix in under 15 minutes.
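For a sense of scale, the two figures quoted in these stories can be put side by side. This is only arithmetic on the slide numbers, not a measured MTTR: the incidents belong to different teams, Team A's "over 4 hours" is a lower bound, and Team B's "under 15 minutes" is time to start the hotfix rather than time to full resolution. The function name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Percentage by which after_minutes is lower than before_minutes."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) / before_minutes * 100


team_a_minutes = 4 * 60  # "over 4 hours" to resolve, taken at its lower bound
team_b_minutes = 15      # "< 15 minutes" to reproduce and start the hotfix

reduction = percent_reduction(team_a_minutes, team_b_minutes)
```

With these inputs `reduction` comes to 93.75, which is in line with the "over 90%" in the webinar title, with the caveats above.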
OverOps’ Business Value for Expedia
1. Quantifiable measurement of quality leads to a better experience for the customer
2. Prevention of impact and time saved on troubleshooting translate to more time spent on innovation and open up opportunities (e.g., deploying a hotfix instead of rolling back)
3. Improvement of quality and reliability without major impact on speed of delivery
Demo
We’re Hiring!
https://lifeatexpediagroup.com/jobs
Come along on our journey to bring the world within reach
Questions?
Director of Technology
Gavan McLaughlin
Ravi Vankamamidi
Site Reliability Engineer
VP, Solution Engineering
Eric Mizell
Thanks for attending!
Start a free trial at overops.com
Follow Expedia Group Technology @ExpediaGroupEng
