Scaling DevOps at Uber
Kiran Bondalapati, Uber
Igniting opportunity by setting the world in motion
10+ billion trips
15M+ trips per day
6 continents, 65 countries and 600+ cities
75M active monthly users
3M+ active drivers
16,000+ employees worldwide
3000+ developers worldwide
Bits + Atoms
12/31/15 - 1 Billion Trips
6/18/16 - 2 Billion Trips
5/20/17 - 5 Billion Trips
6/10/18 - 10 Billion Trips
Business
1000s of Microservices
1000s of builds per day
10000+ deployments per day
100K+ service containers per cluster
~1M batch containers per day
DevOps
CODE
BUILD
DEPLOY
TEST
RUN
MONITOR
DevOps
Pre-history PHP (outsourced)
Marketplace Node.JS, moving to Go
Core Services Python, moving to Go, Java
Maps Python and Java
Data Python and Java
Metrics Go
Code
20000+ repos
Multiple languages and frameworks
Multiple communication protocols
Microservices: March 2016
Microservices: March 2016 and March 2018
4000+ builds per day
Build times affect developer productivity
Build sizes affect deployments
Build
Build without docker
Optimize layer generation
Distributed cache for intermediate layers
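One way to picture a distributed cache for intermediate layers is a content-addressed lookup: a layer's key is derived from its parent layer, the build step, and the digests of the step's inputs, so any build host that computes the same key can reuse the layer. The sketch below is illustrative only. `layer_key` and `LayerCache` are hypothetical names, and an in-memory dict stands in for whatever shared store the real system uses.

```python
import hashlib
from typing import Callable, Iterable


def layer_key(parent_key: str, instruction: str, input_digests: Iterable[str]) -> str:
    """Derive a cache key from the parent layer, the build step, and its inputs."""
    h = hashlib.sha256()
    h.update(parent_key.encode())
    h.update(b"\0")
    h.update(instruction.encode())
    # Sort the digests so the key does not depend on file listing order.
    for digest in sorted(input_digests):
        h.update(b"\0")
        h.update(digest.encode())
    return h.hexdigest()


class LayerCache:
    """In-memory stand-in (assumption) for a cache shared by all build hosts."""

    def __init__(self) -> None:
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_build(self, key: str, build_fn: Callable[[], str]) -> str:
        """Return the cached layer for `key`, building and storing it on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        layer = build_fn()
        self._store[key] = layer
        return layer
```

Because the key changes whenever the instruction or any input changes, a stale layer is never returned; the cost is that an early change invalidates every later layer, which is why layer ordering matters for cache hit rates.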
100s of services pulling 1000s of images from the registry
Deploy
Vertical Scaling
Horizontal Scaling
P2P Distribution - Scales with Load
Reproduce Halloween and New Year
Systemic issues are hard in unit tests
Cascading failures are common in real life
Test
Hailstorm load testing framework
uDestroy random failure injection framework
Regular failure and failover drills
no testee … no workee
Containers are sized for peak load
Dynamic utilization affects cluster efficiency
Typical auto-scaling does not help
Run
Combine responsive and revocable tasks
Oversubscribe resources
Rate limiting of revocation
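The three points above can be read together: revocable (batch) tasks borrow capacity that responsive services are not using, and when a service needs the capacity back, revocations are rate limited so the batch workload is not wiped out at once. A minimal sketch, assuming a token bucket as the limiter; the slides do not say which mechanism is used, and `RevocationLimiter` and `reclaim` are hypothetical names.

```python
class RevocationLimiter:
    """Token bucket: about `rate` revocations per second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill tokens for the elapsed time, then spend one if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def reclaim(revocable_tasks, needed_cpus, limiter, now):
    """Revoke tasks until `needed_cpus` are freed or the limiter refuses.

    `revocable_tasks` is a list of (task_name, cpus) pairs.
    Returns the revoked task names and the CPUs actually freed.
    """
    revoked = []
    freed = 0.0
    for name, cpus in revocable_tasks:
        if freed >= needed_cpus:
            break
        if not limiter.allow(now):
            break  # out of budget: the caller retries on a later tick
        revoked.append(name)
        freed += cpus
    return revoked, freed
```

With `rate=1.0` and `burst=2`, a request to free three CPUs from three one-CPU tasks revokes only two tasks immediately; the third becomes eligible one second later.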
M3 metrics platform
~5B time series
~10M metrics/sec
Changing services, metrics, infrastructure, ...
Monitoring
Rule based alert generators
Git based review and update
Measure oncall quality
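A rule-based alert generator can be as small as a list of declarative rules evaluated against recent samples; keeping the rules as plain data is what makes Git-based review practical. The sketch below is an assumption about the general shape, not the M3 or Uber alerting API. The `Rule` fields and metric names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str        # name of the time series to watch
    threshold: float   # alert when samples exceed this value
    for_points: int    # consecutive samples required (assumed >= 1)
    severity: str


def evaluate(rule, samples):
    """Return True if the last `for_points` samples all exceed the threshold."""
    if len(samples) < rule.for_points:
        return False
    return all(v > rule.threshold for v in samples[-rule.for_points:])


def generate_alerts(rules, series):
    """Evaluate every rule against `series`, a dict of metric name to samples."""
    alerts = []
    for rule in rules:
        if evaluate(rule, series.get(rule.metric, [])):
            alerts.append((rule.severity, rule.metric))
    return alerts
```

Requiring several consecutive samples above the threshold is one simple way to keep a single transient spike from paging anyone.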
Hardware and software tend to have faults
100M+ alerts per month across Uber stack
Many faults are transient/temporary
Remediate
Smart alert prioritization
Automate manual tasks - reboot, restart, ...
SLA aware remediation
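To make the remediation points concrete, here is a hedged sketch of two pieces: ordering alerts so the most severe and widespread are handled first, and batching automated restarts so serving capacity never drops below a floor derived from the SLA. The severity names, the alert fields, and both function names are hypothetical; the slides do not describe the actual policy.

```python
# Lower rank is handled first. The severity names are illustrative.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2}


def prioritize(alerts):
    """Order alerts by severity, then by number of affected hosts (descending)."""
    return sorted(alerts, key=lambda a: (SEVERITY_RANK[a["severity"]], -a["hosts"]))


def plan_restart_batches(flagged, serving_count, min_serving):
    """Split flagged instances into restart batches that respect the SLA floor.

    Assumes every flagged instance is currently serving and that a restart
    temporarily removes it from service. Returns an empty plan when there is
    no headroom, in which case a human should be paged instead.
    """
    batch_size = serving_count - min_serving
    if batch_size <= 0:
        return []
    return [flagged[i:i + batch_size] for i in range(0, len(flagged), batch_size)]
```

For example, with 10 serving instances and a floor of 8, three flagged instances are restarted as one batch of two followed by one batch of one.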
Provision
Deploy
Config
Scale
Update
Detect
Recovery
Remedy
Auto-
Standards based innovation
Layercake architecture
Avoid cyclic dependencies
Avoid cascading failures while designing
Incremental deployments - code and config
Test often … including production
Add guardrails to automation
Design for understandability
Learnings
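"Avoid cyclic dependencies" is a property that can be checked mechanically once the service dependency graph is available as data. A minimal sketch using depth-first search, assuming the graph is a dict from service name to the services it calls; the service names in the usage note are made up.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of service names, or None.

    `deps` maps each service to the services it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # `nxt` is already on the current path: a back edge closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

Calling `find_cycle({"api": ["trips"], "trips": ["pricing"], "pricing": ["api"]})` returns `["api", "trips", "pricing", "api"]`, while a strictly layered graph returns `None`. A check like this could run in CI as one of the guardrails on automation; the recursive form is fine for a sketch but would need an explicit stack for very deep graphs.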
Larger systems
Bigger impact of change
Scale
Larger teams
Less each person knows
Our understanding of systems
breaks more often than
actual systems do
Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information
to any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
We are hiring!
www.uber.com/careers/

Scaling DevOps of Microservices at Uber (Code Conf 2018)