Placing a container on a train at 200mph
Casper S. Jensen
Software Engineer, Uber
About Me
● Joined Uber January 2015,
Compute Platform
Aarhus office, Denmark
● PhD, CS
On a completely unrelated topic
● Linux aficionado
● Docker “user” since February
About UBER
Why all the fuss?
The UBER app
339 Cities
61 Countries
2,000,000+ Trips/day
4000+ Employees
Not that hard...
You just have to handle
● 24/7 availability across the globe
● Very different markets
● 1000s of developers and teams
● Adding new features like there’s no tomorrow
UberPOOL, UberKITTEN, UberICECREAM, UberEATS,
UberWHATEVERYOUCANIMAGINE
● Hypergrowth in all dimensions
● Datacenters, servers, infrastructure, etc
Basically, you have to make magic happen every time a user
opens the application
Software
Development
The old UBER way
A fair amount of frustration
1) Write service RFC
2) Wait for feedback
3) Do all necessary scaffolding by hand
4) Start developing your service
5) Wait for infra team to write service scaffolding
6) Wait for IT to allocate servers
7) Wait for infra team to provision servers
8) Deploy to development servers and test
9) Deploy to production
10) Monitor and iterate
Steps 5–7 could take days or weeks...
It's just not scalable
But you have to start somewhere
“Make it easier for service
owners to manage their local
service environments.”
—Internal e-mail, February 2015
New development process
1) Write service RFC
2) Wait for feedback
3) Do all necessary scaffolding using tools
4) Start developing your service
5) Deploy to development servers and test
6) Deploy to production
7) Monitor and iterate
No silver bullets
All the things you did not consider
● Routing
● Dynamic service discovery
● Deployment
● Placement engine
● Logging and tracing
● Dual build environments
● Handling of secrets
● Security updates
● Private repositories
● Replicating images across multiple datacenters
Also, how much freedom do you really want to give your developers?
Change all the things!
Let's go through some examples
uDeploy
● Rolling upgrades
● Automatic rollbacks on failure
● Health checks, stats, exceptions
○ Load and system tests
● Service building
● Build replication
● 4,000+ upgrades/week
● 3,000+ builds/week
● 300+ rollbacks/week
● 600+ managed services
Our in-house deployment/cluster
management system (the rolling-upgrade loop is sketched below)
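uDeploy's internals are not public, so the following is only a minimal sketch of the pattern the bullets above describe: upgrade in batches, check health, and roll everything back on failure. The helper functions are stand-ins, not Uber APIs.

"""Minimal sketch of a rolling upgrade with automatic rollback on failed
health checks. deploy_instance, health_check and rollback_instance are
illustrative stand-ins, not uDeploy functions."""

def deploy_instance(instance, build):
    print(f"deploying {build} to {instance}")

def rollback_instance(instance):
    print(f"rolling back {instance}")

def health_check(instance):
    # Stand-in: a real check would hit the service's health endpoint.
    return True

def rolling_upgrade(instances, new_build, batch_size=2):
    upgraded = []
    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]
        for instance in batch:
            deploy_instance(instance, new_build)
            upgraded.append(instance)
        if not all(health_check(inst) for inst in batch):
            # Any unhealthy instance aborts the rollout and reverts
            # everything that was already upgraded.
            for inst in upgraded:
                rollback_instance(inst)
            return False
    return True

rolling_upgrade(["host1", "host2", "host3", "host4"], "test-uber-service:42")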
Moving to docker with zero downtime
Build multiplexing
We want to keep on trucking while migrating to docker
(one possible shape of the build fan-out is sketched below)
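The slides don't spell out how build multiplexing works, so this is only a sketch of the general idea under one assumption: during the migration every build is fanned out to both the legacy pipeline and the Docker pipeline, so a service can be deployed either way with no downtime. The function names and registry host are illustrative.

"""Sketch of build multiplexing during the docker migration. The pipeline
functions and "registry.local" are hypothetical, not uDeploy internals."""

def build_legacy(service, revision):
    print(f"legacy build of {service}@{revision}")
    return f"{service}-{revision}.tar.gz"

def build_docker(service, revision):
    print(f"docker build of {service}@{revision}")
    return f"registry.local/{service}:{revision}"

def multiplexed_build(service, revision):
    # Produce both artifacts; deploys pick whichever flavour the service
    # is currently configured to run as.
    return {
        "legacy": build_legacy(service, revision),
        "docker": build_docker(service, revision),
    }

print(multiplexed_build("test-uber-service", "abc123"))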
Build process & scaffolding
Declarative build scripts
● Service configuration in git
● Preset service frameworks
● Many options
● Generator creating (a toy version is sketched after the example config)
○ Dockerfile
○ Health checks
○ Entry point scripts inside container
○ In general, all glue between host and service
● Possible to supply custom Dockerfile
service_name: test-uber-service
owning_team: udeploy
backend_port: 123
frontend_port: 456
service_type: clay_wheel
clay_wheel:
  celeries:
    - queue: test-uber-service
  has_celerybeat: true
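The real generator is not public; this toy version only illustrates the idea of turning the declarative config above into a Dockerfile. The base image naming scheme (uber/<service_type>-base) and the entry point path are assumptions.

"""Toy sketch of the declarative-build generator: read the service config
and emit a Dockerfile. Everything not shown on the slide is illustrative."""

import yaml  # pip install pyyaml

CONFIG = """
service_name: test-uber-service
owning_team: udeploy
backend_port: 123
frontend_port: 456
service_type: clay_wheel
"""

def generate_dockerfile(cfg):
    # "uber/clay_wheel-base" is a hypothetical per-service_type base image.
    return "\n".join([
        f"FROM uber/{cfg['service_type']}-base",
        "COPY . /app",
        "WORKDIR /app",
        f"EXPOSE {cfg['backend_port']}",
        'ENTRYPOINT ["/app/entrypoint.sh"]',
    ])

cfg = yaml.safe_load(CONFIG)
print(generate_dockerfile(cfg))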
Image replication
● Multiple datacenters
● Images must be stored within DCs
● Build once, replicate everywhere
● Traffic restrictions, push but not pull
Current setup
● Stock docker registry
● File back-end
● Docker-mover
● Syncing images using pull/push
● Use notification API to speed up replication (a minimal mover loop is sketched below)
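Uber's docker-mover is not public; this is a minimal sketch of the approach the bullets describe: react to push notifications from the source registry, then replicate by pulling, re-tagging and pushing to registries in the other datacenters. The webhook payload layout and the registry hostnames are assumptions.

"""Minimal docker-mover sketch: replicate pushed images across DCs using
pull/tag/push, triggered by registry notifications. Hostnames and the
notification payload shape are assumptions, not Uber's setup."""

import subprocess
from flask import Flask, request  # pip install flask

SOURCE_REGISTRY = "registry.dc1.local:5000"      # hypothetical
REMOTE_REGISTRIES = ["registry.dc2.local:5000"]  # hypothetical

app = Flask(__name__)

def replicate(repository, tag):
    src = f"{SOURCE_REGISTRY}/{repository}:{tag}"
    subprocess.check_call(["docker", "pull", src])
    for remote in REMOTE_REGISTRIES:
        dst = f"{remote}/{repository}:{tag}"
        subprocess.check_call(["docker", "tag", src, dst])
        subprocess.check_call(["docker", "push", dst])

@app.route("/registry-events", methods=["POST"])
def registry_events():
    # Assumes each push event carries the repository and tag of the
    # pushed manifest.
    payload = request.get_json() or {}
    for event in payload.get("events", []):
        if event.get("action") == "push" and "tag" in event.get("target", {}):
            replicate(event["target"]["repository"], event["target"]["tag"])
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)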
Service discovery & routing
● Previously, we used HAProxy + scripts to do this
● Now, we use Hyperbahn + TChannel RPC
https://github.com/uber/{hyperbahn|tchannel}
○ Used for docker and legacy services
○ Required in order to move containers around in seconds
○ Dynamic routing, circuit breaking, retries, rate limiting, load balancing
○ Completely dynamic, no fixed ports
(a toy circuit-breaker illustration follows below)
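Hyperbahn and TChannel provide these behaviours inside the routing mesh; the snippet below is not their API, just a toy illustration of one feature from the list (circuit breaking): stop sending traffic to an unhealthy backend for a cool-down period after repeated failures.

"""Toy circuit breaker, illustrating one routing-layer feature listed above.
Not the TChannel/Hyperbahn API."""

import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.reset_after:
            # Half-open: let one request through to probe the backend.
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()

breaker = CircuitBreaker()
for ok in [False, False, False, True]:
    if breaker.allow():
        breaker.record(ok)
print("circuit open:", not breaker.allow())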
Key Take-Aways
The good & the bad
The good:
● Remove team dependencies
● More freedom
● Not tied to specific frameworks or versions (hi, Python 3)
● Easy to experiment with new technologies
The bad:
● Too much freedom
● Non-trivial integration with a large running system
● Infrastructure must be dynamic throughout
● Containers are only a minor part of the infrastructure, don't forget that
Current and future wins
● Today, 30% of all services in docker
● Soon-ish, 100%
● Great improvements in provisioning time (done)
● Framework and service owners can manage their own
environment (done)
● Faster and automatic scaling of capacity (in progress)
Thank you!
Casper S. Jensen
caspersj@uber.com
